[March Week 1] System engineering problem

You work on a team that deploys IoT devices around the world for data collection. These devices have very limited disk, memory and CPU capacity. You have about 5 million devices deployed, and from time to time you deploy new patch updates to the software running on these devices as much as twice a day. You maintain only a single backend machine that collects the data from all 5 million devices, and that machine has limited network and processing capacity. From time to time, some of these devices go offline for as long as a month. The data they collect is also time-sensitive: it needs to be reported to the backend within seconds for additional processing, otherwise it becomes useless.

What are some of the issues that could come up with such a system, and how would you design around them?

  1. One backend machine is a bottleneck and a single point of failure. If that machine goes down, the IoT devices will have nowhere to send their data. It would be better to have multiple backend machines with a couple of proxies in front of them distributing work to the backends. These proxies would also check each backend's load and liveness.

  2. Since the data sent from the IoT devices is time-sensitive, it would be a good idea to deploy several of them (for example, 3) per location, so that a rolling update strategy can be used during the update process. This prevents loss of data for a particular location, as the other devices keep reading and sending data to the backend while one is being upgraded.

  3. To buffer load on the backend, a message queue can be used to receive messages, smoothing out the work that a backend server has to perform at any given time.
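To make the proxy idea in point 1 a bit more concrete, here is a minimal sketch of how a proxy might choose a backend based on liveness and load. The `Backend` fields, the backend names and the "fewest in-flight requests" rule are all assumptions for illustration, not something specified in the problem.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Backend:
    name: str        # hypothetical identifier for a backend machine
    healthy: bool    # result of the proxy's most recent liveness check
    in_flight: int   # requests the backend is currently handling


def pick_backend(backends: list[Backend]) -> Optional[Backend]:
    """Return the healthy backend with the fewest in-flight requests, or None."""
    live = [b for b in backends if b.healthy]
    if not live:
        return None  # nothing is alive, the caller has to retry or drop
    return min(live, key=lambda b: b.in_flight)


pool = [
    Backend("backend-a", healthy=True, in_flight=12),
    Backend("backend-b", healthy=False, in_flight=0),
    Backend("backend-c", healthy=True, in_flight=4),
]
chosen = pick_backend(pool)
print(chosen.name if chosen else "no healthy backend")  # prints "backend-c"
```

Note that `backend-b` is skipped even though it looks idle, because it failed its liveness check.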

  1. Obviously, a single backend is a problem constraint, not a point of the solution.
  2. Having 3 IoT devices at a single point increases traffic to the backend by 3x and cost by 3x, and deploying another 10 million devices is a logistical nightmare. The point is: how do we design a system with the existing infrastructure?
  3. What problem does having a queue solve? This still has the TTL issue, because if a message sits in a queue for more than the allotted seconds, it still becomes useless.
  1. For the cost increase, I agree. The deployment to the IoT devices should be automated.
  2. Well, it's better to have a delayed message than none at all. At least it would show you by how much your backend is falling behind in processing the messages sent in by the devices.
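Both sides of the queue argument can be shown in one small sketch: a consumer that drops readings older than the TTL (the objection) while still recording how old the oldest reading was (the "how far behind are we" signal). The `Reading` type, the 2-second TTL and the timestamps are made up for illustration, and the sketch assumes device and backend clocks are roughly in sync.

```python
from collections import deque
from typing import NamedTuple


class Reading(NamedTuple):
    device_id: str
    sent_at: float   # seconds; time the device sent the reading
    payload: str


def drain(queue: deque, now: float, ttl_seconds: float):
    """Empty the queue, splitting readings into fresh and expired.

    Returns (fresh, expired, max_lag), where max_lag is the age in seconds
    of the oldest reading seen: a rough measure of how far behind we are.
    """
    fresh, expired, max_lag = [], [], 0.0
    while queue:
        reading = queue.popleft()
        age = now - reading.sent_at
        max_lag = max(max_lag, age)
        (fresh if age <= ttl_seconds else expired).append(reading)
    return fresh, expired, max_lag


q = deque([
    Reading("dev-1", sent_at=100.0, payload="a"),
    Reading("dev-2", sent_at=103.5, payload="b"),
    Reading("dev-3", sent_at=104.0, payload="c"),
])
fresh, expired, lag = drain(q, now=105.0, ttl_seconds=2.0)
print([r.device_id for r in fresh])    # ['dev-2', 'dev-3']
print([r.device_id for r in expired])  # ['dev-1']
print(lag)                             # 5.0
```

The expired reading is still useless for processing, but its age is what tells you the backend is 5 seconds behind.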

How would we automate the deployment?

Of what use is this info?

This will be useful for capacity planning.

We can use an Ansible playbook for that, or?

Capacity planning for what? How will that help with the current problem at hand?

An IoT device is a hardware device. Can we deploy a hardware device using an Ansible playbook? Unless I misunderstood you.

Sorry, I think I am misunderstanding you. “from time to time you deploy new patch updates to the software running on these devices as much as twice a day” I thought this meant the software running on the hardware. I was referring to the software on the IoT devices, not the devices themselves.

Got you, we are only interested in the current problem.

How will we deploy to 5 million devices using an Ansible playbook?
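For a sense of scale behind that question, here is a rough back-of-the-envelope calculation for any push-based rollout, whatever the tool. The device count and the twice-a-day cadence come from the problem statement; the 30 seconds per device is purely an assumption, and offline devices are ignored.

```python
DEVICES = 5_000_000              # from the problem statement
UPDATES_PER_DAY = 2              # "as much as twice a day"
SECONDS_PER_DAY = 24 * 60 * 60

# Assumption: pushing one patch to one device takes about 30 seconds.
SECONDS_PER_DEVICE = 30

window = SECONDS_PER_DAY / UPDATES_PER_DAY              # time available per rollout
devices_per_second = DEVICES / window                   # required completion rate
parallel_sessions = devices_per_second * SECONDS_PER_DEVICE

print(f"window per rollout: {window:.0f} s")                    # 43200 s
print(f"required rate: {devices_per_second:.1f} devices/s")     # 115.7 devices/s
print(f"sessions needed in parallel: {parallel_sessions:.0f}")  # 3472
```

So under these assumptions, each rollout has to finish about 116 devices every second for 12 hours straight, with roughly 3,500 update sessions open at any moment.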