Choosing the Right IoT Database
The nature of IoT systems
Finding the right database for an Internet of Things (IoT) system can be difficult. These systems tend to span both physical and digital environments, and IoT devices continuously output massive quantities of time series data. In this article, we'll describe the architecture of an IoT application, cover the features a database needs to be suited for IoT, and then go through some of the most popular IoT databases.
Overview of a standard IoT architecture
In a typical IoT architecture, there are three main groups of components: the IoT devices themselves (sensors and actuators); the edge servers in the so-called "fog"; and the data center, which is often cloud-based. The sensors send their data to the closest edge servers, and those edge servers process that data, and perhaps also transform and analyze it, before sending it to the data center for storage.
Because a database often runs both in the fog and in the cloud, we'll go into more detail on why each is used for IoT and on the traits this setup requires in a database.
Cloud computing refers to delivering computing services, such as storage and processing, over the internet from a distributed network of data centers. It encompasses many areas, including several types of hosted services (SaaS, PaaS, IaaS, etc.). In an IoT application, it allows for increased scaling, decreased expense, and improved efficiency.
Cloud-computing-based IoT solutions scale faster because they do not require setting up physical servers when extra capacity is needed, so they can grow as quickly as needed while using only as much capacity as is needed right now. These solutions are also less expensive, because you pay only for the compute you use and do not incur the costs of operating a physical data center. They are also able to deliver resources more quickly to developers and users.
Fog computing, a term often used interchangeably with edge computing, is the process of extending cloud computing to the network edge. It improves processing efficiency by allowing a significant amount of computing to happen at the edge.
Say that we have an Industrial Internet use case, with several different factories that each contain hundreds of connected machines. If an edge device reports something wrong, the alerting and handling should occur as soon as possible: it should be handled by the specific factory where the problem happened, not be sent over the internet to a central server first.
This is the main benefit of fog computing: it reduces the latency in making decisions by bringing processing closer to the IoT sensors that are producing the data.
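The factory scenario above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Reading` type, the `MAX_TEMPERATURE_C` threshold, and the `handle_at_edge` function are hypothetical names, not part of any particular IoT platform. The point it shows is that the alert is raised locally, while the reading is merely queued for a later upload to the data center.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    factory: str
    machine: str
    temperature_c: float


# Hypothetical alert threshold, chosen only for this illustration.
MAX_TEMPERATURE_C = 90.0


def handle_at_edge(reading, local_alerts, cloud_outbox):
    """Raise an alert locally if needed, then queue the reading for upload."""
    if reading.temperature_c > MAX_TEMPERATURE_C:
        # The decision is made in the factory, with no round trip to the cloud.
        local_alerts.append(
            f"{reading.factory}/{reading.machine}: "
            f"overheating at {reading.temperature_c:.1f} C"
        )
    # Every reading is still forwarded later for long-term storage.
    cloud_outbox.append(reading)


alerts, outbox = [], []
handle_at_edge(Reading("factory-a", "press-7", 95.2), alerts, outbox)
handle_at_edge(Reading("factory-a", "press-8", 61.0), alerts, outbox)
```

In this sketch only the first reading triggers an alert, but both readings end up in the outbox destined for the data center.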
Requirements for an IoT database
A database used for IoT needs to have certain characteristics to ensure it works properly on the edge as well as in the cloud.
Requirements for the edge servers
Say we have an IoT setup with a few sensors (let's call them 1, 2, and 3) streaming data to an edge server, with a batch of sensor data coming in every ten seconds. Sensors 1 and 2 output their data and it reaches the server promptly, but the data from sensor 3 is delayed in transit for 20 seconds. During that time, sensor 3 has produced and sent two more batches, so its delayed data arrives all at once.
In order to handle this abrupt pileup of data, edge servers need to support extremely fast write operations. Otherwise, data will be lost whenever there is significant latency in data transmission. Therefore, a database that runs on an IoT edge server needs a very high ingest rate: not only enough to collect the data in real time, but enough to do so even when some data arrives in bursts.
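To make the pileup concrete, here is a small Python simulation of the three-sensor scenario. It assumes a simplified model that is not taken from any real system: each batch either arrives the moment it is sent or, if the link is stalled, at the moment the stall ends. The function name `arrival_time` and the chosen send times are illustrative only.

```python
from collections import Counter


def arrival_time(sent_at, stall_until):
    """A batch sent during a stall arrives only once the stall ends."""
    return max(sent_at, stall_until)


# One batch per sensor every ten seconds.
send_times = [0, 10, 20, 30]

# Sensors 1 and 2 have no delay; sensor 3's link is stalled until t = 20.
stall_until = {"sensor-1": 0, "sensor-2": 0, "sensor-3": 20}

# Count how many batches reach the edge server at each instant.
batches_per_instant = Counter(
    arrival_time(t, stall)
    for stall in stall_until.values()
    for t in send_times
)

steady_state = len(stall_until)            # 3 batches per interval
peak = max(batches_per_instant.values())   # 5 batches at t = 20
```

Under these assumptions the server sees two batches at t = 0 and t = 10, then five at t = 20, when sensor 3's three delayed batches land together with the on-time batches from sensors 1 and 2. The ingest rate therefore has to be sized for the peak rather than for the steady state of three batches per interval.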
In addition to quick write times, edge servers also require fast reads and tools for analytics. In most applications of any significant size, IoT data doesn't pass from the sensor all the way to the cloud for analytics. Rather, some of the transformation, classification, and aggregation is done at the edge. This allows the edge itself to make decisions in real time.
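As a rough illustration of aggregation at the edge, the following Python sketch reduces a window of raw readings to one summary per sensor before anything is forwarded. The `summarize` function and the `(sensor_id, value)` tuple layout are assumptions made for this example, not a prescribed format.

```python
from collections import defaultdict
from statistics import mean


def summarize(readings):
    """Collapse raw (sensor_id, value) readings into per-sensor summaries."""
    by_sensor = defaultdict(list)
    for sensor_id, value in readings:
        by_sensor[sensor_id].append(value)
    return {
        sensor_id: {
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "mean": mean(values),
        }
        for sensor_id, values in by_sensor.items()
    }


raw = [("sensor-1", 20.0), ("sensor-1", 22.0), ("sensor-2", 18.5)]
summary = summarize(raw)
```

The edge server can act on `summary` immediately and send only these compact summaries upstream, rather than every raw reading.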
Requirements for the cloud data center
The first requirement for the cloud data center is to collect the data coming in from the edge servers, transform that data further as necessary, and analyze it. In order to do this effectively, three things are needed: commands for analyses and computations, built-in downsampling, and an appropriate retention policy.
The database management system itself should have built-in analysis commands instead of delegating that task to a specialized system, because the more separate databases and tools are involved, the higher the overhead of keeping the system operational.
Downsampling and a retention policy are necessary for the same reason: to make it easy to quickly query a long period of historical data. Automatic downsampling is necessary to ensure high-precision data is only kept for a short time while less precise data is kept around for a longer period to inform seasonality and other trends. Implementing a retention policy means data will be automatically deleted after a certain period, freeing up space for new data.
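The two ideas can be sketched in plain Python. This is a minimal illustration, not how any specific database implements them: the `downsample` and `apply_retention` functions, the use of integer timestamps in seconds, and the choice of averaging as the aggregate are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean


def downsample(points, bucket_s):
    """Average (timestamp, value) points into fixed-width time buckets."""
    buckets = defaultdict(list)
    for ts, value in points:
        # Align each timestamp to the start of its bucket.
        buckets[ts - ts % bucket_s].append(value)
    return sorted((start, mean(values)) for start, values in buckets.items())


def apply_retention(points, now, max_age_s):
    """Keep only the points that are no older than max_age_s."""
    return [(ts, value) for ts, value in points if now - ts <= max_age_s]


raw = [(0, 10.0), (10, 12.0), (20, 14.0), (60, 20.0), (70, 22.0)]

# Low-precision copy, kept for the long term: one averaged point per minute.
per_minute = downsample(raw, bucket_s=60)

# High-precision data, kept only for the last 30 seconds.
recent_raw = apply_retention(raw, now=80, max_age_s=30)
```

Here `per_minute` holds two averaged points in place of five raw ones, and `recent_raw` keeps only the two newest raw points. A database with built-in downsampling and retention runs the equivalent of both steps automatically on a schedule.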
In addition to these, we will also need a visualization engine of some type to display the state of our IoT system, and the ability to publish and subscribe.
Popular IoT databases comparison
Because IoT data is fundamentally time series data, the requirements for an IoT database are very similar to those of any time series database. It needs to write data in real time, compress it, store it efficiently, downsample it as appropriate, and query it quickly.
There are a variety of databases commonly used for IoT: among them, some relational databases like PostgreSQL, many NoSQL databases like MongoDB, Cassandra, and InfluxDB, and specialized IoT solutions like Azure IoT. Which among these is the best for IoT?
SQL databases have the advantage of stability and history, but the disadvantage of not being built to handle data at the volume and velocity that almost all IoT systems produce. They can work well for small, personal IoT projects, but are not a great choice for a system where high performance is needed.
InfluxDB, MongoDB, and Cassandra are on a more even footing from a performance and features standpoint, but even so, there is a clear winner in terms of write throughput, query throughput, and on-disk compression: according to the InfluxDB white papers, InfluxDB outperforms both MongoDB and Cassandra by over 2x in data ingestion, over 2x in compression, and over 5x in query speed. For more information, download the InfluxDB white papers: InfluxDB vs MongoDB; InfluxDB vs Cassandra.