Best Practices: How to Analyze IoT Sensor Data with InfluxDB
Session date: Sep 13, 2022 08:00am (Pacific Time)
InfluxDB is the purpose-built time series platform. Its high ingest capability makes it perfect for collecting, storing and analyzing time-stamped data from sensors — down to the nanosecond. The InfluxDB platform has everything developers need: the data collection agent, the database, visualization tools, and data querying and scripting language. Join this webinar as Brian Gilmore provides a product overview; he will also deep-dive with some helpful tips and tricks. Stick around for a live demo and Q&A time.
Join this webinar as Brian Gilmore dives into:
- The basics of time series data and applications
- A platform overview: learn about InfluxDB, Telegraf, and Flux
- InfluxDB use case examples: start collecting data at the edge and use your preferred IoT protocol (e.g., MQTT)
Watch the Webinar
Watch the webinar “Best Practices: How to Analyze IoT Sensor Data with InfluxDB” by filling out the form and clicking on the Watch Webinar button on the right. This will open the recording.
[et_pb_toggle _builder_version=”3.17.6” title=”Transcript” title_font_size=”26” border_width_all=”0px” border_width_bottom=”1px” module_class=”transcript-toggle” closed_toggle_background_color=”rgba(255,255,255,0)”]
Here is an unedited transcript of the webinar “Best Practices: How to Analyze IoT Sensor Data with InfluxDB”. This is provided for those who prefer to read rather than watch the webinar. Please note that the transcript is raw. We apologize for any transcribing errors.
- Caitlin Croft: Senior Manager, Customer and Community Marketing, InfluxData
- Brian Gilmore: Director of IoT and Emerging Technology, InfluxData
Caitlin Croft: 00:00:00.258 Hello everyone and welcome to today’s webinar. My name is Caitlin and I’m joined today by Brian Gilmore, who will be talking about best practices and how to analyze IoT sensor data with InfluxDB. This session is being recorded and will be made available tomorrow. Please put any questions you may have for Brian in the Q&A or the chat. I’ll be monitoring both. Don’t be shy. Brian loves chatting with you guys. So any of your questions he’ll be more than happy to answer. And I just want to remind everyone to please be kind and respectful to all attendees and speakers. We want to make sure this is a fun, safe, happy place for our InfluxDB community. And without further ado, I’m going to hand things off to Brian.
Brian Gilmore: 00:00:48.625 Awesome. Thanks, Caitlin. Yeah. And thanks to everybody for joining. I really appreciate y’all coming in from all around the world. And today we’re going to talk a little bit about best practices for IoT sensor data. And I think the simplest way to sort of frame that is, using InfluxDB, in my experience, is just the best practice to begin with. I was talking to Caitlin earlier and pulling some of this stuff together, especially for the demo part of this. I really wish I had InfluxDB 10 or 12 years ago. My background: I came from the industrial automation space, and then I worked for a company called Splunk for eight years, helping them sort of solve the logs and unstructured alarm data issues in IoT. And now I’ve been happy to join InfluxData and have been working with the InfluxDB team for a little over a year now, just sort of doing some of the similar things but now for, of course, the highly dynamic world of metrics. And as you guys see some of the exciting stuff coming out around the InfluxDays talks, I think you’ll see we’re going to go a little bit past that pretty soon here. So you should be seeing some exciting stuff there.
Brian Gilmore: 00:01:59.076 So quick agenda. And we do have a lot to cover today, so Caitlin’s going to keep me honest and keep me on track. I do want to get a demo in at the end. I want to show you guys ideally how really easy it is for this MQTT thing that we launched recently. But first, we’ll just talk sort of about time series data applications overall. This is really not specific to IoT, but it is absolutely applicable to IoT. Again, the platform itself is just such a good fit. We’ll talk about that platform of the database, the agents, and then, of course, our scripting language of Flux. And then we’ll talk about sensor data at the edge and then also sensor data in the cloud through our new MQTT Native Collector. And I think what we really are starting to see there is a pretty cool sort of vertical stack emerging where you can have InfluxDB sort of staged throughout your edges and your data centers and your clouds. And we’re getting to the point that you can kind of unify them all in a way that’s going to be very compelling for the solutions that you guys are building for your own stuff or for your employers or if you’re starting a company related in IoT. The tech is really pretty well established for you to not make any of the false starts and mistakes that I had to make 10, 12 years ago. So excited to be here.
Brian Gilmore: 00:03:25.284 Like I said, I started out in the industrial data space. I worked in smart buildings for a while. Before that, I was an aquarist, but largely because of the cool technical stuff I got to do there with the automation and control systems. I’ve been passionate my entire career about using data and data platforms to inform everybody. And I know that sort of picked up an interesting categorization as the democratization of data. And I think that can be a little bit of a loaded term. But generally, I think everybody should have access to the information they need to do their jobs effectively and safely. I’m super passionate about diversity and inclusion if you ever want to chat about that or any of the concepts, issues there. I’m not a traditional tech worker, coming in from that industrial side of the house. And it is truly a meritocracy. And for anybody who’s looking to make the leap from a more front-line applications perspective to a software vendor, I’m always happy to help you sort of navigate some of that, so just let me know.
Brian Gilmore: 00:04:33.716 So overall, why does anybody build a time series application? And I think the time series application term is actually something that started probably coming out about five or six years ago, but it’s really starting to come to the mainstream now. But I think it’s largely about accessing information about the operations of assets that you do not have eyes on all the time. Those assets can be anything from robotics to sensors, automation and control systems, data systems in oil and gas, packaged manufacturing pallets, and, excuse me, in manufacturing, plant floor equipment, all of that. You can’t be everywhere. So having sort of a network and operations in place to actively collect data and put it in front of the people who need to know what’s going on in as real-time a manner as possible is super important. But visibility doesn’t always do it, right? I mean, it’s nice to know that 15 minutes before the thing blew up you saw a spike in temperature. But it would be much better to be able to understand what the root cause of that was. Do some forensics to understand: was there anything that you could prevent? And ultimately, use that data in as sort of historical a manner as possible to really understand how you can improve the performance of whatever it is, whether it’s a server farm or microservices or a fleet of robotics. And then also making sure that the uptime and availability of that technology is always there. And of course, it’s all being deployed both secure from a cyber perspective and also secure and safe from a human’s perspective. And then, of course, what you want to be able to do is use all that data in the past to, maybe, take action. To alert people, to predict, to understand the impact of operations on revenue streams or customer satisfaction, or whatever it might be. 
And if you can sort of build a platform that can work all the way down there, at that raw collection sort of edge, and then move that data all the way through so that it’s, just in time or in real time or whatever you want to call it, in the hands of the people who need that information to improve the business, that’s sort of the holy grail of applications, and that’s what we’re working together as a team to build.
Brian Gilmore: 00:07:09.673 Now, these applications are more than InfluxDB, let’s be honest, right? I mean, I think you have a tremendous number of assets, like I said, whether they’re servers or robots. You’re collecting data usually in one of two manners. One is pushing, like just literally telling the application on the device where it needs to push its data to. And of course, we’ve got options there, both with Telegraf and with our own APIs and client libraries, things like that. And then there’s also polling, which is setting up an agent like Telegraf, or as you’ll see here with our new subscriptions model, actually configuring InfluxDB to reach out and ask, “Hey, what’s the temperature? What’s the pressure? What’s the altitude, or whatever it might be, right now?” And then once you get that data, I mean, I think most people here are familiar with line protocol. We’ll cover that here in a second in more detail. But generally, whether it’s converting it to line protocol or whether it’s breaking a multi-value event into individual lines or whatever it might be, there’s going to be some event processing you need to do there. You’re going to need to detect and recognize the information that you want to store in a time series manner. You might want to transform it in a way. Normalize decimal points. Do whatever it is you want to do. And then you’re also going to want to do some enrichment. Maybe capturing information from other systems to sort of attach to it as metadata. And then finally, you’re going to want to decide where you want to send it. Now, Telegraf is great at doing all of that. And for those of you who are familiar with the processor plugins, you can do basically all of that in a very cool, dynamic way. And then ultimately, whether you’re using Telegraf or any other sort of event processing framework before InfluxDB, then you get that data into the database. 
And you want to have access to both the real-time information, the nearest to now information, but then also that historical data.
Brian Gilmore: 00:09:16.682 Now all of that together, of course, you’re going to want to take some action. And so typically what’s going to happen is that information is going to be processed, and it’s going to be pushed back typically in a supervisory manner. I would hesitate to think about a data platform like InfluxDB or any of these other time series or otherwise as an actual piece of the automation framework. Because automation frameworks, I mean, they just work in a different manner. Real-time for them is different than real-time in the database world. And it’s also oftentimes life safety-critical. And you can’t get the same sort of fundamental reliability with a highly distributed platform like a time series database as you could by just doing basic ladder logic in a PLC and making decisions and making changes to open and close valves on a millisecond basis. It’s just not there. So we like to think of it as sort of part of the supervisory control. Overall, how do you need to tweak the speeds, or how do you need to change the temperature over time so that the entire system operates more effectively? Oftentimes too folks are starting to work machine learning and all of that back into their workflows now. And so part of that automation is passing machine learning back down to the edge so that the devices there can use it to detect those anomalies in true real-time on that millisecond basis. And then you want to have a sort of layer that sort of surfaces all of that and allows other developers or use developers to plug in customer-facing apps. Maybe it’s an asset tracking app or maybe it’s a life safety app or maybe you have partners who have developed augmented reality or training programs that could plug in to actually use some of this data as well. 
And then of course you’re also going to have the stuff that you guys just want to build because Joe in manufacturing needs something that’s going to let him know on his mobile phone when the system’s done processing or whatever it might be. So while the database itself in InfluxDB largely would cover sort of that database block, the APIs, of course, and then some of the automation you can handle with tasks, there’s also going to be other applications that honestly you would probably want to include. And that’s why we’re working so hard in building a great IoT and industrial IoT ecosystem.
Brian Gilmore: 00:11:55.892 Now for the time series piece, for anybody who’s not familiar, this is really probably the most important and rapidly growing sort of segment of data today. And this is, of course, a sensor best practice or sensor data thing. I think sensors are the most obvious in terms of the category of time series data. If you think about a temperature sensor, you always see it: right now, you know it’s 70.9 degrees Fahrenheit. That’s fine. But what was it five minutes ago? And as soon as you start to keep track of those five minutes ago or one minute ago or one nanosecond ago, you’re going to start to see that sort of up and down line chart. That time series data. Every single one of those data points has a timestamp affiliated with it. InfluxDB is one of the very few platforms that actually supports nanoseconds there. And we do that because a lot of our scientific and high-performance computing customers (the aerospace companies that work with us, the supply systems, the systems that create the data and send it to InfluxDB) use nanoseconds. So we also support nanoseconds because we don’t want to be in the business of lopping off important precision, which we’ll cover in a second. But it’s not just sensor data like temperature. It could be stock prices. We have a lot of companies that are using us in cryptocurrency or financial tech, which is all about tracking the prices of different assets and commodities and making that available to users through mobile apps. Or if you’re monitoring servers, maybe it’s CPU and free memory. If you’re looking at a hospital and all of the little machines and things that are connected up to the hospital beds (heart rate, things like that), all of that is time series data. Logs and traces too, right? I mean, anything that has a timestamp, right? 
And while we’re sort of specialized in the top four bullet points there, again, I think keep your eyes out for some really cool stuff that’s going to be coming out at InfluxDays to sort of help you guys support the logs and traces use cases as well.
Brian Gilmore: 00:14:11.700 Now of time series data, there’s sort of three main categories. You kind of hear of these as the Big Three. And metrics is really where we lead the industry. I mean, this is numeric samples. Like I said, temperature over time. And InfluxDB is honestly the true technical and sort of adoption leader there. Now, events can oftentimes be like metrics. But instead of being regularly sampled, they’re sort of like an on-event. So if an alarm is set, what is the temperature when that alarm goes off? Or if there’s some action that’s being taken, like every time the trap snaps, what were the forces on the switch, or whatever it could be? And that’s also, of course, time series because you have to sort of encode both the value as well as the time of the activity that you were tracking. But generally, instead of looking like a very straight line or very normalized data points, it’s more [inaudible] because each point comes with sort of a random point in time. Again, oftentimes very important to keep down to the nanosecond. Events can also be text stuff. It can be logs, like an alarm. If you think about all of the sort of information that’s passed machine to machine in the logging, the file systems, all of that, any time there’s a timestamp, any time there’s a piece of information you need to say it happened at specifically this point in time, that’s an event too.
Brian Gilmore: 00:15:49.794 Now traces are actually interesting. I think there’s a lot of debate about what is really a trace. I think people really look at traces as highly related to subservices, microservices, and services in the IT world. I think in the IoT and industrial, there’s a lot of tracing going on as well. We typically think of them and call them transactions, which is sort of either a loosely or a formally bundled set of metrics and events related to some specific activity. If you’re thinking about launching a rocket, for example, you want to know every second of, of course, five, four, three, two, one, liftoff. But then you want to know altitude on a regular basis. You want to know airspeed on a regular basis. And keeping those metrics going over time is really, really important. Now the events might be something like each stage. Liftoff. First stage separation. Second stage fire. All of those different steps that occur, those are occurring because the altitude was X or because some other thing happened. And those particular steps, as they occur, are events. And the reason you want to keep both is you do want to know if the second stage fires before the first stage separates, as I think we got to see yesterday, with a pretty exciting boom that occurred on an unmanned launch. But the traces themselves are usually about a particular ID. Like the ID could be yesterday’s launch, right? And then the evaluation of those metrics and events into specific conditions and then the sort of sequence of those conditions as a bundle. Now, whether you call those transactions or you call those traces, those pieces of information, sort of consolidated bundles of a particular action or step or event, are very exciting. And again, we’re working really hard to be able to support those even more in the platform as we go forward.
Brian Gilmore: 00:17:58.253 Talk a little bit about timestamp precision. I want to leave some time for the demo, again, at the end. But I think that nanosecond precision is really important for a lot of you. I think for others it’s just kind of a novelty. But remember, when you’re talking about precision of timestamps, you want to maintain it, right? You don’t want to, through the processing and the transformation, movement of the data, to lop off trailing numbers. Now, if there’s zeros, it’s iffy. It’s okay. But if you have an actual significant digit anywhere within that and you just truncate to, say, turn nanoseconds into seconds or nanoseconds into milliseconds or whatever, you’re losing precision and you’re losing information. The reason you don’t want to do that is because eventually, you will want to effectively order those. And if one set has been truncated and one set hasn’t been truncated, it’s virtually impossible to say which happened before. If you lop them both to the second, you’re just going to get a straight line of, “All of these things happened somewhere in the last second.” And where did that occur? What was the sequence of events? Again, if you can’t understand the sequence of events, you might have the visibility. Your dashboards will all — the gauges will move nicely and in sync. But when you get into that sort of second phase of troubleshooting and forensics, as we talked about earlier, it’s rough because you don’t know that sequence of events. And therefore, you can’t do that forensics, that troubleshooting.
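To make the ordering problem concrete, here is a minimal Python sketch. It is not from the webinar: the event names and timestamps are hypothetical, and the helper functions are written only for this illustration.

```python
NS_PER_SECOND = 1_000_000_000


def truncate_to_seconds(ts_ns: int) -> int:
    """Zero out everything below one second, keeping nanosecond units."""
    return ts_ns - (ts_ns % NS_PER_SECOND)


def order_events(events: dict[str, int]) -> list[str]:
    """Return event names sorted by timestamp; ties keep insertion order."""
    return [name for name, _ in sorted(events.items(), key=lambda item: item[1])]


# Hypothetical events 400 microseconds apart, as epoch nanoseconds,
# listed in the order they happened to arrive.
events = {
    "pressure_spike": 1_663_081_200_000_650_000,
    "valve_closed": 1_663_081_200_000_250_000,
}

# Full precision recovers the real sequence: the valve closed first.
print(order_events(events))

# After truncation both timestamps are identical, so the sort can only
# fall back on arrival order and the true sequence is lost.
truncated = {name: truncate_to_seconds(ts) for name, ts in events.items()}
print(order_events(truncated))
```

With full precision the valve closure sorts ahead of the pressure spike; once both are truncated to the second they tie, which is the "straight line" of indistinguishable events described above.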
Brian Gilmore: 00:19:31.615 Granularity is another thing that we should consider, especially when we’re capturing and analyzing sensor data. And we won’t go so deep into the different theorems and all of that, but I think it’s important to consider it as sort of a loosely coupled component of precision. So when we are talking specifically about the metrics and we’re sampling, having a sample rate that allows you to effectively represent the waveform in the future is very important, right? Because you can very easily, with a sort of underthought sampling strategy, turn a waveform into a straight line, right? If you have a very periodic waveform and you’re sampling at exactly that period, you’re going to draw a straight line, right? So generally, the smart or the easy way to do it is if you have the capability — again, remember that InfluxDB lets you go all the way down to the nanosecond — you should be sampling at a minimum of twice the actual change. So that’ll at least give you a see-saw kind of thing where you might have had a nicely curved waveform. If you could do more, like if you can go to 10 [inaudible], as you can see here, or 6, or 2, or whatever, you’re going to have a much more accurate representation of the actual waveform which is going to — you’re sampling some data that hasn’t changed if you’re doing it right. But generally, you’re going to be able to see what did that curve look like? Or what did that waveform look like? Or what did that spike or that drop or that anomaly look like? And especially when you’re starting to get to stuff like analyzing all of the sensor data with machine learning algorithms and things like that, having that sort of granularity alongside that precision that we talked about a second ago is super important. There’s all kinds of really cool mechanisms right now for retaining data shape through post-processing. 
You can actually do some smoothing and curve fitting with functions like SDA and Holt-Winters and you can also, of course, always just retain all the data as fast as you can sample it. Some people actually do take that approach because it’s required of them from a compliance or a reason that’s dictated to them.
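The sampling point can be checked numerically. The following Python sketch is an illustration only, assuming an idealized waveform with a one-second period; the function name and the phase offset are choices made for this example, not anything shown in the webinar.

```python
import math


def sample_wave(period_s: float, interval_s: float, n_samples: int) -> list[float]:
    """Sample a unit-amplitude wave that starts at its peak.

    Values are rounded to 6 decimals so tiny floating-point noise does not
    hide the pattern.
    """
    return [
        round(math.sin(2 * math.pi * (i * interval_s) / period_s + math.pi / 2), 6)
        for i in range(n_samples)
    ]


# Sampling at exactly the period always lands on the same point of the wave.
at_period = sample_wave(period_s=1.0, interval_s=1.0, n_samples=5)

# Sampling twice per period alternates between peak and trough: a see-saw.
twice_per_period = sample_wave(period_s=1.0, interval_s=0.5, n_samples=5)

# Sampling ten times per period starts to trace the actual curve.
ten_per_period = sample_wave(period_s=1.0, interval_s=0.1, n_samples=11)

print(at_period)
print(twice_per_period)
print(ten_per_period)
```

The first series is flat, the second flips between the two extremes, and the third contains intermediate values that outline the curve.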
Brian Gilmore: 00:23:18.379 And then, of course, our Notebook interface is new, which allows you to actually — it’s not Jupyter Notebooks by any means, but it’s very similar in that you can sort of build like a really incremental step of let’s call the raw data. Let’s do a little bit of enrichment. Let’s look at it in its raw case. Let’s see how the different series align. Let’s look at that as a line chart. That kind of thing. It lets you take a much more incremental and iterative approach to understanding the data. And then the cool part is, is that you can like export that out very easily, right from the notebooks interface to any of our client libraries or directly to our APIs. So you use the interface to build the query to understand the data and then you give that to whatever other application needs to actually be able to use that data and it just all works seamlessly with tokens and everything like that. We do have a growing set of visualizations as well. So we have a good dashboarding library and you can do all that. Now all of that now sits on top of, as I said, this sort of like distributed edge data center and cloud model. Whereas on the edge, of course, you have all of these sensors. You’ve got application metrics. You’ve got microservices, events, and transactions or traces. You’ve got instrumentation of other technologies and systems. Generally, a lot of that’s coming in through Telegraf. We’re starting to see people actually installing edge nodes of InfluxDB directly on the machines or in the IoT gateways at the edge to give local persistence, which is really cool. If you want to send it to those local edge instances through the REST API or through the client libraries, feel free. Telegraf is also super powerful there because it gives you all that stuff. Like configuring the inputs, monitoring those inputs, doing your deadman checks, whatever you need to do. Timestamping, line breaking, white space handling. I mean, you guys can read the list. 
But it’s a really powerful agent for capturing data at the edge. But then if you stage it locally in those edge nodes, you can process it, analyze it, and use it to actually do sort of smaller or shorter, lower latency closed-loop analytics just by deploying Flux right there and doing that work at the edge. And so when we talked about that supervisory control, you can do that there. Sometimes people will send it up to a more centralized InfluxDB environment like InfluxDB Enterprise in their own data center, or of course, something like pushing it up to InfluxDB Cloud.
Brian Gilmore: 00:25:58.555 Now, the cool part is that this isn’t isolated. You can do both. You can have data at the edge nodes. Those edge nodes can be using replication, which we’re going to talk about here in a second, pushing data up to the cloud. And then your data center and your cloud InfluxDB can be using those same client libraries, those same REST APIs, and Telegraf to capture data that’s native there, where its point of origin is either in the data center or the cloud. And a lot of times we’re seeing folks bring that data in off of Kinesis or Kafka, the cloud service providers’ Pub/Sub mechanisms. I mean, not to talk about the demise of Google IoT, but we basically have two remaining there in AWS IoT and Azure IoT. And you can connect right up to those Pub/Sub mechanisms with either Telegraf connectors or, as we’ll talk a little bit about, the MQTT native, which we’re calling Cloud Native here. And ultimately with InfluxDB in the area where your data is generated, you’re going to have a limited latency capability locally that you didn’t have before. You’ll have the ability, through having many, many InfluxDBs as one, to store data in a more granular or highly precise mechanism like we talked about. And then you can orchestrate data in between those instances using a number of mechanisms like our edge data replication, which we’ll cover off on, or, as you guys know, the to() function in Flux, which actually lets you process data and then send the output of that local processing to another InfluxDB instance to store it in a bucket there. So imagine this sort of like highly distributed network of databases where they all are responsible for the local information, but then everything gets shared in a way that it can be consolidated and deployed through like mobile apps or things like that that have to come in through the cloud. It’s an incredibly powerful architecture. Again, I wish I had this 10 years ago.
Brian Gilmore: 00:28:00.419 Quickly talk about line protocol. I think this is the mechanism by which we format data to effectively and efficiently get it into the database. If you’re not familiar with it, there’s tons of documentation here. But essentially it’s getting data into this format where you have a measurement. That’s the measurement name. That’s sort of the overall pre-post bucket, but overall sort of like primary key other than time. And then you’ve got a tag set, of course, which makes that data searchable later. You’ve got a field set, which are your numeric values, and then a timestamp, which should be and can be as precise as you can support. To get data into this format, you can either do it yourself explicitly in your applications using the client libraries, or Telegraf will do a lot of the heavy lifting here for you with some of its outputs and things like that. So if you have a Telegraf instance and you’ve got a good input plugin set there and a good processor plugin set there and a good output plugin set there, all of that line protocol conversion will be done by those different sets of plugins, as compared to something that you have to do yourself.
Brian Gilmore: 00:29:14.440 Now, fields and tags, people have asked a lot about this. I just want to cover off quickly like what is a field, what is a tag. So think of tags as labels, right? They’re not values. You should think of those as something that will disambiguate similar signals, right? So you might have 50 temperatures, but each one of those is affiliated with a particular machine. So a tag might be like the machine ID. You might also have a tag for the lab ID in this case. And then if the sample that it’s sending is related to a specific test, you might also add patient ID as a tag. Now, the things that you’re going to want to analyze in a time series manner, those are fields. They’re primarily numerical. There can be other values, of course. But they’re not indexed like tags are because they’re highly variable. And if we kept an index for every single one of them, we would have index bloat and it just would not be as efficient as it could be. So a simple rule: think of tags as metadata and fields as the actual numerical value that you’re tracking over time.
Brian Gilmore: 00:30:18.741 To drill down a little bit on that — because I know sometimes people talk about cardinality — for me, I think the important thing about cardinality is, is that in no case do you want sort of runaway unbounded cardinality. Creating a new time series with every single event is just — there’s no practical use for that whatsoever. So when people talk about, “Oh, I had a 400 million cardinality database, and it was slow or whatever.” It’s like, “Yeah, well, that’s not —” I mean, we’re talking best practices here. That’s a worst practice, right? You want to consolidate your data into as few a time series as possible because that’s how it’s organized. And nobody wants to analyze unorganized data. If you think about sort of understanding cardinality, it really has to do with that combination of measurements. The sort of sensor ID or the other tags, right? And so if you’re not normalizing measurement names, tag names, and then the field keys, like the actual TTT temp HH, you’re going to have more cardinality than you need. So again, I would use all of the processing and the know-how that you can apply through your own code to normalize as much of that as possible just to make the whole thing more efficient. Now, if you find yourself in runaway cardinality situations, you can fix it, of course, later. But again, this idea of unbounded cardinality, unlimited cardinality, that’s great if you are effectively creating unlimited cardinality data sets. In all the time I’ve been working with data, I very rarely run into that. Almost all the time when there are cardinality problems, it’s because of a mistake in data hygiene. So just keep that in mind and try to get the data as clean and as sort of focused as possible before you put it out there.
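A small Python sketch can show how un-normalized names inflate cardinality. This is an illustration under stated assumptions: a series is treated here as the unique combination of measurement, tag set, and field key, and the alias tables and sample points are hypothetical.

```python
# Hypothetical alias tables mapping messy names to one canonical name.
MEASUREMENT_ALIASES = {"temp": "temperature"}
TAG_KEY_ALIASES = {"sensorid": "sensor_id"}
FIELD_KEY_ALIASES = {"ttt": "temp"}


def series_key(measurement, tags, field_key):
    """One series = measurement + sorted tag set + field key."""
    return (measurement, tuple(sorted(tags.items())), field_key)


def cardinality(points) -> int:
    """Count distinct series across (measurement, tags, field_key) points."""
    return len({series_key(*point) for point in points})


def normalize(point):
    """Lowercase every name, then apply the alias tables."""
    measurement, tags, field_key = point
    m, f = measurement.lower(), field_key.lower()
    return (
        MEASUREMENT_ALIASES.get(m, m),
        {TAG_KEY_ALIASES.get(k.lower(), k.lower()): v for k, v in tags.items()},
        FIELD_KEY_ALIASES.get(f, f),
    )


# The same physical signal, written three inconsistent ways.
raw_points = [
    ("Temp", {"sensor_id": "s1"}, "TTT"),
    ("temperature", {"sensor_id": "s1"}, "temp"),
    ("temp", {"SensorID": "s1"}, "temp"),
]
clean_points = [normalize(p) for p in raw_points]

# A per-event tag value creates a new series for every single point.
runaway_points = [
    ("temperature", {"sensor_id": "s1", "event_id": str(i)}, "temp") for i in range(1000)
]

print(cardinality(raw_points), cardinality(clean_points), cardinality(runaway_points))
```

Three inconsistent spellings count as three series until they are normalized into one, while a unique tag value per event yields one series per point, which is the runaway case described above.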
Brian Gilmore: 00:35:09.180 Now, we talked a little bit about map and reduce before. Those are secret weapons in Flux. I love map. For those of you who are sort of looking to take those metrics and events and turn them into those traces or transactions, we have a really cool state duration capability where you basically define what the parameters of a state are (above 50, it’s hot; below 50, it’s cold) and then you can actually run state duration on that evaluation to understand how long a particular time series is in any one of those conditions. If you’re monitoring industrial or IoT devices, huge, huge, huge. I would use it daily if I was actually still doing this stuff in the field. And then all those custom functions, too. You can create your own functions. You can extend the existing Flux functions. It’s very powerful. Now, I will say Flux has got a learning curve. Like anything else, it’s going to take some time to really get to know it. But the people who have mastered it are doing wonderful things with it.
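The state duration idea can be sketched in plain Python. This is not Flux and not how Flux implements it; it only illustrates the concept, assuming the "above 50 is hot" rule from the talk, treating exactly 50 as cold, and measuring each run from its first point to its last. The readings are hypothetical.

```python
def state_runs(points, threshold=50.0):
    """Group time-ordered (timestamp_s, value) points into consecutive states.

    Returns (state, start_s, duration_s) tuples, where duration is the time
    from the first to the last point observed in that run.
    """
    runs = []
    for timestamp_s, value in points:
        state = "hot" if value > threshold else "cold"
        if runs and runs[-1][0] == state:
            # Same state as the previous point: extend the current run.
            runs[-1][2] = timestamp_s - runs[-1][1]
        else:
            # State changed (or first point): start a new run.
            runs.append([state, timestamp_s, 0])
    return [tuple(run) for run in runs]


# Hypothetical temperature readings every 10 seconds.
readings = [(0, 45.0), (10, 48.0), (20, 55.0), (30, 60.0), (40, 58.0), (50, 47.0)]
runs = state_runs(readings)
print(runs)
```

Each tuple is one bundle of the kind described above: a condition, when it started, and how long the series stayed in it.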
Brian Gilmore: 00:36:08.490 And the reason they take the time to get to know it is because it is so deeply integrated with InfluxDB. It is a unifying language now across 1.X., 2.X. And I think for those of you who have heard of IOx and know what’s coming next, it will, of course, pull that in as well. And then the other thing it does, for those of you who are actually developing applications is, think about if you had to run a function in your application where the raw data you required was in the terabytes. You don’t want to pull all of that data out into your application to run your own application logic on it in your app. Whatever it might be doing. Anomaly detection, machine learning application, or whatever. What Flux allows you to do is to actually stage a portion of the analytic that you need to do for your application at the database layer. So by implementing it in Flux and getting the amount of data down to the littlest or the smallest amount that you need to actually pass to your application through the API, you can ultimately distribute the compute that’s required to deliver your application. You can get most of it down to the database, which is extremely powerful. It will save you on data ingress and egress costs in the cloud. It’ll save you storage memory costs in your application. If you’re running serverless, you’ll run fewer bots. Whatever you’re going to do, I just think it’s a really good practice. And of course, language is very functional. And it’s getting better every day, which is exciting. I mean, I don’t think people really realize that with all Flux can do, it still — it’s not even a 1.0 release. It’s still pretty experimental.
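The payoff of reducing data before it leaves the database can be illustrated with a windowed mean. The Python below only simulates the effect on data volume; in InfluxDB the reduction would run in Flux at the database layer, and the function name, window size, and synthetic readings here are assumptions made for the example.

```python
from collections import defaultdict
from statistics import fmean


def downsample_mean(points, every_s: int):
    """Average (timestamp_s, value) points into fixed windows of `every_s` seconds.

    Returns (window_start_s, mean_value) pairs ordered by window start.
    """
    windows = defaultdict(list)
    for timestamp_s, value in points:
        windows[timestamp_s - timestamp_s % every_s].append(value)
    return [(start, fmean(values)) for start, values in sorted(windows.items())]


# Synthetic raw data: one reading per second for ten minutes.
raw = [(t, float(t % 10)) for t in range(600)]

# What the application would receive if the reduction happened upstream.
summary = downsample_mean(raw, every_s=60)

print(len(raw), "raw points ->", len(summary), "points sent to the application")
```

The application receives one point per minute instead of one per second, so the transfer and memory cost falls with the window size chosen.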
Brian Gilmore: 00:37:55.257 So last two things before we hop on and do a quick demo. I wanted to talk about two features that I think we’ve released just in the past year, which are revolutionary for folks who are trying to analyze IoT and industrial IoT data. If we look at EDR, Influx Edge Data Replication, this is our first acronymed product. It’s very exciting. This came from a question of, “Why are we keeping people from running both open source and cloud?” Because we were presenting — because cloud was such a powerful solution, we were kind of presenting it as an alternative to OSS. And so we had customers out there who were building themselves — big oil and gas companies who were running OSS on, say, an oil rig and then passing that information up to their cloud account. But we didn’t have a way that everybody could do that. So a lot of it was already in there, but we exposed a mechanism by which you can have an InfluxDB OSS instance running as close to where the data is being created as you can get it, right? So if it’s in the electrical panel, put the computer in and install it right down in the electrical panel if you need it. And you can collect and store that data locally in that database for local use if you need that high-precision data there for local use or just to have it stored somewhere in a super high-precision manner.
Brian Gilmore: 00:39:20.805 Now, because with that OSS database you also get a Flux engine, you can process that data using Flux tasks or whatever you want to do, Flux queries that are triggered by anomalies, whatever it might be. And those queries can take that data in near real time and put it into another bucket on that local instance. So you might have millions of events at high granularity running through that Flux query and now you’re down to hundreds or thousands of them. You put that into a bucket that you’ve configured to replicate to the cloud. And as soon as that data is put into the queue to go into that new bucket, it also gets put into this new queue we have, which is a durable disk-backed queue. It runs right there locally. It’s configured through the API to actually connect up to a remote InfluxDB instance in the cloud. And as that data hits, it gets replicated to that queue. And as quickly as the network allows, that data will also get moved to a remote bucket in InfluxDB Cloud. So imagine a situation where you have high-granularity data being stored at the edge. You’ve got an anomaly detection algorithm that runs there at the edge. When an anomaly is detected, it takes the high-granularity data, plus the anomaly, puts it into another bucket, moves it up to the cloud. And then in the cloud that now high-granularity data from around the anomaly and the anomaly itself is used to update and retrain your machine learning models, right? And then you can pass those machine learning models back down to the edge for application. And it’s like, that’s Holy Grail stuff. I mean, we’ve been dying to do that for a really long time. And now it’s all there for you. You just have to sort of use and take advantage of this new capability. And it really is only two API calls, right? Create a remote, which is the connection up to your cloud instance, with your token information, all of that, and then create your replication bucket. 
And boom, you’ll have intelligently and strategically orchestrated data moving around your network.
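The two API calls mentioned above can be sketched as request bodies. This Python only builds the payloads and sends nothing. The endpoint paths (`/api/v2/remotes` and `/api/v2/replications`) and the field names reflect the InfluxDB OSS v2 API as best understood here and should be checked against the API reference for your version; additional fields such as queue size limits may be required. Every ID, URL, and token is a placeholder, and the helper names are hypothetical.

```python
import json


def remote_request(org_id, cloud_url, cloud_token, cloud_org_id):
    """Call 1: describe the connection from the edge instance up to the cloud."""
    return "/api/v2/remotes", {
        "name": "cloud-remote",
        "orgID": org_id,                 # org on the local OSS instance
        "remoteURL": cloud_url,          # InfluxDB Cloud URL
        "remoteAPIToken": cloud_token,   # token with write access in the cloud
        "remoteOrgID": cloud_org_id,     # org in the cloud account
        "allowInsecureTLS": False,
    }


def replication_request(org_id, remote_id, local_bucket_id, remote_bucket_id):
    """Call 2: replicate writes from a local bucket to a bucket on the remote."""
    return "/api/v2/replications", {
        "name": "edge-to-cloud",
        "orgID": org_id,
        "remoteID": remote_id,           # ID returned by call 1
        "localBucketID": local_bucket_id,
        "remoteBucketID": remote_bucket_id,
    }


# Placeholder values only; substitute real IDs and credentials.
remote_path, remote_body = remote_request(
    "LOCAL_ORG_ID", "https://cloud.example.invalid", "CLOUD_TOKEN", "CLOUD_ORG_ID"
)
repl_path, repl_body = replication_request(
    "LOCAL_ORG_ID", "REMOTE_ID_FROM_CALL_1", "LOCAL_BUCKET_ID", "CLOUD_BUCKET_ID"
)

for path, body in ((remote_path, remote_body), (repl_path, repl_body)):
    print("POST", path)
    print(json.dumps(body, indent=2))
```

Each body would be sent as JSON to the local OSS instance with an authorization token for that instance; the second call depends on the ID returned by the first.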
Brian Gilmore: 00:41:25.619 And the second one, and this is what I’m actually going to demo here — which I think is pretty exciting — provided that the demo gods all agree. We all love Telegraf. Don’t get me wrong. But there was a point where people were spinning up Telegraf in containers to connect or collect data from a cloud-based MQTT broker just to move that data into a cloud-based InfluxDB. That didn’t make sense. It was an extra step. It was unnecessary. So what we’ve done is we’ve created — and I think there’s going to be a lot of cool stuff that we can do with this mechanism in the future. We’ve created a new tier of capability in InfluxDB Cloud called Native Collector. And under the hood for those of you who are fans of the under-the-hood stuff, it’s NiFi. We’ve written our own NiFi service. It’s really, really cool. But what it allows you to do is, right through the API or the UI of InfluxDB Cloud, you can actually reach out to an MQTT broker. You subscribe to topics. And as those topics are sort of populated with new data from whatever devices are downstream of that broker, that data will automatically get ingested in InfluxDB. Now, those of you who work with MQTT will go, “Well, wait a second. MQTT is completely unstructured. Do people need to send line protocol as a message over topic?” You can, and that’s one of the options. But we also now have, because of that NiFi, a processing layer that allows you to use regular expressions or JSON path information to actually extract information from the very unstructured MQTT messages, pull out the timestamps, the measurement names, what you want to be tags, what you want to be fields — again, please remember all of those best practices from a tags and fields perspective — and ingest that data extremely quickly. And so now you have your born-in-cloud information being stored in your InfluxDB Cloud.
Using the edge data replication stuff, your born-on-the-edge data is being stored on the edge but moved into the cloud as it’s strategically required. And this is a really, really, really powerful, highly distributed time series database model. And I think as you guys start to play with it and figure it out, you’ll like it. But what I want to do with the demo is I want to show you sort of how easy it is. And I’m going to stop sharing here quickly and I’m going to move over to sharing my screen here. Hold on. Caitlin, do you have any quick questions we can do while I’m —?
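The mapping from an unstructured MQTT message to a measurement, tags, fields, and a timestamp can be sketched in a few lines. This Python is an illustration of that mapping only, not how Native Collector is implemented or configured: the JSON payload, its key names, the measurement name `iot`, and the function `json_to_line_protocol` are all assumptions, and the escaping covers only the common cases.

```python
import json


def _esc(text):
    # Escape characters that are special in tag keys, tag values and field keys.
    return text.replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ")


def _field(value):
    # Render a field value using line protocol type conventions.
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    return '"' + str(value).replace("\\", "\\\\").replace('"', '\\"') + '"'


def json_to_line_protocol(payload, measurement, tag_keys, field_keys, time_key):
    """Turn one JSON message into one line of line protocol."""
    msg = json.loads(payload)
    name = measurement.replace(",", "\\,").replace(" ", "\\ ")
    tags = "".join(f",{_esc(k)}={_esc(str(msg[k]))}" for k in tag_keys)
    fields = ",".join(f"{_esc(k)}={_field(msg[k])}" for k in field_keys)
    return f"{name}{tags} {fields} {int(msg[time_key])}"


# Hypothetical message as it might arrive on a topic
payload = (
    '{"device_id": "pump-7", "site": "rig 12", '
    '"temp_c": 71.5, "rpm": 1450, "ts": 1663056000000000000}'
)

line = json_to_line_protocol(
    payload,
    measurement="iot",
    tag_keys=["device_id", "site"],   # low-cardinality descriptors
    field_keys=["temp_c", "rpm"],     # the measured values
    time_key="ts",                    # nanosecond timestamp
)
print(line)
```

Choosing which keys become tags and which become fields is the same design decision the talk refers to: tags are indexed descriptors, fields are the values being measured.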
Caitlin Croft: 00:44:07.578 Yeah, there’s tons of questions here.
Brian Gilmore: 00:44:10.851 Oh, my goodness.
Caitlin Croft: 00:44:12.667 Can Telegraf access XML or CSV files as data sources in an FTP server?
Brian Gilmore: 00:44:22.827 Well, that’s a good question. I think if there’s a — I think there’s probably an FTP plug-in. I would just search for that. And then the XML side shouldn’t be too difficult. Because even if there isn’t already a processor for parsing that XML, you should be able to just quickly write up a parser using a parsing plugin or a decoration plug-in to sort of iterate through that XML and use your own rules to grab it. I’ve never done that, but I wouldn’t see why not. And if it can’t do it, tell us, please. [laughter]
Caitlin Croft: 00:44:59.995 Yeah. We definitely always appreciate feedback from our community members and what they’re looking for. Because if you’re looking for it, I’m sure there’s someone else out there who’s also thought the same thing.
Brian Gilmore: 00:45:09.784 Exactly. Oh, God. All right. So I’m going to —
Caitlin Croft: 00:45:15.777 Are you ready to go?
Brian Gilmore: 00:45:17.253 Yeah. [crosstalk].
Caitlin Croft: 00:45:17.773 Okay. We’ll take the rest of the questions at the end. Don’t worry, we haven’t forgotten the rest of your questions.
Brian Gilmore: 00:45:23.237 And we’re doing okay on time?
Caitlin Croft: 00:45:24.984 Yep, we’re good.
Brian Gilmore: 00:45:26.754 All right. Cool. So I’m going to go back here. Let me extend it here. You guys can probably see now just my main slides, right?
Caitlin Croft: 00:45:34.762 Yes. And I can also see a terminal window in the back.
Brian Gilmore: 00:45:38.248 Okay. That’s awesome. All right. So let’s start here. So all I did was I created a new InfluxDB account, right? And you’ll see down here if you go — one of the things we had to do, because we are running a new service underneath that is not by any means free for us to run, and we know you guys want us to stick around as long as we do, we do require that you actually sign up for not a totally free cloud account. But we’ve started this thing with Cloud Credit. So I think almost every user now gets — when they sign up for that pay-as-you-go account, you get a $250 credit. That will be more than you need to try this feature out. So the first thing I would do is I would go in and I would create a bucket specifically for MQTT. I’ve already done that. And then what we want to do is we want to go and we want to go to the little data thing. And you’ll see down here there’s this new sort of Native Subscriptions button, right? So what this Native Subscriptions button does is it allows you to actually create, as we were talking about, a new subscription to an MQTT broker. Now right now, you’ll see here the only protocol that’s available is MQTT. Stand by. Stay tuned. We’re going to come up with more there as we go. We have a couple of cool ones in the pipeline, the product pipeline, as well. But ultimately, you’re going to put your subscription name in. You can put a description in if you want. You’re going to select your protocol. You put in the hostname or the IP address of your accessible MQTT broker, which port you’re using, if it’s SSL or not, and then what your security is. And so for this demo, I’m just doing a — off of a free public broker from HiveMQ. I’ll show you how that works here in a second. But you can also do basic and you can even do certificate-based as soon as we release it. There’s a few details on that we’re still working out. 
But we wanted to get this in your hands now so that you could at least start using the basics of the capability.
Brian Gilmore: 00:47:45.288 But I already created one for this webinar — and we can take a look here — because I wanted to test it first. I didn’t want to leave that much trust to the demo gods. And you’ll see here that I’m actually connecting — oh, it’s really light. But I’m connecting up to the free broker at HiveMQ. Which if you guys are familiar with our partner HiveMQ, it’s awesome, especially for prototyping this stuff. I’m just connecting up to the non-SSL port of 1883. Oh, here’s a better place to see it. You see broker.hivemq.com:1883. A very simple connection there. And then I’ve created a new topic called InfluxDB_webinar. And for those of you who are super MQTT savvy, I’m sure you are frantically trying to figure out how to post data to it. I would thank you if you didn’t because we have to go all the way through this, but. And then I say, “Okay. Any of that data that comes in and wants to go to MQTT.” Here in terms of the data parsing, just to keep things simple and to appease those demo gods, I left it as line protocol. But again, remember, you can do regular expression and you can do JSON path. And I set the timestamp precision to nanoseconds just because I can. And it’s not that much data to begin with so may as well keep it as big as I can. And when you create that, you go through all that, you’ll see it’s running.
Brian Gilmore: 00:49:10.103 To go to the other side a little bit, we’ll look at HiveMQ, right? So HiveMQ, again, they have — this is the web socket client, but it connects to the same public server that this subscription is attached to. So I’m going to connect up to that server, please. Yes. Hey. Cool. And then, like we said, the topic was InfluxDB-Webinar. And we don’t need to worry about any of this. I’m going to add a subscription to that same one too, just so we can see it work here first. InfluxDB-Webinar, and then I’ll click subscribe. And so it’s got that going there. I’m connected. I’m pretty set on my topic here. Now, for my actual data, what I’m going to do is there’s a product I use a lot for demoing and things like that. If you guys aren’t familiar with Mockaroo and you’re building applications, it is very cool in that it lets you create highly realistic sample data. It also lets you create highly realistic mock APIs. And so I use this all the time for prototyping and things like that. It’s not a sales call or anything for these guys, but it’s totally free and you can do really cool stuff. So one of the things they have is once you create sort of these rules for creating the sample data, you can actually choose InfluxDB line protocol as an output. And the way I have this setup, I’m just going to take 100 rows and we’ll just do — what is today? Today is the 13th. Oh, sorry. So I’ll do the 12th to the 13th as my time range. Yeah. And then this format stuff doesn’t matter here because it is going to just output in nanosecond precision. And then, of course, if you wanted to do this all with curl and then use your own MQTT client to post it, you could do that. There’s also a public URL. But I’m just going to go ahead and I’m going to download this data. And now I have this new file of line protocol. Let me get rid of the one I was testing last night. And so this line protocol data is all in that sort of demo data model that I did.
Brian Gilmore: 00:51:27.642 So if I take a particular event like this one. I’m going to copy it out. I’m going to go back here and I’m going to go — I wish Zoom had a better way to keep your tabs. I’m going to go back to HiveMQ. I’m going to just paste that line protocol in the message. And fingers crossed. Drumroll. Boom. Okay. So I posted it to InfluxDB webinar. I’m subscribed to InfluxDB webinar here. And you’ll see that this is on — that message is from the subscription. So we went through. It passed. And if we go here now, we go back to our Data Explorer. If we go to MQTT — let’s just change the time zone. We’ll do today because it should be good. And do IoT. And we’ll just submit this and see what we’ve got. Let’s do past two days. Okay. Cool. So there it was. So you’ll see there’s a bunch of data points in here that are actually related to the data I sent yesterday. That one data point, I’m not sure where that went. But if I go back, this should work pretty well here if I go back to that file. I’m just going to — wait. Here. I’m going to just take it all. And I’m actually going to pass it all to that here. So I’m just going to delete you. I’m going to paste it all in. I’m going to publish. So again, line protocol. Because it’s the end of lines, you can publish it all in one block, which is great because the byte size of an actual MQTT topic is quite large. So you can pass quite a bit of data through one topic with one sample, as long as you maintain those line breaks. Yay. Thank you. Completely unstructured data format. You can do some really cool stuff there.
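Publishing many line protocol records in one MQTT message, as done in the demo, comes down to joining them with line breaks and staying under whatever payload limit the broker enforces. The following Python is a minimal sketch of that batching step only; the function name `batch_payloads`, the sample records, and the 120-byte limit are assumptions chosen to keep the example small, and the actual publish call is left to whichever MQTT client is in use.

```python
def batch_payloads(lines, max_bytes):
    """Group line protocol records into newline-delimited payloads, each at
    most max_bytes long. A single record longer than max_bytes is still
    emitted on its own rather than being split."""
    batch, size = [], 0
    for line in lines:
        line_bytes = len(line.encode("utf-8"))
        extra = line_bytes + (1 if batch else 0)  # +1 for the joining newline
        if batch and size + extra > max_bytes:
            yield "\n".join(batch)
            batch, size = [line], line_bytes
        else:
            batch.append(line)
            size += extra
    if batch:
        yield "\n".join(batch)


# Hypothetical records with nanosecond timestamps
lines = [
    f"iot,device_id=dev{i} temp_c={20 + i}.0 {1663056000000000000 + i}"
    for i in range(5)
]

payloads = list(batch_payloads(lines, max_bytes=120))
for number, payload in enumerate(payloads, start=1):
    print(f"message {number}: {len(payload.encode('utf-8'))} bytes")
```

Each resulting payload can be published as one message; the line breaks are what allow the receiving side to treat it as separate records.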
Brian Gilmore: 00:53:27.545 And again, if you take a look here, if we go back to just past 24 hours, hopefully, now — we click submit — we’ll see there’s the data that we passed just right there. And let me go to table so we can see the — oh, I don’t know. Well, the demo gods partially — let me see if I can do it through the notebooks part. Let’s do this, and do — let’s just do past 24 hours again. Anybody who wants to troubleshoot me right here in line will be my huge hero. So there we go. Yeah, here we go now. So you can see we’ve got data from the last 24 hours as I created it. Many, many, many rows. The fields, the values, the tags, the measurements are all quite well done. And then, of course, I didn’t load a ton of — I did 100 samples for 100 device IDs, so we’re not going to see any lines. But it’s all in now. And the only thing we did was we posted that data to the MQTT broker and then we configured InfluxDB to subscribe to that broker, and now the data’s in. Just that simple. So I’m going to stop sharing or I’m going to stop that share. I’m going to go back to sharing my deck. And then Caitlin, I think we’re ready to move on.
Caitlin Croft: 00:55:09.727 Sounds good. Do you want to go through another question while you — oh, never mind.
Brian Gilmore: 00:55:20.079 Here we go.
Caitlin Croft: 00:55:20.919 Go for it.
Brian Gilmore: 00:55:21.820 Well, we don’t need any of this because the demo gods were friendly. But, I mean, if anybody wants a walk through this, there’s already one in blogs and stuff like that, so that you can try all of this yourself. But yeah, just let us know and we can send those instructions over if you want to check it out.
Caitlin Croft: 00:55:43.442 Thank you, everyone. And thank you, Brian. I think that was a fantastic overview and some good tips and tricks of using the platform. So I know there’s a few questions already in the queue, but if you guys have any other questions for Brian, please feel free to post them in the Q&A. I just want to remind everyone, as promised, InfluxDB U, it’s amazing. It’s amazing how many courses we already have, considering we only launched it about six months ago. So be sure to check out InfluxDB U. It is completely free. So there’s tons of live and self-paced trainings on InfluxDB, Telegraf, Flux, Kapacitor, and other stuff. So just go to influxdbu.com or you can scan the QR code. Brian, if you can go to the next slide? So it’s a really nice interface. It’s really easy to join. And you can check out the course catalog on the website. And once you finish courses, there are course completion certifications issued by Credly. So if you get some badges, be sure to share them on social media and use #InfluxDB and we’ll find it and like it. And it just makes it really easy for us to find them.
Caitlin Croft: 00:57:05.955 And as promised, InfluxDays. So I’m sure some of you have already attended InfluxDays. It is our annual user conference. And this year it is on November 2nd and 3rd. It is completely free, so be sure to register. I’ve already seen the schedule, the agenda this year. There’s some amazing sessions and there are going to be some specific ones on IoT, so be sure to check it out. And in addition to the conference, we have a couple of different trainings. Brian, if you can go to the next slide? So on November 1st, we have Taming the Tiger training, which is our Telegraf and InfluxDB training. And Taming the Telegraf Tiger training will be virtual on Pacific time zone. The advanced Flux training. We’ve always offered Flux training, but now it’s actually the advanced Flux training. And that will be on November 8th and 9th in person in London. So it’s limited to 50 people. It’s £500. So be sure to register for it if you are interested. Kevin asked if it’s going to be virtual. The sessions themselves will be virtual, but we are going to have in-person watch parties. Like I mentioned, we want to get together with our community. Hang out with you guys. See what you guys have been up to in the last couple of years. So it’s definitely a hybrid event with some of the training virtual, some of it in person, so. And it’s completely free to join the conference. So really excited to see everyone there. All right. So let’s go into some of these questions. Can InfluxDB hold video snippets with the transactions?
Brian Gilmore: 00:59:06.822 It can. What you would want to do is you would want to serialize that data into some other location, like storage. You could put it in S3, or you could store it anywhere. And what I would recommend doing there is, in your time series, instead of actually storing the encoded video itself, which is not really something that you would want to take a max or average of over time, you can store all of the metadata as fields. And then, of course, you could create pointers to those videos. So essentially what you would have is virtualized video stored in your time series. And then you could use any of the APIs or any other activity to actually — when a time series results in a pointer to a particular video, then you could automatically render that video for the user in your own client. So I don’t think you would store the video in InfluxDB. But to your user, it wouldn’t make any difference.
Caitlin Croft: 01:00:06.728 Cool. Do you have any Flux examples of aligning and merging multiple data streams?
Brian Gilmore: 01:00:13.637 We do. We do. And I think we can point to some of those both in the awesome book, Caitlin, as well as on the Flux examples that are in the documentation. I don’t have any at the ready right now, though.
Caitlin Croft: 01:00:29.418 Okay. No problem. I’m just looking for — I think it’s just called Time to Awesome, isn’t it?
Brian Gilmore: 01:00:34.089 Yeah. It’s awesome.influxdata.com.
Caitlin Croft: 01:00:36.982 Okay. Okay. So I’ve thrown in the Zoom chat the link to what Brian is referring to, so be sure to check that out. Let’s see. Do you have any success stories with Modbus RTU data source? I do know that we have some customers who use Modbus. I’ll be honest, I can’t think of any right off the top of my head, Kevin, but I’m happy to reach out to you afterwards and send you some links. Brian, can you think of any?
Brian Gilmore: 01:01:16.630 Yeah. I mean, RTU, I wouldn’t — I mean, I would just use any of the many Modbus gateways out there to get that up to Modbus TCP. And then, of course, because of the sort of hoops you need to jump through with the register addressing in RTU versus TCP, your data’s going to be in a much better situation to actually pull it using the Telegraf Modbus input. So I would use the gateway there, in that case, to get it off of RTU and onto Modbus TCP, and you should be good to go.
Caitlin Croft: 01:01:46.166 Okay, cool. Someone is asking to see the query plans. I think the best option is to look at the pricing page. Brian, do you have any information?
Brian Gilmore: 01:01:57.533 I think what he’s actually or what they’re actually referring to there is the query planner for InfluxDB. And I saw that, and I was like, “Oh, that’s not something I know and that’s something I want to know.” I don’t know of a way that we have any way to introspect or inspect the actual query plans. But that’s a really great question and we will look into that for you. I’m sure somebody knows it’s just not me. [laughter]
Caitlin Croft: 01:02:19.614 Okay. Cool. Well, I’m glad. I just assumed it was a pricing question. [laughter] Let’s see. So geolocation — latitude, longitude — are tags, right? How does InfluxDB work with geolocated data?
Brian Gilmore: 01:02:34.951 Sure. So that’s a both answer. So 9 times out of 10, geolocation data, especially for moderately mobile assets, yes, you should consider latitude and longitude as tags. It is a description of where that particular piece of information came from. And if it’s not moving with every sample, like a drone or something like that, then yeah, you would want to save them as tags. If latitude and longitude, number one, change with every single value — so if you have a drone flying around the world and taking temperatures — if those are changing a lot, latitude and longitude, you want to save them as fields. Because if you save them as tags, they will get indexed and then you will get into one of those cardinality issues, especially if you’re saving your latitudes down to seven significant digits. Now, for InfluxDB working with geolocation data, there is an awesome sort of geo-temporal library included. It allows you to actually map your geolocation — your latitude and longitudes — to one of the — it’s the Google, I think, S2 or whatever library so that you can actually do some aggregation by geoset. And it’s all there in the documentation. If you look at Flux, just look for geo libraries and geo-temporal and you’ll see all the documentation for how you actually can work with it in InfluxDB and in Flux.
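The cardinality concern with coordinates as tags can be made concrete with a small count. This Python is a rough sketch that treats each distinct combination of tag values as one series; the drone data, the key names, and the helper `series_count` are assumptions for illustration, and real series cardinality also depends on measurements and field keys.

```python
def series_count(points, tag_keys):
    """Count distinct tag-value combinations across a set of points."""
    return len({tuple(point[key] for key in tag_keys) for point in points})


# Hypothetical drone: 1000 samples, position changes on every sample
points = [
    {
        "device_id": "drone-1",
        "lat": f"{40 + i * 0.0001:.7f}",
        "lon": f"{-105 - i * 0.0001:.7f}",
        "temp_c": 20.0,
    }
    for i in range(1000)
]

# Coordinates stored as tags: every new position creates a new series
as_tags = series_count(points, ["device_id", "lat", "lon"])

# Coordinates stored as fields: only the device ID is indexed
as_fields = series_count(points, ["device_id"])

print("lat/lon as tags:  ", as_tags, "series")
print("lat/lon as fields:", as_fields, "series")
```

For a stationary sensor the two counts would match, which is why tags are reasonable when the position rarely changes.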
Caitlin Croft: 01:04:02.148 Perfect. Let’s see. Quote, Flux is not a query language. When would you build queries in SQL instead? Is the performance in SQL comparable or better? If Flux —?
Brian Gilmore: 01:04:16.800 Yeah. And as soon as I said that. So what I really mean is that Flux is not only a query language, right? So the first part of Flux, of course, the part we would consider to be filters, it is much like a query language, right? You’re going out, you’re seeking data in buckets, and you’re pulling it into tables for reference, and then you’re going to process those. But after that, it is much more than a query language. In terms of where you would use SQL versus where you would use Flux, the SQL commands in Flux are where I would start. So if you have a SQL database, you have a pre-existing SQL view where you can just grab that data directly using a select from. Or if you want to create an ephemeral view in Flux, you can use the SQL command to pull the data out of that SQL database and into Influx or into Flux. And then you can also — you can use our joins or whatever to pull data in from your time series data in your Flux buckets, and now you have this. You’ve got a script that has both the relational data from the SQL database and the time series data together, and then you can work all that Flux magic to do your enrichment or your normalization or whatever. So it’s not really a performance thing. It’s like if you want to just run SQL against relational databases, I wouldn’t even bother with Flux. But if you’re going to do anything in terms of mashing it up or normalizing it or joining it with time series data, then use SQL to get it out of SQL, use Flux to get it out of the buckets, and then work your magic in Flux from there on.
Caitlin Croft: 01:05:54.370 Cool. All right. I know we’re completely over time. We’re going to take one more question. So thank you everyone for sticking with us. What are some best practices for handling sensor data via MQTT which needs to be pre-processed before storing it in InfluxDB?
Caitlin Croft: 01:07:38.793 Awesome. Thank you, Brian. And thank you, everyone, for joining today’s webinar. Just want to remind everyone again, it has been recorded and will be made available for replay probably by tomorrow morning. Don’t be shy. If you have any other questions for Brian and you forgot them or you think of them afterwards, please feel free to email me. I’m happy to put you in contact with him and get your questions answered. Definitely check out InfluxDB if you haven’t used it yet. InfluxDB Cloud is pretty amazing. Take advantage of those free credits and try out the new MQTT Native Collector. Anything else you want to add, Brian?
Brian Gilmore: 01:08:25.492 No. I mean, just thank you all for the questions. And it’s great to see questions where it’s clear that people are actually doing stuff. We’re getting way past that point where this IoT thing is sort of like pie in the sky, and it’s awesome to see people really considering these issues in a real and practical way. So thank you.
Caitlin Croft: 01:08:43.948 Thank you, everyone, and I hope you have a good day. Bye.
Director of IoT and Emerging Technology, InfluxData
Brian Gilmore is Director of IoT and Emerging Technologies at InfluxData, the creators of InfluxDB. He has focused the last decade of his career on working with organizations around the world to drive the unification of industrial and enterprise IoT with machine learning, cloud, and other truly transformational technology trends.