Modernizing Your Data Historian with InfluxDB
Session date: Jan 23, 2024 08:00am (Pacific Time)
A data historian is a type of software designed for capturing and storing time series data from industrial operations. Data historians are often a key part of the Industrial IoT ecosystem, where numerous devices and systems generate continuous data streams. Data historians are ideal for industrial automation and process control, whereas a time series database (TSDB) handles any data with a time stamp. InfluxDB is the purpose-built time series database used to collect, analyze, and store metric, event, and tracing data. With InfluxDB 3.0, developers can ingest billions of data points per second with unlimited cardinality.
In this webinar, learn how to modernize your current data historian with InfluxDB. With InfluxDB, customers gain the flexibility that comes with a cloud-native solution—including the wider ecosystem with 300+ integrations.
During this live session, Ben Corbett will dive into:
- TSDB vs. historian considerations
- InfluxDB 3.0 overview: Product overview and key features
- Live Demo: Architecture overview and tips/tricks
[et_pb_toggle _builder_version=”3.17.6” title=”Transcript” title_font_size=”26” border_width_all=”0px” border_width_bottom=”1px” module_class=”transcript-toggle” closed_toggle_background_color=”rgba(255,255,255,0)”]
Here is an unedited transcript of the webinar “Modernizing Your Data Historian with InfluxDB”. This is provided for those who prefer to read rather than watch the webinar. Please note that the transcript is raw. We apologize for any transcribing errors. Speakers:
- Caitlin Croft: Director of Marketing, InfluxData
- Ben Corbett: Solutions Engineer, InfluxData
CAITLIN CROFT: 00:01
Hello and welcome to today’s webinar. My name is Caitlin Croft, and I’m joined today by Ben Corbett who’s part of the team here at InfluxData. And today we will be talking about how to modernize your data historian with InfluxDB. Please post any questions you may have in the Q&A at the bottom of your Zoom screen. The session is being recorded and will be made available later today, as well as the slides. Without further ado, I’m going to hand things off to Ben.
BEN CORBETT: 00:31
Thanks so much, Caitlin, and yes, welcome to everyone who’s joining us from wherever you are. I think I’ve probably got one of the kindest time zones at the minute. It’s only 4:00 PM for me. So especially if you’re joining us outside of working hours, we really appreciate it, and hopefully, we can show you some fun stuff today. So today, I’m going to be doing something a little bit different from my sort of industrial IoT specialism, and we’re going to be zooming in a little bit more on how to modernize your data historian with InfluxDB. So, I think particularly towards the end of last year, a lot of the conferences and events that I was doing, it was just a theme that was becoming more and more popular among our customers, and we’re seeing more customers go along the journey towards thinking about adopting a time series database instead of their historian. And in this webinar, we’re going to be talking a little bit about that journey. So, I’ll intro myself really quickly. As Caitlin mentioned, my name’s Ben Corbett. I’m a solutions engineer at InfluxData. So, I’m part of the European team, and my responsibilities are to help customers to understand if InfluxDB is going to be a good fit for their use case. I come from a background in mechanical engineering and then was a kind of back-end software engineer developing IoT platforms. So, I worked a lot in manufacturing, energy, agriculture, construction, transport, and then kind of settled for about a year and a half in electric vehicles as well. And InfluxDB started to be quite central to a lot of the platforms that I was developing, so I came over to the dark side to join these guys. I think my two-year anniversary was a few days ago, so just over two years ago.
BEN CORBETT: 02:19
So, this is what I’m looking for you guys to get out of today. So I really want to give you, if nothing else, these three things: a high-level understanding - we only have an hour - of the key differences between data historians, also known as process historians, and time series databases, so kind of taking the InfluxDB part out of it - let’s try and stay technology agnostic - the challenges facing historian customers - so what are the constant things that our customers raise to us, the issues they want to solve with their historian and why they end up adopting solutions like InfluxDB? - and therefore, what benefits a time series database and specifically InfluxDB could give you as applied to those use cases. I’m going to try and achieve that through this loose agenda. I predict it will immediately get derailed as is the way, but we’ll see how we go. So, we’ll focus most of our time on the historian versus time series database section, cover a little bit about how InfluxDB version three, the latest version of Influx, also gives some of those benefits and meets some of those requirements. We’ll dig into integrations and partners, and I’ve got a couple of different customer architectures and examples to share with you before we move on to the Q&A. So, strap in.
BEN CORBETT: 03:40
So firstly, we’ll just start off with the definitions, right? For those of you that aren’t aware, a data historian or process historian is basically a domain-specific database. It’s been designed for the industrial setting and it’s typically deployed on-prem because it’s kind of been designed from the ground up by the OT users, operational technology users for the OT network. So that’s why it’s typically deployed on-prem in that kind of OT network. So yeah, it’s been designed, it’s been around for a couple of decades, and it’s designed for collecting, storing, retrieving that high-frequency data. So, it’s been pretty much the widely adopted industrial IoT database or database management system that’s been around for a few decades, and as such, it sits really nicely in that space, right? Very, very domain-specific. On the other hand of that, we have time series databases, which are more general-purpose. So, the way I want you to think about it is just what I’ve highlighted there in pink. So, time series databases are very general purpose, and they can be applied to any payloads with a timestamp. So, one of the high-level things will just be that image that I have in the middle. Really, I want you to start thinking about this as a kind of end-to-end solution on the data historian side because it’s very domain-specific, being the buy-off-the-shelf option. And then on the right-hand side, we’ve got the build option, which is your kind of time series database.
BEN CORBETT: 05:12
So, let’s dig into data historian pros. So obviously, the main thing I mentioned is it’s domain-specific, right? These things have been built from the ground up for manufacturing environments, and as such, they’re really well integrated into OT kind of process and control systems, so they work very, very well with these industrial standards. And last but not least, a really good pro is that they are end-to-end solutions, right? So very rich in the UIs. I was working with a customer just today who showed me a new data historian software, which gives them the ability to dig into power control or power quality queries. So, they’ve really got a feature for every customer requirement in there, and they encompass that domain-specific ecosystem really nicely. Some of the disadvantages that we see with data historians, so these are just the ones that we hear from our customers, right? So, they’re not very configurable. They’re quite rigid, so they’re not flexible in terms of changing requirements. This is also based on legacy tech, legacy proprietary tech that was developed over the last couple of decades. What this creates, what this seems to create is a bit of a walled garden of this proprietary technology, which can be seen to hamper your ability to adapt, innovate, and grow. So now that you’ve got an additional set of users which are looking to take advantage of the data that are stored in data historians like data scientists, data engineers, and even business analysts, their ability to integrate with this legacy proprietary tech is really difficult or impossible. So, kind of extending the functionality or adapting and growing based on your changing use cases based on digital transformations is becoming kind of increasingly important and really difficult with this kind of legacy tech.
BEN CORBETT: 07:05
Therefore, it’s tricky to integrate them with the modern data ecosystem, right? As I mentioned before, they’re typically, not always, but closed systems deployed on-prem, and this can create organizational silos. So, this is usually the customers that we chat to, right? They’ve got a couple of different data historians deployed in their plants, and what they want to do is collate all of that data, have a single source of truth and a single access point for all of their business operations, kind of KPIs, reporting, and analytics workloads, and you don’t have that with this kind of on-prem closed ecosystem. Last but not least, the kind of last two relate to each other, really, just because historians have been around for so long and they’re kind of the, I guess, the other thing that is widely industry adopted, it can create a little bit of a vendor lock-in situation. So, we’ve seen some of our customers that come over to adopt InfluxDB in this space are primarily running away from that. It can create a really tricky power dynamic in the renewal process in particular, which leads me on to the last point, cost efficiency, but also the cost model suitability. So obviously, an end-to-end, off-the-shelf solution that’s all-encompassing is going to be more expensive than the cost-efficient build option. But the model suitability was one thing I just wanted to have a nod towards. It depends on what data historian or process historian you’re using. But in some instances, one of the models we’ve seen is that you pay per tag or you pay per OPC UA tag that you’re collecting. This can create a really tricky cost model because you essentially pay the same for decommissioned machines that you would pay for new machines. So, it creates a really tricky situation where you want to use this system over a long period of time.
BEN CORBETT: 09:04
So, time series databases, what are the differences? So, they’re a lot newer. It’s based on modern open technology and built with the cloud in mind. This supports, and we’ll talk about it over the next couple of slides, the digital transformation movement and the Industry 4.0 transition, right? So, what we want to give you is application versatility. What I mean by that is, as a general-purpose database which uses this open and widely used tech, you can reuse us for different use cases. We regularly see industrial IoT customers, not just using us to store raw data, but also process data for a variety of analytical needs or applications. So essentially, time series databases, as a general-purpose database, can adapt to meet those different applications. Development agility is a really key point. So, this basically means, now that you have access to the best-in-class services in the cloud or using tools, using the open protocols that are available with this tech, it really enables your team to adapt, grow and develop, and iterate a lot faster. A lot faster, I should say. The last point there is just easy integrations with ecosystem tools. So, these representing the build option are a database which sits within the ecosystem and, as such, time series databases, Influx included, need to play nicely in that ecosystem. APIs, connectors, widely adopted industrial protocols, and third-party tooling means that, basically, choose the tool that your team are most happy with and use that to integrate with Influx.
BEN CORBETT: 10:48
A flexible query language and advanced analytics capabilities, so not just Influx, but many, many time series databases support a variety of query methods and widely adopted ones. In the latest version of Influx, it’s SQL, for example, and that also gives you integrations with SQL-based tooling. Really, really easy to use and really easy to adapt for advanced analytical queries as well. Time series databases specialize in real-time processing. It’s not the processing of historic data or batches, although, you can do that as well. Let’s have a look at real-time live data, and let’s produce monitoring and actionable insights on data now. That is what they specialize in and one of the core pitfalls of data historians. They represent a cost-efficient model. The licensing model typically is based on compute or usage, and it’s commercially scalable, right, as the cluster scales. Scalability and storage efficiency, obviously, based on open tech. This means that you can easily scale vertically or horizontally. And some of the new versions as well, as you’ll see in the later slides, are based on Kubernetes architectures, so with that scalability factor, they can scale to handle massive workloads.
BEN CORBETT: 12:09
Additional pros that are specific to InfluxDB is the scale option, right? So, with InfluxDB version 3, we’ve really pushed the upper bound on the kind of workloads that Influx can handle, tens of millions of values per second, and we’ve also added support for unlimited cardinality. So, you don’t have to be limited by the amount of tags that you’re going to be using in your data set and the metadata associated to your time series data. We can support it. We can support hybrid deployments. So, whether you want to deploy InfluxDB at the Edge to give you an Edge storage requirement or on-prem within your substation, your factory, your data center, or in the cloud, as that kind of single pane of glass, even all of these options, even for a single use case, have been deployed, so we support all of those. Edge Data Replication. One of the key pieces of functionality that historians did have was the ability to store forward. So, if you’ve got intermittent connectivity from the Edge to a centralized substation, you want the Edge to have some store-forward capabilities. And I’ll talk a little bit more later about how we’ve achieved that.
BEN CORBETT: 13:19
Obviously, dealing with lots of industrial protocols and machines, they’ve all got various schemas, right, weird and wonderful schemas. InfluxDB is really flexible when it comes to schemas. It’s schema on write. So that basically means you set up a table and you begin writing, and if you start writing in additional fields, you won’t get a reject. The table will be able to handle those schema changes as you continue to write to it. So really flexible for kind of an ecosystem of changing protocols and devices. Hot and cold storage tiers, new with the new version of Influx, but basically incentivizing customers to keep data in Influx for longer. You won’t have to move it out for commercial reasons now with cold storage. And of course, there’s a vast community and network of integrations. I think there’s almost 800,000 instances of InfluxDB open source out there in the wild, and even more instances of Telegraf out there. It’s got a really vibrant community around it, and as such, lots of partners and integrations which are building plugins for Influx. I find a new one every week. The cons, really simply put, it’s not domain-specific. This isn’t going to integrate really nicely with that OT network, and it is the build-versus-buy option, right? This will come with additional effort and a learning curve associated with building out your historian using InfluxDB. That basically means the bottom line is you need to leverage the ecosystem to fill in the gaps, to fill in that industry-specific capability. For example, functionality like a unified asset model, things like that, these are the things that a historian usually has built in as an end-to-end solution. But a time series database being the build option, you’ll need to leverage some of our recommendations in our partner ecosystem to be able to fill in those gaps. And that’s something that you should think about if you’re looking to make the switch.
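The schema-on-write behavior described at 13:19 can be illustrated with a minimal sketch in plain Python. It does not talk to InfluxDB, and the `SchemaOnWriteTable` class, its methods, and the `press_line` sample data are hypothetical names invented here. The only point it makes is that a write carrying a previously unseen field widens the table instead of being rejected, and rows written earlier simply have no value for the new field.

```python
from typing import Any, Dict, List, Optional


class SchemaOnWriteTable:
    """Toy in-memory table whose columns are discovered from incoming writes."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.columns: List[str] = []  # kept in the order each field was first seen
        self._rows: List[Dict[str, Any]] = []

    def write(self, timestamp: int, fields: Dict[str, Any]) -> None:
        # An unseen field name extends the schema; the write is not rejected.
        for key in fields:
            if key not in self.columns:
                self.columns.append(key)
        self._rows.append({"time": timestamp, **fields})

    def rows(self) -> List[Dict[str, Optional[Any]]]:
        # Rows written before a column existed report None for that column.
        return [
            {"time": r["time"], **{c: r.get(c) for c in self.columns}}
            for r in self._rows
        ]


table = SchemaOnWriteTable("press_line")
table.write(1, {"temperature": 71.5})
table.write(2, {"temperature": 72.0, "vibration": 0.03})  # new field appears
for row in table.rows():
    print(row)
```

In this toy model the first row reads back with `vibration` set to `None`, which mirrors the idea that a device can start reporting an extra signal without anyone altering the table first.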
BEN CORBETT: 15:17
So, I wanted to have a little bit of a nod towards digital transformations in general and Industry 4.0, and just to add another buzzword onto the slide presentation. So obviously, digital transformations are becoming more and more vital for a lot of our customers. Legacy historians, from our experience, have proved to be bottlenecks in that infrastructure. It’s just because they’re developed to play really nicely in the OT network and they’re using these proprietary formats that can really slow things down when you’re looking to use some of this best-in-class tech such as cloud computing, IoT, AI, and ML. They’re not designed to work with these kinds of tools and analytics solutions, not to mention closed systems, right? Closed systems that historians are, they’ve kind of been unable to adapt to these modern technological concepts that you can see on the left-hand side associated with Industry 4.0, and they’re often tied to vendors which are averse to this kind of innovation. Because of that closed nature, sites that do use these kinds of traditional process or data historians can become silos where data isn’t easily accessed or shared, much less with your organization and much less with third-party software and tools. So obviously, as digital transformations are going on, leaders in the space and every organization are looking towards this cutting-edge tech to improve its operational efficiency and all of the kind of value propositions that you see in the middle there. Let’s reduce wastage, let’s improve our supply chain, workplace safety, our sustainability, and energy management. How are we going to leverage this best-in-class tech to do that? And traditional data historians are kind of unable to meet these needs and new solutions are going to be required in the future.
BEN CORBETT: 17:16
There’s a little bit of a graph there that just shows about how we always like to recommend that improving your access to real-time data and your time to insights can move your cost and effort away from reactive break/fix maintenance, hopefully up to predictive, which is where the lower costs will be. This is one of the slides that I found recently and kind of adapted, and I wanted to just walk you through it. It’s one of our value propositions that we do for all of our sectors in InfluxDB. What we like to do is focus on the users, right? So here, in the top left, you have the typical users associated with traditional data historians, right? You’ve got all of your OT users, engineers, site managers, as well as some IT users, right, and software engineers. These are potentially the traditional users that would use a data historian. Now we’ve got these, kind of, new kids on the block. So, you’ve got data scientists, data engineers, and business analysts, which don’t come from the OT network. They’re typically from the centralized or maybe IT space. And how are we going to enable those people? So, these are the kind of challenges that we see across these users. You guys will all be familiar with those. So, you’ve got operational efficiency, always. We want to be able to have real-time data so we can action it now. Sustainability, energy management, quality control, cost efficiency, always on everyone’s mind. That data silos and connectivity point that can speed up innovation, always a key piece. The diversity of data sources in IoT continues to astound me. Every week a new gateway is on my radar that I need to be able to integrate with. So, we need to have a solution that has the ability to adapt to that constantly evolving ecosystem, not to mention higher resolution data, higher data volumes, because that’s what data scientists want when they’re producing their insights.
BEN CORBETT: 19:14
We go through an activity of kind of mapping those challenges to positive business outcomes and we map the capabilities of our products and platforms to those. And this is kind of the activity that we go through, but I won’t go through them all one by one. But you can kind of see here, Edge Data Replication is one of the ones in the middle there if I just circle it. And that basically allows you to have a single pane of glass analysis with a durable data sync, which really improves that connectivity, accessibility, and data silo challenge, just as an example. Cool. So, version 3, we’re going to dig into this a little bit. I’m not going to sugarcoat it. Version 3 is a new storage engine. It’s been developed using Rust, hence its internal project name IOx, standing for iron oxide. So, what you’ll get throughout this presentation, InfluxDB has kind of doubled down on the core database functionality, right? Solving the storage challenge. Solving the query performance challenge. Focusing on the scale challenge. Cardinality, writes, things like that, and also, how to play nicely in the ecosystem. So, we’ve got the different editions of InfluxDB here. So, you can see on the left-hand side we’ve got the fully managed editions. Cloud serverless is the elastically scalable SaaS platform. And it’s got a free tier, a really low entry point, and it’s kind of designed for those non-mission-critical use cases and kind of prototyping and hobbyists. And we’ve got cloud dedicated, which is its dedicated infrastructure sibling. So, this is designed for mission-critical production workloads, and as such, has additional security features and flexibility around where you deploy it. For example, extended list of cloud regions and different cloud providers.
BEN CORBETT: 21:02
On the right-hand side, we’ve got the self-hosted editions. So clustered being the sort of self-hosted version of cloud dedicated. You can deploy it on your infrastructure right in your own VPC or on your bare metal servers. And then InfluxDB Edge. Stay tuned in the news. InfluxDB Edge is basically what the evolution of open source is going to be, and it’s the single-node edition of InfluxDB, which will be self-hosted. This is the shopping list of new functionalities which I’m going to rattle through now. So, on the left-hand side, we’ve basically got the piece around scale and time series database variety. So, what we wanted to do was add support for unlimited cardinality. So InfluxDB wasn’t just the metrics database, right? No longer do our industrial IoT customers need to worry about how they’re indexing the data or if they’re storing error codes, events, traces, and not just metrics. So, we’ve added support for unlimited cardinality, so you don’t need different databases for different data. Database customers are obviously never going to be upset by faster queries. So, let’s focus on and improve query performance, but also support for different kinds of queries, as I said. Now we need to support data scientists, business analysts, data engineers. How do we give them access to vast amounts of data, those really heavy-duty analytical queries?
BEN CORBETT: 22:27
Solving the storage challenge. Yep. So, we’ve added a cold storage tier. It’s based on low-cost object storage. This gives you two benefits. One is advanced compression. We’re seeing four to six times better than open source, so less bytes on disk, which is always good. And then secondly, a better unit cost, which is associated only to our dedicated infrastructure editions. So obviously, the price per gigabyte on object storage is orders of magnitude cheaper than SSDs. So, what we’re trying to do is stop customers from moving data out of Influx and just dropping it in object storage for commercial reasons, right? You can leave it in Influx and continue to get value out of it. The fourth one here is really about adding support for SQL. So, we’ve added, yeah, support for the SQL language, the Apache Arrow implementation of SQL, and as such, integrations with SQL-based tools. Because InfluxDB version 3 has switched from a row-based database to a columnar-based database, we’ve been able to adapt InfluxQL to take advantage of the vast performance benefits associated to SQL. So that’s why InfluxQL there is for backwards compatibility with previous versions. And last but not least is just that constant thing I want to drum home, is the focus on standards. Open standards, industrial standards, so we can play nicely in the ecosystem. I’m talking about Parquet as the persistence layer so we can integrate with data warehousing solutions. SQL as the query language, so you can integrate with SQL-based tooling. Let’s focus on these industry standards and contribute towards those projects heavily.
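To make the SQL point concrete, here is a small sketch of the kind of windowed aggregate that SQL-based tooling would issue. It runs against SQLite from the Python standard library purely as a stand-in, so it is runnable anywhere; it is not InfluxDB’s query engine. The table name `machine_metrics`, its columns, and the sample rows are all invented for illustration. InfluxDB’s SQL dialect has its own time-bucketing functions, which are deliberately not shown here.

```python
import sqlite3

# In-memory SQLite database standing in for a SQL-capable time series store.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE machine_metrics (ts INTEGER, machine TEXT, temperature REAL)"
)
conn.executemany(
    "INSERT INTO machine_metrics VALUES (?, ?, ?)",
    [
        (0, "press_1", 70.0),
        (30, "press_1", 72.0),
        (60, "press_1", 75.0),
        (0, "press_2", 65.0),
        (90, "press_2", 67.0),
    ],
)

# Average temperature per machine per 60-second window.
# Integer division floors each timestamp (in seconds) to its window start.
QUERY = """
SELECT machine,
       (ts / 60) * 60   AS window_start,
       AVG(temperature) AS avg_temperature
FROM machine_metrics
GROUP BY machine, (ts / 60) * 60
ORDER BY machine, window_start
"""

results = conn.execute(QUERY).fetchall()
for row in results:
    print(row)
```

The shape of the query (group by a tag-like column and a time window, then aggregate a field) is what a dashboard or BI tool would generate; only the bucketing expression would change against a real time series engine.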
BEN CORBETT: 24:05
Just to drum home really quickly again, we can rattle through them. Unlimited cardinality. From an IoT perspective, this means your metadata and your tagging of the data, that you don’t have to have any limits on the variation of values within those columns. API integrations. Really a nod towards that development agility and interoperability point. Client libraries. Integrations with tooling and a best-in-class API will speed up your development and ability to adapt. Interoperability with data tools. This is one of my favorite features of InfluxDB version 3, so stay with me for a minute. So, this is a rudimentary data flow diagram of version 3. Three things to point out. The first one is the logos, right? So, what we’ve got here is industrial standards. Parquet, data scientists love it. You’ve got to give them a Parquet file a day, otherwise, they will go hungry as we all know. Apache Arrow is the in-memory format of Parquet, right? So really speeding up that translation of data from the cold tier to the hot tier, the cached hot tier, and DataFusion being the kind of query engine which is designed to work with Arrow. Arrow, DataFusion: these are open-source projects that we’ve contributed towards heavily and will continue to, and they seem to be industry standards, basically, adopted by lots of solutions in the space.
BEN CORBETT: 25:29
The second thing is just how hot and cold work together. I want you to think about the relationship between the hot and cold tier as much more dynamic. Basically, it means that as live data is written in, it will reside in the cache in Apache Arrow ready to be queried. And so that’s giving you really fast queries on live data, your monitoring use case, your alerting use case. After it’s compacted down into Parquet, if you query the data within Parquet, it’s refreshed in the cache. It’s refreshed in Arrow, therefore, if you’re performing queries, you’re typically working on that data set and you’re going to want to query it a few more times, which is why that kind of hot tier works. Last but not least is this backdoor. So, what we’re able to do is give you guys read-only access to the Parquet files sat in object storage, which are only ever kind of 5 to 15 minutes behind the leading edge of data. So, this has been built in its first iteration for Apache Iceberg. So, we’ve got customers using solutions like Google BigQuery, AWS Athena, Snowflake, Dremio, Trino, they’re all able to now query data within Influx. So that should be able to remove data duplication costs and ETL costs associated with moving data out into those solutions so it can be used by your organization. We already covered this, right? So, there’s a hot tier focusing on the leading edge of data. So really, really giving you benefits for that real-time data that’s fresh.
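The hot and cold interplay described at 25:29 can be pictured with a toy model. This is only a sketch of the idea, not InfluxDB’s implementation: plain Python dictionaries stand in for the Arrow cache and the Parquet files in object storage, and the `TieredStore` class and its method names are hypothetical.

```python
from typing import Dict, List, Tuple

Point = Tuple[int, float]  # (timestamp, value)


class TieredStore:
    """Toy model of a hot tier (live data plus read cache) over a cold tier."""

    def __init__(self) -> None:
        self.live: Dict[str, List[Point]] = {}   # recent writes, not yet compacted
        self.cold: Dict[str, List[Point]] = {}   # stand-in for compacted files
        self.cache: Dict[str, List[Point]] = {}  # cold data pulled back into memory

    def write(self, series: str, point: Point) -> None:
        # Live data lands in memory and is queryable straight away.
        self.live.setdefault(series, []).append(point)

    def compact(self, series: str) -> None:
        # Compaction moves live data to the cold tier; any cached copy is now stale.
        points = self.live.pop(series, [])
        if points:
            self.cold.setdefault(series, []).extend(points)
            self.cache.pop(series, None)

    def query(self, series: str) -> Tuple[List[Point], bool]:
        # The caller never names a tier. Returns the points and whether this
        # query had to read from the cold tier.
        cold_read = False
        if series in self.cold and series not in self.cache:
            self.cache[series] = list(self.cold[series])  # refresh the cache
            cold_read = True
        points = self.cache.get(series, []) + self.live.get(series, [])
        return sorted(points), cold_read
```

In this model, the first query after compaction reports a cold read and repeated queries over the same series do not, which is the "you’re going to want to query it a few more times" behavior in miniature.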
BEN CORBETT: 26:59
The cold tier, giving you cost benefits, lowering storage costs for keeping data within Influx for longer. This is a really key point, and this is around basically your Influx doesn’t need to know where the data sits, you don’t need to know where the data sits. So, you just fire in your query, and whether all or some of your data is in cold storage, let Influx do the rest and work with the cache and it will respond with your data. There won’t be any cumbersome moving data around from hot to cold to be accessed by different solutions. It’s all automatic. Schema on write, right? Really important for industrial IoT use cases, that flexibility of schema improves your developer productivity is what we find. Edge Data Replication, so I mentioned I’d talk about this a little bit. This basically is a feature which allows a durable data sync from InfluxDB to another InfluxDB. It’s been designed with industrial IoT use cases in mind, and really focuses on customers that have an Edge or local storage requirement and have connectivity issues. So basically, what happens is you set up a replication stream, so it means that all of the data that lands in your Edge instance and in a specific bucket will be replicated to the destination bucket, in this instance a kind of global hub, in a manner that’s durable. So basically, that means if the connection’s poor and that data sync fails, that payload is buffered on disk, and then the buffer will be flushed when the connection is reestablished. So, this is really, really popular for customers who have that kind of substation or regional hub and need a global single pane of glass on top of it to hang off their reporting and analysis. So, check that out. Yeah, really, really cool feature.
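The buffer-then-flush behavior described for Edge Data Replication is a store-and-forward pattern, which can be sketched as follows. This is an illustration of the pattern only, not the feature’s implementation: an in-memory deque stands in for the on-disk buffer, and `StoreAndForward` and `FlakyHub` are hypothetical names made up for the example.

```python
from collections import deque
from typing import Callable, Deque, List


class StoreAndForward:
    """Buffers payloads while the destination is unreachable, then replays them in order."""

    def __init__(self, send: Callable[[str], None]) -> None:
        self._send = send
        self._buffer: Deque[str] = deque()  # stand-in for a durable on-disk queue

    def write(self, payload: str) -> None:
        # Every payload is queued first, then we try to drain the queue.
        self._buffer.append(payload)
        self.flush()

    def flush(self) -> int:
        # Send the oldest payload first; stop at the first failure so order is kept.
        sent = 0
        while self._buffer:
            try:
                self._send(self._buffer[0])
            except ConnectionError:
                break
            self._buffer.popleft()
            sent += 1
        return sent

    @property
    def pending(self) -> int:
        return len(self._buffer)


class FlakyHub:
    """Pretend central instance whose connection can be switched off."""

    def __init__(self) -> None:
        self.online = True
        self.received: List[str] = []

    def receive(self, payload: str) -> None:
        if not self.online:
            raise ConnectionError("hub unreachable")
        self.received.append(payload)
```

A payload is only removed from the buffer after the send succeeds, so an outage delays delivery rather than losing data, and the hub still sees points in the order the edge wrote them.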
BEN CORBETT: 28:51
Obviously can’t not talk about Telegraf. Telegraf, it’s becoming an industry standard, not just for InfluxDB, and it’s got hundreds of plugins which relate specifically to the historian kind of use case. So OPC UA is probably one of our most popular ones for historian customers. But MQTT, Modbus, Siemens S7, all of these plugins give you the ability to get data in Influx really simply. So, this slide is constantly updating. It took me ages to update it over the last couple of days just to collect all of the information. And I’m sure I’ve missed some out, but these are probably the headlines, right? So, this gives you an idea of our partnerships and integrations and where they sit in the stack. So, you can see at the bottom you’ve got all of your devices and assets, SCADA systems, PLCs, factory networks, devices, all of that. Associated middleware, which plays really nicely with InfluxDB. Obviously Telegraf. But to name a few, Kepware, HiveMQ, HighByte, Apache NiFi. InfluxDB obviously has the ability to integrate with those, and they have plugins for InfluxDB that can be deployed via Edge in your own data center or on-prem substation or in the cloud. And then obviously we have this suite of visualization kind of applications which can sit on top of InfluxDB, Grafana being probably the most popular one, but Seeq is also a really popular one for our historian customers. Tableau, Factry, Clarify, Apache Superset, to name a few others.
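Whichever collection plugin is used, what Telegraf sends to InfluxDB is line protocol, InfluxDB’s text write format. The talk does not show it, so the following is an illustrative sketch only: `to_line_protocol` is a hypothetical helper, not part of Telegraf or any InfluxDB client library, and the measurement, tag, and field names in the usage example are made up. It covers the common escaping rules but is not a complete implementation of the format.

```python
from typing import Dict, Union

FieldValue = Union[float, int, bool, str]


def _escape(text: str, chars: str) -> str:
    # Backslash-escape each character that is special in this position.
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def _format_field(value: FieldValue) -> str:
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"  # integer fields carry an "i" suffix
    if isinstance(value, float):
        return repr(value)
    # String fields are double-quoted, with backslashes and quotes escaped.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_line_protocol(
    measurement: str,
    tags: Dict[str, str],
    fields: Dict[str, FieldValue],
    timestamp_ns: int,
) -> str:
    """Build one line: measurement,tag=value field=value timestamp."""
    if not fields:
        raise ValueError("line protocol requires at least one field")
    head = _escape(measurement, ", ")
    for key in sorted(tags):  # tags sorted by key
        head += f",{_escape(key, ',= ')}={_escape(tags[key], ',= ')}"
    body = ",".join(
        f"{_escape(key, ',= ')}={_format_field(val)}" for key, val in fields.items()
    )
    return f"{head} {body} {timestamp_ns}"


line = to_line_protocol(
    "machine",
    {"site": "plant 1", "line": "A1"},
    {"temperature": 71.5, "cycles": 3, "running": True, "state": "ok"},
    1700000000000000000,
)
print(line)
```

Tags carry the metadata (site, line, asset) and fields carry the measured values, which is the split the unlimited-cardinality discussion refers to.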
BEN CORBETT: 30:22
On the right-hand side, we have our kind of more platform integrations. So, these are kind of solutions that have the ability to write to and query Influx. Maybe Influx forms the persistence layer of that solution, so it’s not just a case of it being the visualization layer or the middleware layer, so I’ve kind of bundled them up on the right-hand side. We’ve got IO-Base, the cloud-based historian developed by Terega. PTC ThingWorx, I’m sure a lot of you will be familiar with that. Ignition, Akenza, and then Bosch ctrlX as well, and many more that I’m sure I will have missed. I’m going to talk a little bit about Terega. See how I’m doing for time. Okay so far. So Terega is one of our customers turned partners. So Terega Solutions are the creators of digital solutions used to improve energy efficiencies and to address decarbonization challenges. They are the creators of IO-Base. Very, very relevant for this discussion. So, IO-Base is a cloud-based IoT historian powered by InfluxDB. Any InfluxDB. InfluxDB on-prem, InfluxDB in the cloud, it sits on top of it. Creating this cloud-based digital twin for their customers allows them to collect data from all of the production sites, view it in real-time from anywhere at any time, and also fills in the gaps of the key historian functionality that you don’t get with Influx that traditional historians have. The ability to run scheduled algorithms, unified asset model collection techniques, or process diagrams, all of that wonderful historian functionality.
BEN CORBETT: 31:56
So, their first customer was Terega, who has a network of over 5,000 kilometers of gas pipelines within France. They’re aiming to help France to attain carbon neutrality by 2050. And with that goal in mind, Terega created IO-Base to aid in that digital transformation effort of Terega’s kind of data ecosystem. When Terega kind of went through this journey, they looked at a lot of different historians within the market, which didn’t quite meet their requirements, maybe even just for cost reasons, let’s say. What they decided to do was basically build their own and then sell it onwards to other customers who are looking to replace historians. So, this is kind of one of the key problems that they found in Terega that they wanted to avoid. What they have is a closed system with data silos at each plant or substation, wherever the kind of historian sat. You then have all of these organizational work streams that need to access the data to work on it from all of the plants. And going that way against the firewall not only represents a security issue, but it creates data silos, right? You’ve got data consistency issues across all of these different solutions and bandwidth issues. It’s extremely messy and doesn’t foster that kind of agile development and kind of high speed of innovation. So, what they wanted to do was create the cloud-based historian, which is basically that single source of truth. They have done it using a really clever piece of hardware, which is the Indabox, which you can see really small there, and which I’ll talk about on the next slide. But basically, through creating this single source of truth in a secure manner, they’ve been able to have access to the best-in-class solutions of the cloud and really speed up that kind of digital transformation effort, right?
BEN CORBETT: 33:56
Those operations and that agility, data access has been greatly simplified. So just a little nod towards Indabox. That’s what you can see in the bottom left of the screen there. So that is their patented data diode. This is a really clever piece of hardware which only allows the one-way flow of data. So, for traditional historian customers, which are really, let’s say, high on security, firewalls, one-way communication, this is a really secure manner to get that data out of your plant, and it integrates with all of the main industrial standards that you would be familiar with. This isn’t necessary in order to use IO-Base, so you can use Indabox on its own or IO-Base on its own. But obviously, if you use both of these together, what you’re looking at is creating centralized data stores. So, you’ve got that master data in one location, you’ve not got data silos, ease of data sharing, and it will tick all your security boxes as well, and you can see how InfluxDB is kind of powering that platform.
BEN CORBETT: 35:05
The last two are anonymized architectures, but what I wanted to do is just go through two of my customers personally within Europe that I work with. So, this is a FTSE 500 energy company that we work with, and so this is their architecture that we have set up for them. So, what they’re doing is, you can see they’ve got a load of different SCADA systems within a single plant. Everything in this red box is on a gateway, basically, which is assigned to a particular plant, and they’re collecting via those protocols, leveraging Telegraf, and it’s going into a local InfluxDB. They’re then leveraging the EDR functionality, Edge Data Replication that I showed you before, to write into their self-hosted InfluxDB in their AWS VPC. So that’s replicating the data out of the plant in a manner that is durable and guaranteed and buffered. And on top of that, now the playground’s open, and they can use whichever integrations and tools that they want on top of Influx. I think they’ve chosen to work with Grafana, and they’ve also got a monitoring platform, which is where all of the events of the edge instance and the centralized instance are going. So, they’re looking at creating these kinds of regional hubs for all of their different plants.
BEN CORBETT: 36:26
Kind of a relatively similar setup. This is a FTSE 500 aerospace company, so it’s in the manufacturing space, and they’ve got this kind of digital transformation effort around creating a smart factory. And what they needed to do was pivot away from a traditional data historian so that they could use modern development techniques against InfluxDB. So what they’ve done is, on top of their OPC UA server, they set up an MQTT broker, and they’ve got Telegraf subscribing to that, which then drops the data within InfluxDB where they’ve got a Grafana server, which is used for just the kind of traditional visualization and dashboarding, and they’ll continue to build on top of those servers for their kind of more predictive analytics and maintenance routines. So that was me kind of rattling through really quickly. I guess, a little bit of housekeeping at the end. We don’t ask you to take our word for it. Try it out. Whether you want to sign up for InfluxDB today, download open source, go onto the cloud platform, and try it out. You can do all of those, or you can come through and request a POC and work with someone like myself to see how we can meet your requirements. You can also subscribe via the cloud marketplace if you have credits to use. Go ahead and do that and check out all of our fantastic learning resources that we have online. Obviously, as an open-source-first company, we have to have fantastic documentation because we can’t possibly support 800,000 instances of open-source. So, I’m sure you’ll find our Slack channel, InfluxDB University and documentation amazing. And that brings me on to just this. So, a couple of resources, which we’ll also send out to you guys after this. So, we’ve got the community forum, Slack channel, our documentation online, and the university resources. Awesome. I think that’s everything from me. Caitlin, before we move over onto the Q&A, anything else to add?
CAITLIN CROFT: 38:23
Yeah. This slide just has a few additional resources that might be interesting to you guys. So, another webinar and just learn more about how InfluxDB can help on saving 96% on data storage costs. Excuse me. All right. Let’s jump into the questions. You ready, Ben?
BEN CORBETT: 38:49
Let’s do it.
CAITLIN CROFT: 38:50
All right. So, the first one is “I’ve been using InfluxDB for three years for a historian application. We are using it for plant data. One feature/option we miss is connectivity, using ODBC. Having ODBC would allow us to interface InfluxDB to legacy systems, i.e. Microsoft. Unfortunately, currently, we cannot find any driver to have connectivity with the Microsoft ecosystem, Microsoft SQL, Microsoft Reporting Service, etc. So, my question is, what are my options going forward because legacy stack is not going to be upgraded anytime soon?”
BEN CORBETT: 39:31
Yeah. Yeah. So yeah, thanks for your question. So, we’ve got a couple of options. The front door to Influx really is the API, right? So, if you think about the gates of InfluxDB, line protocol needs to land on the API. The usual method to do that is Telegraf, but obviously, if there’s not a Telegraf plugin suitable for ODBC, then we’re looking towards different options, assuming there’s no third-party plugin, which there probably won’t be in your case. We’re really looking at maybe using the client libraries to develop your own kind of data set, your own client, your own data capture and writing technique. So, there might be some code examples in the community and online, but I would always encourage you to kind of create a thread in the Slack channel and drop that question in there and see if anyone else has already solved this problem. We do also have people from InfluxData who maintain those channels and take a look at them. But if it’s something that’s going to be a little bit specific and not something we already have a Telegraf plugin for, it would be more of a case of, okay, we need to go for the kind of belt and braces approach of using the client libraries to develop your own client. When it does come to ODBC and JDBC in the new version of Influx, we do have the ability for you to integrate with Influx using those drivers from a read-only perspective, but that doesn’t really solve your problem with regards to accessing that data store and writing it to Influx. But basically, yeah, you can query using those drivers and those techniques in the latest version of Influx. Thanks, Caitlin.
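[Editor's sketch] The "develop your own client" route described above comes down to turning rows read from the legacy store into line protocol and landing them on the write API. The following is a minimal illustration of the formatting step only, using just the Python standard library. The function name `to_line_protocol` and the sample measurement, tag, and field names are hypothetical, and how the rows are fetched from the legacy system is left out on purpose.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def _escape(text, chars):
    """Backslash-escape each character in `chars` that appears in `text`."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def _format_field(value):
    """Render one field value using line protocol type conventions."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"  # integers carry an "i" suffix
    if isinstance(value, float):
        return repr(value)
    # Anything else is written as a double-quoted string field.
    return '"' + _escape(str(value), '\\"') + '"'


def to_line_protocol(measurement, tags, fields, ts):
    """Build one line: measurement,tag=... field=... timestamp_ns.

    `ts` must be a timezone-aware datetime (hypothetical convention for
    this sketch); it is converted to integer nanoseconds since the epoch.
    """
    if not fields:
        raise ValueError("line protocol needs at least one field")
    delta = ts - EPOCH
    ts_ns = (delta.days * 86_400 + delta.seconds) * 10**9 + delta.microseconds * 1_000
    head = _escape(measurement, ", ")
    for key in sorted(tags):  # sorted tag keys give a stable series key
        head += "," + _escape(key, ",= ") + "=" + _escape(str(tags[key]), ",= ")
    body = ",".join(
        _escape(key, ",= ") + "=" + _format_field(value)
        for key, value in fields.items()
    )
    return f"{head} {body} {ts_ns}"
```

Each returned string is one point. A batch of them, joined with newlines, is what a client library write call or a direct request to the write API would carry.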
CAITLIN CROFT: 41:26
Next question is someone’s asking about Flux as a query language.
BEN CORBETT: 41:28
Yeah. Was there a specific question, or should I just give a—?
CAITLIN CROFT: 41:31
I would say just kind of give a general overview of what we’re telling people about Flux.
BEN CORBETT: 41:37
Yeah, sounds good. So, I guess really speaking quite honestly, you might have noticed on the version 3 slide that Flux hasn’t been carried forward into version 3, and there are two reasons for that. For customers that leverage it today on their existing solutions, don’t worry, there’s no scheduled end-of-life yet, and it will continue to be supported on your platform. There’s nothing on the roadmap, but basically, the reason it hasn’t been carried forward into version 3 is two-pronged. One is the switch from the row-based database. InfluxDB, with TSI and TSM, was really concerned with the amount of unique series or primary keys you had across your data set. Flux was really tightly coupled to that approach, that type of database. Now we’ve shifted to a columnar database. We’re less concerned with series and we’re more concerned with columns, and SQL was able to be really easily adopted based on that columnar approach. InfluxQL, due to its similarities with SQL, was also able to be adopted, but Flux hasn’t been able to be adapted basically, so I think that that’s the technical reason. And then the other reason was that Flux actually represented a little bit of a learning curve to the vast majority of our customers and commercial customers as well. So, a lot of our customers just stayed on InfluxQL, which I think was one of the reasons we were looking to really prioritize that in terms of backwards compatibility. So yeah, rest assured, Flux, if you’re using it today, feel free to continue to use it. But if you do want to take advantage of InfluxDB version 3, it would be a case of translating that Flux into either InfluxQL or SQL. Thanks, Caitlin.
CAITLIN CROFT: 43:29
Can you give more details on the use case for Edge Data Replication? Sounds interesting.
BEN CORBETT: 43:38
Yeah, so I think maybe in the follow-up to this we can send out our one-pager as well. But basically, it was developed for customers that had intermittent connectivity challenges. So, I think we’ve got a couple of customers that are using it with marine vessels, maybe mobile mining equipment, things like this, and they don’t have a great connectivity line out of their mobile assets or fixed assets or plant. So, what we wanted to do was basically give store-and-forward, or basically a disk-based buffer capability, for you to be able to rest assured that that data is stored on disk no matter what the connectivity challenges are and that it will be flushed and replicated to InfluxDB when that connection is re-established. It basically means that you have that kind of local operational view and ability to dive into data at the Edge, but then you also have the collated kind of single pane of glass. One thing that we do find with customers that adopt it is that it really allows you to keep the Edge quite lightweight. You don’t need a lot of storage there, because let’s say you only need a buffer for a few days, so you can have, let’s say, worst case scenario, a week’s worth of disk space, and then maybe you can do some downsampling or processing or enriching at the Edge and only replicate the processed data or down-sampled data. So, you kind of solve that bandwidth challenge coming up to the hub. So, let’s say you’re storing millisecond precision, but you only want to keep a one-minute aggregate for the hub, for reporting. Edge Data Replication works really nicely with that use case.
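[Editor's sketch] The downsampling idea at the end of that answer (millisecond data at the Edge, a one-minute aggregate for the hub) can be illustrated independently of any InfluxDB feature. The sketch below is plain Python under stated assumptions: points are `(timestamp_ns, value)` pairs, the aggregate is a mean, and the name `downsample_mean` is hypothetical. In a real deployment this step would more likely be a task or query at the Edge than hand-written code.

```python
from collections import defaultdict
from statistics import fmean

MINUTE_NS = 60 * 10**9  # one minute expressed in nanoseconds


def downsample_mean(points, window_ns=MINUTE_NS):
    """Group (timestamp_ns, value) pairs into fixed windows and average them.

    Each output point is stamped with the start of its window, so the hub
    only receives one value per window instead of every raw reading.
    Input order does not matter.
    """
    windows = defaultdict(list)
    for ts_ns, value in points:
        window_start = ts_ns - ts_ns % window_ns
        windows[window_start].append(value)
    return [(start, fmean(values)) for start, values in sorted(windows.items())]
```

Only the list returned here would need to be replicated upstream, which is where the bandwidth saving described above comes from.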
CAITLIN CROFT: 45:20
We also had that set up last year at AWS re:Invent, so another really simple example of Edge Data Replication. We had a Keurig coffee machine at AWS re:Invent last year or two years ago now, I should say. And so it was really cool. We got to show off InfluxDB, showing how many coffees we were making. And I think it was just a Raspberry Pi. Do you remember, Ben, what was connected to the—?
BEN CORBETT: 45:45
That was it. Yeah, it was a Raspberry Pi that was looking at voltages and currents of the coffee machine. We did some processing to identify when coffees were being brewed, how many times it was being brewed, temperatures, stuff like that, and that data was being—that processed data was being replicated to the cloud. So, I guess in theory, if you had thousands of coffee machines at different customer sites, you could then have really good information about the kind of global view of that fleet of coffee machines to kind of tie it back into this use case.
CAITLIN CROFT: 46:17
Yeah, I just think it’s kind of cool showing kind of the range of Edge Data Replication, whether it’s just a little coffee machine or a giant industrial site.
BEN CORBETT: 46:29
Yeah. And just on that as well, Edge Data Replication is very popular, and in the slide that I showed with the different editions of Influx, for InfluxDB Edge it’s going to be critical to that offering as well, right? InfluxDB Edge is designed to be dealing with the leading edge of data, so the ability to be able to replicate it to a centralized instance in a manner that’s durable is a really top requirement from all of our industrial IoT customers. So yeah, it’s definitely kind of strategically the direction that Influx is going. It’s nice to see some industrial IoT-specific functionality. Yeah.
CAITLIN CROFT: 47:10
Are there any use cases for the financial or banking industry that you can shed some light on?
BEN CORBETT: 47:17
Yeah, maybe we can do a webinar on that too. Yeah. So, I work with a variety of customers around FX data, trading data. So InfluxDB was actually initially developed to work with trading data. So, our CTO and founder Paul Dix was working on Wall Street, and he realized that all of the different, I guess, trading platforms or people that work with that data had their own inbuilt solutions to do the analysis. So, it’s the reason the TICK Stack’s called the TICK Stack. It’s kind of a nod towards financial tick data. But yeah, I worked with a customer, I think Q3 last year, who’s now in production, who’s basically tracking bids and spreads within Influx. It’s just time series data over time, right? And usually, the key with financial data is they have quite intense query requirements. So, these customers have backtesting query requirements. So basically, they need to be able to, yeah, do some backtesting, kind of extensive analytical queries, which will work really nicely against the new version of Influx. So basically, yes, we do have some case studies for you, and please contact sales, reach out to Influx, and we’d love to walk you through what some of our other customers are up to.
CAITLIN CROFT: 48:32
Well, what is the retention policy for InfluxDB open source?
BEN CORBETT: 48:38
Whatever you want to put on it. It can be infinite. So, you configure the retention policy per bucket on open source. So yeah, whether you want to have it for a month or a year is entirely up to you. It just relates to how much storage you’re going to need for that instance, right? It’s exactly the same for InfluxDB version 3. The retention policy is applied to the environment as a whole and the relationship between the cached data in the hot tier and the object storage cold tier is dynamic. It can be configurable, but it’s not fully configurable. Your retention policy will apply across the whole period. So yeah, whether you want to store data forever or you want to store just the leading edge of data, you can change those retention policies to apply on the database or bucket level.
CAITLIN CROFT: 49:29
Cool. So, going back to Edge Data Replication, does Indabox aid or accomplish Edge Data Replication? So, the Terega use case.
BEN CORBETT: 49:41
Good question. I believe it does have a buffer, but it’s not a disk-based buffer that is going to be as big as EDR’s. I think the purpose of the Indabox isn’t to deal with sites with poor connectivity. It’s to give you a secure, one-way flow of data out of the OT kind of network firewall. I think the buffer is just a little bit of durability, but it’s not like an EDR, which is specifically designed for poor connectivity. So no, I don’t believe it does give that.
CAITLIN CROFT: 50:24
If I have never used a time series database like InfluxDB, what would be the normal learning curve time to get a deeper understanding of InfluxDB for better use in professional projects? Expect the learner to have a couple years’ experience from working with their traditional database types like relational or NoSQL.
BEN CORBETT: 50:45
Yeah, it’s a great question. So, I know because I’ve gone through it, right? I’ve gone through it when I was first—I used to develop IoT platforms using Postgres back in the day, and then I was using document databases, NoSQL databases, and then I discovered Influx. So, I would say for someone that has a basic understanding of database management systems, InfluxDB has been the fastest, the shortest learning curve of any technology I’ve ever used. One of our key points is that we want to focus on developer productivity, and what you’ll see on our website we call time to awesome. So, ways in which we focus on that is meaning that we try and make it that you don’t have to be a super, super amazing developer in order to work with InfluxDB. We’ve got the community, amazing documentation, plugins, and client libraries which give you neat sets of functions to be able to work with the database, and obviously, administrator consoles and stuff like that. So, the barrier to entry commercially is zero. It’s open source and there’s a free tier of the cloud, and technically it’s very low. We like to say that we’ve moved developing MVPs or POCs from months into hours and days. You can really spin up an InfluxDB gathering metrics from your laptop in a matter of minutes. So, I’d encourage you to go online and take a look at some of our tutorials. So, I would say the learning curve is super, super low.
CAITLIN CROFT: 52:20
Yeah, and I would also say, definitely check out InfluxDB University. There are tons of instructor-led as well as self-paced training, and it’s all completely free. And if you get stumped, be sure to check out the forums and the Community Slack Workspace. The way that I always say it is if you have a question, there’s probably someone else out there who’s a little bit more shy, who has the same exact question. So don’t be shy. There are always people in there answering questions, asking questions. Ben, are you in the Community Slack? I believe you are.
BEN CORBETT: 52:55
Yeah, yeah.
CAITLIN CROFT: 52:56
Yeah. So also, I’m going to totally put Ben on the spot. If you’re brand new to InfluxDB and you get stumped, go bug Ben on the Community Slack. I’m sure he’d be happy to help, or he definitely knows someone who can help out. And we also have amazing DevRels who are there to answer questions as well. So even though it shouldn’t take too long, if you have experience with databases, don’t fret if it does, and we’d love to help you out.
BEN CORBETT: 53:22
Yeah, and there’s no pressure, right? It’s open source. Download it, kick the tires, play around with it, maybe spin up a cloud serverless. It’s not these big, scary, non-open-source technologies that you need to get a license towards to test. The barrier to entry is very, very low, and it’s super simple to work with. So yeah, I’m sure you’ll move quite fast.
CAITLIN CROFT: 53:41
And yeah, if you’re still getting familiar with it, a really easy use case is just download it onto your laptop and start playing around with it, monitoring your own laptop, and you can kind of get familiar with the interface and how it works, so.
BEN CORBETT: 53:57
100%. Yeah.
CAITLIN CROFT: 53:59
All right, let’s see. If time series data that is being forwarded gets to InfluxDB out of order, is that synchronized? How do you keep local instances on-site synchronized with the cloud?
BEN CORBETT: 54:13
Yeah, good question. So out-of-order writes, it’s not a problem for InfluxDB. So out-of-order or historic writes or backfilling, basically, meaning that data isn’t arriving in chronological order. Basically, as long as that data has the right timestamp associated to it, so the timestamp at which it was generated, it doesn’t matter in which order it lands on the API. InfluxDB will handle it. InfluxDB does prefer stuff to be in order, because of how the partitions are kept hot, compacted down, and pulled back up to write into. Obviously, in an ideal world, everything would trickle in perfect chronological order, but that ideal case is never the case. So, for example, I’ve got a couple of customers who deal with, let’s say, satellites, and those satellites beam down data every 15 minutes to a ground station. So, an individual satellite’s packet is in order, but all of the packets arrive in random orders, and sometimes they miss a window and things like that. So InfluxDB does have the ability to handle that. So, you don’t really need to worry about how it’s synchronized, as long as you make sure that the timestamp associated to the record is as you want it and is correct.
CAITLIN CROFT: 55:36
How do you buffer over the Edge? Do you use InfluxDB at the cloud—or InfluxDB cloud?
BEN CORBETT: 55:43
Yeah, a couple of ways to do it. Three ways really. Either you can use Edge Data Replication. It’s a specific feature. So, what you do is you go up and you set up a remote connection and then a replication stream. It’s just a couple of lines of InfluxDB CLI that you would apply to the Edge instance. So, you’ll just need the credentials of the hub instance that you would like to replicate to. That can be version 1, that can be version 2, and it can be the cloud, it can be open source, whatever you want. So that’s one way to do it, EDR. The second way is to use the client libraries and do it yourself. So basically, just have InfluxDB at the Edge doing its thing, and then your client would grab some data, try to write it. If it fails, it handles the error appropriately and your InfluxDB acts as the disk-based buffer anyway. So that’s a way to do it yourself. Some customers use Flux tasks as well. I would encourage customers not to use Flux tasks just because Flux isn’t carried forward to version 3. So, I couldn’t confidently say it’s a future-proof solution, although it would work today.
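[Editor's sketch] The second, do-it-yourself option can be pictured as a small forwarding loop that only moves its bookmark when the hub accepts a batch, so the Edge database keeps acting as the buffer while the link is down. The code below is an illustration under stated assumptions, not InfluxData's implementation: `Forwarder`, `read_since`, and `write_batch` are hypothetical names, and the two callables stand in for whatever client library calls query the Edge instance and write to the hub.

```python
class Forwarder:
    """Forward points from an edge store to a hub, batch by batch.

    read_since(watermark_ns, limit) -> list of (timestamp_ns, line) pairs
        with timestamp_ns > watermark_ns, oldest first (hypothetical).
    write_batch(lines) -> None, raising OSError (for example
        ConnectionError) when the hub cannot be reached (hypothetical).
    """

    def __init__(self, read_since, write_batch, batch_size=500):
        self.read_since = read_since
        self.write_batch = write_batch
        self.batch_size = batch_size
        self.watermark = 0  # newest timestamp known to be on the hub

    def run_once(self):
        """Try to forward one batch; return how many points were sent."""
        batch = self.read_since(self.watermark, self.batch_size)
        if not batch:
            return 0
        try:
            self.write_batch([line for _, line in batch])
        except OSError:
            # Hub unreachable: leave the watermark alone so the same
            # points are retried on the next call.
            return 0
        self.watermark = max(ts for ts, _ in batch)
        return len(batch)
```

Calling `run_once()` on a timer gives a basic retry loop. One limitation of a timestamp watermark is worth noting: a point that reaches the Edge late, with a timestamp older than the watermark, would not be picked up, so a production design would need a different bookmark, such as arrival order.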
CAITLIN CROFT: 56:54
I’m wondering, what is the most performance-efficient method or data model to handle duplicate data points when we want to preserve all old data points as distinct versions. In SQL databases, you can use upsert and on-update triggers to store duplicates in another value history table.
BEN CORBETT: 57:17
Yeah. Okay. So, for this question, I’m going to assume that the duplicates that come in, you don’t mind. InfluxDB has a last-writer-wins mentality, which basically means for any point that comes in, if it has the same tag set and the same timestamp and it’s populating the same field key—so for example, for sensor A, you’re writing in a temperature for a specific time. If that temperature is a certain value, and then after that, at the same time you want to overwrite that value, InfluxDB will allow you to do that. You can overwrite and update values. It’s really important to remember it is just last writer wins. So, I think some database solutions and time series databases are first writer wins. So yeah, we have the ability to handle that. If you wanted to have both versions of the—store both duplicates, I would encourage you to add in an incremental version number as a tag. So, you’d have version 1, version 2 written in, and you’d make sure it wouldn’t overwrite and you could have two records in the database, and you could filter on the different versions.
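[Editor's sketch] The last-writer-wins behaviour and the version-tag workaround described above can be shown with a toy model. This is not InfluxDB code: `LastWriterWinsStore` is a hypothetical in-memory stand-in that identifies a point by measurement, tag set, and timestamp, which is the rule the answer describes, and the tag name `version` is only an example.

```python
class LastWriterWinsStore:
    """Toy model: one row per (measurement, tag set, timestamp)."""

    def __init__(self):
        self.rows = {}

    def write(self, measurement, tags, fields, ts_ns):
        # The tag set is part of the point's identity, so it goes in the key.
        key = (measurement, tuple(sorted(tags.items())), ts_ns)
        # Same key: later field values replace earlier ones.
        self.rows.setdefault(key, {}).update(fields)

    def read(self, measurement, tags, ts_ns):
        key = (measurement, tuple(sorted(tags.items())), ts_ns)
        return self.rows.get(key)


store = LastWriterWinsStore()

# Same sensor, same timestamp, written twice: the second value wins.
store.write("temperature", {"sensor": "A"}, {"value": 21.0}, 1000)
store.write("temperature", {"sensor": "A"}, {"value": 22.0}, 1000)

# Adding a version tag changes the key, so both values are kept.
store.write("temperature", {"sensor": "A", "version": "1"}, {"value": 21.0}, 2000)
store.write("temperature", {"sensor": "A", "version": "2"}, {"value": 22.0}, 2000)
```

After these writes the store holds three rows: one overwritten point at timestamp 1000 and two versioned points at timestamp 2000 that can be filtered by the `version` tag. On series-indexed versions of InfluxDB each new tag value also creates a new series, so the number of versions per point is worth keeping small.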
CAITLIN CROFT: 58:34
Let’s see, someone’s asking, does the Edge version of InfluxDB—is it fully functional for queries on stored data?
BEN CORBETT: 58:44
Yeah. So InfluxDB Edge, the offering isn’t out yet. That is the version 3 version of open source, basically. Check out one of our articles online, which is titled, “The Plan for Open Source,” and it will talk about how open source is splitting out into InfluxDB Edge, InfluxDB Pro, and InfluxDB Community. So, it’s actually not out yet, but those will all have varying pieces of functionality with regards to how they handle data, ability to handle historic queries. As a little bit of a spoiler, how it works is InfluxDB Edge will be a, I think, open-source edition free to use, and it’s just designed for the leading edge of data, so it won’t be optimized for historical queries at all, just the leading edge of data. You then have InfluxDB Community, which is closed source but free, which will be everything Edge is and more, and you will be able to query historic data. So, a single node edition of Influx. And then you’ll have InfluxDB Pro, which is basically—it will have additional features associated to enterprise customers like role-based access control. It will have a little bit of makeshift [inaudible] so you can replicate between instances, but that will be contracted. You will have to pay for that version. So that’s InfluxDB Edge. Throughout this presentation, when I talk about Edge Data Replication, though, you just need InfluxDB version 2.6 or higher. So yeah, that is designed to—it’s a single node, InfluxDB, which is designed to hold data for as long as you want it to. Yeah.
CAITLIN CROFT: 01:00:31
In order to warrant the cybersecurity, do you suggest use of Indabox as the firewall in between the OT and IT networks?
BEN CORBETT: 01:00:43
That is entirely up to your InfoSec department. I would say Indabox is probably the belt and braces approach. It’s literally a data diode, so data like electrons cannot move back the other way. So, it is really a kind of belt and braces secure approach. For many organizations that I work with, they just have an InfoSec team which configure appropriate firewall rules, which is okay. But I think the Indabox represents a really nice way to lower that barrier to entry for solutions like this, for the cloud, for companies which are kind of inherently worried about that accessibility and maybe have more enhanced security requirements, so it kind of reassures them and satisfies them. But also, if you can just get away with configuring the right firewall rules and you’re on that security, you can just go ahead and use that.
CAITLIN CROFT: 01:01:42
What is the best way to do anomaly detection with InfluxDB and Grafana as the alerting system?
BEN CORBETT: 01:01:50
I think we’ve got another webinar on that. I’d have to refer to that, but I do believe there’s a couple of packages that you can use within Influx which work towards anomaly detection. There’s anomaly detection. There’s also forecasting methods. I would just encourage you to take a look at our blogs and documentation and literally type that in and have a look at the article that comes up. I think we’ve got a few tutorials on it.
CAITLIN CROFT: 01:02:15
Cool. Well, thank you everyone for joining today’s webinar. Thank you, Ben. He did an amazing job—
BEN CORBETT: 01:02:21
Thank you.
CAITLIN CROFT: 01:02:22
—presenting and handling all the questions. For those of you who need to drop, thank you for joining. I know we’ve gone over a few minutes, so really appreciate you sticking around. Once again, this webinar has been recorded and will be made available tonight or tomorrow morning. If you have any other follow-up questions, everyone should have my email address. Feel free to email me and I can put you in contact with Ben. And if not, I’ll see you on the next webinar training. Thank you everyone.
BEN CORBETT: 01:02:53
Thanks for your time, everyone.
[/et_pb_toggle]
Ben Corbett
Solutions Engineer, InfluxData
Ben Corbett is a Solutions Engineer at InfluxData. He has been working in the real-time solutions space since graduating with a 1st class Masters from the University of Bath in Mechanical Engineering, putting his love for connecting the digital and physical worlds in practice through developing many IoT platforms and real-time applications across the Energy, Manufacturing, Agriculture, Construction, Smart Building and Electric Vehicle spaces. Since joining InfluxData, Ben has worked with our EMEA and APAC customers to understand and demonstrate the value of InfluxDB as applied to their specific use case or time series challenges.