Announcing InfluxDB Clustered
Session date: Sep 19, 2023 08:00am (Pacific Time)
InfluxData is excited to announce InfluxDB Clustered, the self-managed version of InfluxDB 3.0 with unparalleled flexibility, speed, performance, and scale. The evolution of InfluxDB Enterprise, InfluxDB Clustered is delivered as a collection of Kubernetes-based containers and services, which enables you to run and operate InfluxDB 3.0 where you need it, whether that’s on-premises or in a private cloud environment. With this new enterprise offering, we’re excited to provide our customers with real-time queries, low-cost object storage, unlimited cardinality, and SQL language support – all with improved data access, support, and security! The newest version of InfluxDB was built on Apache Arrow, and using the open source ecosystem and integrations, it allows you to extend the value of your time-stamped data like never before.
Join this webinar to learn more about InfluxDB Clustered and how to manage your large, mission-critical workloads with this highly available database offering!
In this webinar, Balaji Palani and Gunnar Aasen will dive into:
- Key features of the new InfluxDB Clustered solution
- Use cases for using the newest version of the purpose-built time series database
- Live demo
During this 1-hour technical webinar, you’ll also get a chance to ask your questions live.
Watch the Webinar
Watch the webinar “Announcing InfluxDB Clustered” by filling out the form and clicking on the Watch Webinar button on the right. This will open the recording.
Here is an unedited transcript of the webinar "Announcing InfluxDB Clustered". This is provided for those who prefer to read rather than watch the webinar. Please note that the transcript is raw. We apologize for any transcribing errors.
- Caitlin Croft: Director of Marketing, InfluxData
- Balaji Palani: Vice President of Product Marketing, InfluxData
- Gunnar Aasen: Senior Product Manager, InfluxData
Caitlin Croft: 00:00:00.961 Hello, everyone, and welcome to today’s webinar. My name is Caitlin, and I’m joined today by Balaji and Gunnar, and we are very excited to be here to talk to you about InfluxDB Clustered. This webinar is being recorded. The recording and the slides will be made available by tomorrow morning. Find us in the community. Find us on the community Slack channel. We’re more than happy to help you guys out, answer questions, and all that sort of good stuff, even well after this webinar. All right. I think without further ado, like I said, you probably don’t want to hear me talk anymore. I’m going to hand things off to Balaji and Gunnar.
Balaji Palani: 00:00:40.617 Thank you, Caitlin. Hello, everyone. Good morning, good afternoon, good evening. I know a lot of you are joining from all across the world, from Australia, Europe, India, and other places. So thank you for being here. My name is Balaji Palani. I'm going to be talking, along with my colleague Gunnar, of course, about InfluxDB Clustered. We are super excited to launch this. Before I get into things: I lead product marketing at InfluxData. I have been at the company for about five years now; this is the start of my sixth year, and I'm super excited to be here. I'm in Marketing now, of course, but for the previous four years I was in Product, where I launched Cloud 2.0. You're probably familiar with that. And I'm super excited to be here talking about Clustered. So Gunnar, maybe you want to take 30 seconds introducing yourself so people know who you are.
Gunnar Aasen: 00:01:42.417 Yeah. Hi, everyone. I'm Gunnar. I'm a product manager here at InfluxData, working closely on the Clustered product launch, as well as a lot of other 3.0 bits and pieces.
Balaji Palani: 00:01:54.712 Great. Yeah. Thank you. And Gunnar has been here for, what, seven years or eight years now? Yeah. Awesome.
Gunnar Aasen: 00:02:03.536 I’ll just say it’s a long time. [laughter]
Balaji Palani: 00:02:05.717 Yeah. Yeah. No, good to have you, Gunnar. All right. So today, I know we talked about InfluxDB 3.0 before, when we launched Cloud Dedicated. I just want to take a few minutes, a few slides, walking through 3.0 again, reintroducing it. Why are we excited about 3.0? Maybe to give you a little bit of a sneak peek into the architecture. Gunnar will then take over, and he's going to talk to you about Clustered, specifically: what it is, who it is for, in what format, how it's going to be installed and deployed, and how it's going to be used. He also has a short demo so you can see it in action, hopefully later on today after I'm done, and then we'll wrap it up. So that said, let me talk about InfluxDB 3. Before I get started with my content, I just want to put this out there. I believe this is [inaudible] time series data is foundational to most applications and services. A few years ago, time series data used to be treated like, "Hey, it's just another sort of data. We'll just put it into a relational database or maybe a search engine" or whatever. But time series data is so unique in its characteristics (it changes over time), and there were so many other issues, that we saw normal databases wouldn't really work with this data. Time series data can be found anywhere. If you're monitoring, or if you're resolving or trying to troubleshoot incidents, you're looking at log files and tracing data. If you're in IoT and you're collecting data every few seconds, all of those are time series data.
Balaji Palani: 00:03:58.641 Essentially, time series data is a series of data that's timestamped, because it changes over time, right? So, with that foundation out of the way, here are some of the use cases where you will find time series data. The first is a metrics data lake for monitoring. In other words, you're building your own custom monitoring solution because you're not getting what you need from Datadog or other tools you can buy off the shelf, so you want to build it yourself. You want to bring metrics and logs together. That's the metrics data lake, specifically for monitoring and observability. The second use case is real-time analytics for IoT. Maybe you have thousands or even millions of devices or sensors out there, you're collecting data from them, and you're trying to analyze it in real time, fire off alerts, or automate something. Or, in the third bucket, we have several customers building applications and platforms: an IT application, maybe a log analytics platform, or tracing as a service, where you're collecting this data on behalf of your customers, storing it in a backend database, and then building a mobile or web application to give them access to that data, or to do anomaly detection and things like that. So these are custom applications that you would build for your customers, and the first two are really monitoring/observability and IoT. Again, this is a very high-level view of the use cases.
Balaji Palani: 00:05:47.699 Commonly, we find these challenges with managing time series. Whether you're new to time series or have already used it, you will see that there's massive scale. When you build out, you're looking at millions of data points incoming every second. You have to handle that, and you have to analyze it in real time as it comes in. And not only that, you may have to fire off alerts or do some sort of anomaly detection or machine learning, so you have to take real-time action on it. And data cardinality is another key challenge you find specifically in time series applications. Because it's time series data, it changes over time, especially if you're trying to, for example, connect metrics with traces or logs. You're trying to collect as much context as possible, and you collect that context using key-value pairs, or tags, as we call them in the time series sense. But when you collect them, you're essentially building indexes, and your data access pattern can cause your queries to be slower, or your ingest rate to be slower, and so on. It particularly impacts performance: as you grow your data cardinality, you'll find it becomes harder and harder to manage from a time series perspective. All of these challenges are very well known, and InfluxDB is pretty popular. We have great products out there already, 1.x, which can handle these. But with 3.0, we have managed to bring the performance to really massive scale.
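To make the cardinality challenge concrete, here is a small illustrative sketch (not InfluxDB code, and the tag names are hypothetical): series cardinality is roughly the product of the distinct values of each tag, which is why adding one high-churn tag like a container ID multiplies the series count.

```python
# Hypothetical tag sets for a fleet-monitoring workload.
# Each unique combination of tag values defines one series.
tags = {
    "host": [f"host-{i}" for i in range(100)],        # 100 hosts
    "region": ["us-east", "us-west", "eu-central"],   # 3 regions
    "container_id": [f"c-{i}" for i in range(1000)],  # 1,000 ephemeral containers
}

# Series cardinality is the product of the distinct values per tag.
cardinality = 1
for values in tags.values():
    cardinality *= len(values)

print(cardinality)  # 100 * 3 * 1000 = 300,000 series
```

Index-based engines pay for every one of those series; removing the index dependency is what "eliminating cardinality as a concern" refers to in 3.0.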
Balaji Palani: 00:07:41.163 For example, on the cardinality front, we have eliminated cardinality as a concern. I'll talk more about that in a couple of slides, but essentially what that means is you can collect as many tags as you want, and it's optimized for reads and writes. Also, 3.0 is a columnar database. With the technologies we're using, hot data (live, streaming data) is now cached in in-memory caches and available for your queries to be answered in real time. This unlocks so many opportunities for anomaly detection, real-time alerting, and things like that. It pushes InfluxDB beyond, "Hey, I have to wait for the data to land in my storage, so it may take a few seconds to respond to queries." It's essentially real time: streaming data you can analyze and respond to in under a second. And last but not least, with 3.0 your data can be stored in object store. In previous versions of InfluxDB, you stored it on SSD, which has a cost associated with it, so you had to compromise: "Do I store this forever and pay that cost, or roll the data over and keep only a little bit?" With the lowest-cost storage, with the ability to store your cold data in object store, it's faster, it's better, and it lowers your cost as well. Let's look at each of these benefits and how it actually helps you as a customer or a user of InfluxDB 3.
Balaji Palani: 00:09:33.626 So some of the benefits are: now, within a single data platform, you can store metrics, events, and traces. This used to be a common concern with previous versions of InfluxDB: "Hey, I'm unable to store this. We are hitting cardinality issues." So you might choose different solutions or tools for storing metrics, events, and traces. With this, I can store all of them in a single data store, bring them together, and reduce operational complexity. And the way we did this was, for example, as data lands in, we partition by time. We use a catalog for accessing data fast; our internal catalog can locate the data, so access is made faster. We're not really relying on indexing to access that data. Some of the query engine optimization we have done is around how to access these files. And again, this is columnar storage, and the way we access it makes it super fast. We don't rely on indexing or on the way data is stored, and so on. When you remove some of these limitations, it really helps in terms of bringing metrics, events, and traces together and reducing your operational complexity. We also deliver subsecond query responses for recent, leading-edge data. What do I mean by that? As data lands in, in previous versions of InfluxDB, we used to persist that data and then make it available for query. In InfluxDB 3, that is no longer the case.
Balaji Palani: 00:11:14.264 As data lands in, it's available in in-memory caches, making it queryable pretty much immediately. Almost within a second, we have seen data access for real time: the last five minutes of data, or even the last minute of data, available within milliseconds for you to query. And we use Apache Arrow. For those who are not familiar, Apache Arrow and Apache Parquet are open source formats. We are using an open source format for internal data representation, making it best suited for columnar in-memory analytics. It is also optimized for instant responses on live or recently written data, really making it faster and enabling those real-time use cases like I talked about before. What about the longer-term data? With the in-memory cache you can deliver real-time data, but for the longer-term data we are using Parquet. The data is persisted using Parquet, and we are also using what we call Arrow DataFusion. This is, again, an open source project that we actively contribute to. It's a query engine which has vectorized execution and push-down strategies. Let's say you're asking, "Hey, I'm looking for a count over column x filtered by column y." It knows how to optimize that. It knows how to access the data without reading everything into memory.
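As a conceptual sketch of why columnar layout plus push-down helps (this is a toy illustration, not InfluxDB or DataFusion internals): when each column is stored contiguously, a query that only references two columns never has to read the rest of the row.

```python
# A toy columnar table: each column stored contiguously, so a query
# like "count of x where y > 10" only touches the two columns it needs.
table = {
    "time": [1, 2, 3, 4, 5],
    "x":    [10, 20, 30, 40, 50],
    "y":    [5, 15, 8, 25, 30],
    "note": ["a", "b", "c", "d", "e"],  # never read by this query
}

def count_x_where_y_gt(table, threshold):
    # Projection pushdown: read only the columns the query references.
    x = table["x"]
    y = table["y"]
    # Predicate pushdown: filter while scanning, with no full-row materialization.
    return sum(1 for xi, yi in zip(x, y) if yi > threshold)

print(count_x_where_y_gt(table, 10))  # → 3
```

A vectorized engine like DataFusion applies the same idea over Arrow arrays in batches, rather than one Python value at a time.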
Balaji Palani: 00:12:56.820 So all of these optimizations are part of the DataFusion query engine, and we actively contribute to it. There are a lot of time series elements that we have contributed, and we use that DataFusion query engine across InfluxDB 3, with parallelism and data partitioning, to really make it faster even for queries over longer time ranges. What about data storage? Because we use Parquet, we get really great compression on data, much greater than we used to get with InfluxDB's TSM engine in older versions. And furthermore, we are using object store. When data comes in, it's converted to Parquet, or persisted as Parquet directly in object store. So you have the in-memory component and the data being persisted as Parquet in object store. Object store is three to five times cheaper than SSD, and combined with the compression factor you get, we have seen internally that customers can get up to a 10x reduction in storage cost. Imagine you're storing terabytes of data on SSD in previous versions of InfluxDB. With 3.0, you can basically reduce your cost 10x, or, put another way, store 10x more data at the same cost. So really very useful.
Balaji Palani: 00:14:31.922 Last but not least, we believe that Parquet, being the open data standard that it is, enables interoperability with machine learning and advanced analytics tools. We have many examples of using Spark, or even Pandas dataframes, to work directly with Parquet. And all of this happens in a way that supports zero-copy data sharing, right? The data is already in Parquet. We can enable zero-copy data sharing using [inaudible] and those kinds of integrations that allow you to access this data from Snowflake, from your data lake platforms, or from ML and AI platforms. Now your data is democratized; it is interoperable with other tools that your data scientists and data engineers can operate on. So how about performance? Everything you just saw was benefits from an architecture perspective, but we've also seen major improvements in performance. Specifically, InfluxDB 3 has about 45x better write throughput than previous versions of InfluxDB, and about a 90% reduction in storage cost for high-cardinality data.
Balaji Palani: 00:16:03.084 We used to be very slow, or even unusable, in previous versions; now we are seeing 100x faster queries even with high-cardinality data, and anywhere from 5x to 45x faster for recent data, depending on the query. And there are a lot of other improvements. We have published a benchmarking white paper that compares open source versus InfluxDB 3, so you can take a look at it; go to our website, it's available in the blog and so on. All of these major improvements are available in InfluxDB 3, making it super useful for users and customers. This is a quote from one of our customers, one of the early adopters of InfluxDB 3: European XFEL. They're an X-ray free-electron laser facility. They use InfluxDB, and they involve about 12 European countries. And we have other commercial customers as well that we are actively working with on InfluxDB 3. So I'm going to hand it over to Gunnar to talk about Clustered from here. Gunnar, I'll walk through the slides, and you can just keep going.
Gunnar Aasen: 00:17:33.150 Great. Thank you, Balaji. So as Balaji just went over, 3.0 brings a lot of really interesting capabilities to the InfluxDB ecosystem. And from the InfluxData product side of the house, we are introducing a new product, the next product in the 3.0 line, called InfluxDB Clustered. Earlier in the year we introduced InfluxDB Cloud Serverless as well as InfluxDB Cloud Dedicated, both cloud database services that bring 3.0 to you as a managed service. InfluxDB Clustered is a self-managed offering, and this is the first time we are releasing a 3.0 product that is completely self-managed and self-hosted. Could you advance one more slide, Balaji? All right. So like I mentioned, InfluxDB Clustered allows you to leverage 3.0 functionality on your own custom infrastructure, and I'll get to some of the benefits of why that's interesting in a little bit. But just to reiterate, this brings you the ability to run with huge amounts of cardinality, in the tens or even hundreds of millions or more, and unlock new use cases on InfluxDB, such as tracing and other ephemeral time series use cases. It also unlocks significant scale: millions of values per second and greater in InfluxDB 3.0.
Gunnar Aasen: 00:19:21.542 In particular, InfluxDB Clustered, being a self-managed solution, allows you to leverage all of your own infrastructure to really make InfluxDB a sink for all of your metrics data without needing to do any aggressive downsampling or cutting out metrics. And then 3.0 also brings you additional capabilities on the query side of the house, in particular the introduction of SQL as a first-class query language, and InfluxQL as a first-class query language as well. We also have a lot of blog posts on our website about the data compression and the use of Parquet files underneath. That's it. Next slide, Balaji. Thank you. Not sure how many of you out there are InfluxDB Enterprise customers, but InfluxDB Clustered is, in some respects, the evolution of InfluxDB Enterprise. It is a significant departure in terms of the architecture, the deployment, and the components that come together to create InfluxDB Clustered. But in the sense of being a self-managed, clustered, highly scalable, dependable enterprise solution, InfluxDB Clustered is in that square, in that target range. Next slide. So, as mentioned, you gain all the 3.0 capabilities.
Gunnar Aasen: 00:21:08.275 And in addition to that, Clustered comes to you as a packaged Kubernetes solution with a ton of flexibility in terms of how you can work with the cluster components: to meet all sorts of unique environments that you may be running in your organization, to run InfluxDB Clustered with the various bits and pieces you would like to run it with, and to meet not only regulatory requirements but also fit well within however you may deploy and operate Kubernetes clusters. Next slide. Yeah. So who is InfluxDB Clustered for? As the name would suggest, it is a clustered deployment solution. So if you're coming to us from the open source side of the house, using InfluxDB on a single node, InfluxDB Clustered is the next step up in scale and size of workloads. In particular, we find large enterprises get a lot of use out of InfluxDB Clustered: the flexibility of being able to scale out to meet potentially enormous workloads, as well as being able to configure a whole bunch of different attributes and parameters that aren't available in our cloud products, simply because it's not possible for us, with our cloud products, to deploy in your environment and meet all the numerous configuration setups that you run.
Gunnar Aasen: 00:23:04.958 It also allows you to go even further on the security side, leveraging a lot of the underlying Kubernetes fundamentals to really lock down your InfluxDB cluster and deployment at the next level. Next slide. So I'm just going to run through a few of the core use cases and targets for InfluxDB Clustered. To start, one is just massive performance, and this is where InfluxDB Clustered, and 3.0 in general, really shines. Our Cloud Dedicated product operates very similarly to Clustered; when we built Clustered, we actually pulled a lot of the components of our managed cloud offering to help create and inform the seed of the Clustered deployment. And out of that, over time, as we improve our management of InfluxDB Cloud Dedicated, that will all feed into Clustered and make it easier for you to deploy and manage Clustered as well. So definitely at the top of the list of reasons you'd want to use Clustered is a need for massive scale and performance. The 3.0 architecture separates out more cleanly the ingest and the query side of the house for InfluxDB.
Gunnar Aasen: 00:24:52.224 If you're familiar with the previous versions of InfluxDB, both open source and enterprise, ingest and query are very intertwined in those versions. With InfluxDB 3.0, we separate that out to a much cleaner degree, allowing you to scale depending on your use case: scale on the ingest side to handle huge amounts of ingest even if your query volume is relatively low, and then vice versa. And obviously, with high query volume and high ingest, you can meet that need as well, including deploying both of those tiers to potentially different node types and other optimizations to utilize your infrastructure to the fullest. Next slide. Yeah. The other thing Clustered gives you, for all those companies out there who have to deal with regulations around data governance, or who just want full control over their own destiny: InfluxDB Clustered gives you both complete visibility and the ability to tune your database to the very specific needs of your workload.
Gunnar Aasen: 00:26:35.938 There are already some controls in 3.0 that we expose through Clustered that aren't necessarily available yet even through our cloud products, such as custom partitioning options that you can use to improve certain aspects of your workload, depending on how that workload is shaped. And we'll have more docs and other blog posts coming out about these kinds of extra tuning options and configurations in the next couple of months. But I think this is going to give those of you operating at massive scale the ability to really go in and tune your analytics workload at a much finer grain than previous versions of InfluxDB, and even other products within the 3.0 line. Next slide. Yeah. And then finally, enterprise-grade security and compliance. This is where I was going before: there are a lot of additional capabilities we provide in Clustered to hook into your enterprise-grade secure environments, including lots of options around ingress and inter-node encryption and communication. We have tried to construct it to be relatively secure by default. And we also ship with an integration into identity providers via OIDC, so you can leverage your existing identity provider setup to manage your database users on InfluxDB Clustered.
Gunnar Aasen: 00:28:35.122 Next slide. Yeah. And with that, I have a quick demo. So let me just pull it up. All right. So what I think might be the most interesting for you all to start with is... oh, sorry, I put up the wrong screen. So what is InfluxDB Clustered at the core of it? InfluxDB Clustered runs on Kubernetes. Kubernetes is a requirement, and we do have a number of other requirements to run Clustered. So if you're coming to us from InfluxDB Enterprise, there are a number of additional dependencies that InfluxDB Clustered requires, the main one being Kubernetes. It leverages a lot of the underlying Kubernetes components to essentially deploy a bunch of separate services that together compose the overall InfluxDB Clustered setup.
Gunnar Aasen: 00:30:30.175 And in addition to that, there are also several other dependencies. Again, to make a reference back to InfluxDB Enterprise: if you're familiar with InfluxDB Enterprise, there is a concept of meta nodes, which store the overall cluster state. In InfluxDB Clustered, we break that up. Some of that cluster state is now stored in Kubernetes via CRDs, or Custom Resource Definitions. And with 3.0 in general, but particularly for Clustered, we introduced the concept of a separate catalog store, which uses Postgres underneath as a metadata store: in particular, for information about all of the Parquet files and much of the other underlying database metadata that was previously stored, on the enterprise side, in a relatively custom metadata store. With Clustered, we switched over to storing a lot of that information in Postgres, which makes it much more accessible and easier to operate, given the broad support for Postgres in both Kubernetes operators and managed options out there on the market. InfluxDB 3.0 also operates on object store underneath, so another dependency for Clustered is some type of object store.
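To illustrate the catalog idea, here is a heavily simplified, hypothetical sketch (the real Clustered catalog schema is internal; this uses SQLite in place of Postgres, and every table and column name is a placeholder): a relational table tracks which Parquet files exist and what time range each covers, so the planner can skip files outside the queried window.

```python
import sqlite3

# Hypothetical catalog: which Parquet files exist, and what time range each covers.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE parquet_files (
        table_name TEXT,
        object_store_path TEXT,
        min_time INTEGER,
        max_time INTEGER
    )
""")
conn.executemany(
    "INSERT INTO parquet_files VALUES (?, ?, ?, ?)",
    [
        ("cpu", "s3://bucket/cpu/2023-09-17.parquet", 1694908800, 1694995200),
        ("cpu", "s3://bucket/cpu/2023-09-18.parquet", 1694995200, 1695081600),
        ("cpu", "s3://bucket/cpu/2023-09-19.parquet", 1695081600, 1695168000),
    ],
)

# Prune: only files whose time range overlaps the queried window are fetched
# from object store.
query_start, query_end = 1695000000, 1695100000
rows = conn.execute(
    """SELECT object_store_path FROM parquet_files
       WHERE table_name = 'cpu' AND min_time < ? AND max_time > ?
       ORDER BY min_time""",
    (query_end, query_start),
).fetchall()
paths = [r[0] for r in rows]
print(paths)  # only the 09-18 and 09-19 files overlap the window
```

Storing this bookkeeping in a plain relational database is what makes the catalog easy to back up, inspect, and run on any managed Postgres.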
Gunnar Aasen: 00:32:16.731 For a lot of people running on the major cloud providers, that object store is going to be Amazon S3, Azure Blob Storage, or Google Cloud Storage, but if you want to deploy MinIO, that also works great. And then finally, we do offer an option to hook into an identity provider via OAuth2/OIDC. That is an additional option, and we have tested a number of identity providers so far, including Keycloak, Azure Active Directory, Auth0, and a few other managed services. Obviously, if a provider supports the protocol, it generally works, though there are some quirks between different implementations. We're going through and doing various testing on different implementations and expanding the known support for them, particularly the ones that are a little bit more quirky than others. So with that, InfluxDB Clustered requires you to deploy, or have available, these dependencies.
Gunnar Aasen: 00:33:57.416 Once you have these dependencies available, what InfluxDB Clustered consists of is essentially a set of CRDs that you can deploy. We use a particular tool called [inaudible] that you can use to deploy a package of versioned CRDs and a schema for them; it's one of the tools we use to help manage upgrades and versioning of Clustered. Once you deploy the set of CRDs, you're able to create what we call an AppInstance. This is the core of what your InfluxDB Clustered deployment will be. It takes a number of different configuration settings; in particular, you'll notice things like object store pointers, as well as various identity provider settings. And this isn't the full list of options; this is just the example that we ship with. We do ship with a schema over here that has the full, exhaustive list of the different things you can set.
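The shape of such an AppInstance manifest looks roughly like the following. This is an illustrative sketch only: the field names, endpoints, and secret names below are placeholders, not the exact schema, which ships with the product and is the authoritative reference.

```yaml
apiVersion: kubecfg.dev/v1alpha1
kind: AppInstance
metadata:
  name: influxdb
  namespace: influxdb
spec:
  package:
    spec:
      # Placeholder fields; consult the shipped schema for real names.
      catalog:
        dsn:
          valueFrom:
            secretKeyRef:
              name: catalog-dsn   # connection string to the Postgres catalog
              key: dsn
      objectStore:
        bucket: influxdb-parquet  # where Parquet files are persisted
        endpoint: https://s3.example.internal
      admin:
        identityProvider: keycloak
        jwksEndpoint: https://auth.example.internal/realms/influx/certs
```

The point is simply that the whole cluster is declared in one resource: catalog, object store, and identity provider are all wired up through configuration values you fill in.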
Gunnar Aasen: 00:35:41.540 But once you have your prerequisites and you have a Kubernetes cluster, you can come to us, fill out these configuration values, and deploy or apply it, and Kubernetes will go and deploy that InfluxDB Clustered setup. So just to show you what that looks like, I've got an OpenShift setup here with all of this deployed. Like I mentioned, there are a number of different combinations and a lot of flexibility that InfluxDB Clustered supports, using, in some sense, essentially raw YAML manifests that you apply. One of the supported configurations now is Red Hat OpenShift. On the Kubernetes side, we do have a requirement that you be on Kubernetes version 1.25 as the minimum version that we support. But we also have a lot of other options you can use, not just for deploying to something like OpenShift, but also for setting up different ingress options on various cloud providers, although we do have the best default support for the NGINX-based Ingress.
Gunnar Aasen: 00:37:39.314 Yeah. So once you deploy and apply your cluster setup, you'll have a number of components. Actually, sorry, I'll mention one other thing real quick that I forgot: we do require a namespace. In this case, we generally recommend simply influxdb as the namespace in which to deploy or apply all these components. And then, in terms of what actually gets deployed, there are a number of different components here. I think the most interesting by far are the workload components. If you're coming from InfluxDB Enterprise, there are two node types there: the data node type and the meta node type. I mentioned meta nodes earlier, and in InfluxDB Enterprise, the data nodes are the workhorse nodes. In Clustered, there's a trifecta of what is equivalent in terms of workload pods, or the deployments and services that are exposed. The first of those workload deployments is the ingester setup.
Gunnar Aasen: 00:39:14.513 So the ingester is, as it sounds like, the service that handles all the ingest, in terms of writes. InfluxDB Clustered, like all 3.0 products, supports on the write path the InfluxDB V1 and V2 write HTTP APIs, along with line protocol. The ingester is what actually takes the batches of points that you're sending in and converts them, on its side, into Parquet. The ingester also keeps a short-term cache. We carried over the naming convention, though the implementation is somewhat different: the ingester keeps a WAL, or write-ahead log, which is short-term storage of hot data, essentially, which it makes available to the queriers when you run a query that covers a time range in near real time. So the ingesters handle all that conversion to Parquet, actually writing the initial Parquet files, and handle the fast path for real-time queries.
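For readers unfamiliar with line protocol, here is a minimal sketch of building one point. The serializer below is an illustrative helper, not an official client; it covers only the simple case (no escaping, string and float fields only). The resulting line is what you would send to the write endpoint.

```python
def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Serialize one point as InfluxDB line protocol:
    measurement,tag=value field=value timestamp"""
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_str = ",".join(
        f'{k}="{v}"' if isinstance(v, str) else f"{k}={v}"
        for k, v in fields.items()
    )
    return f"{measurement},{tag_str} {field_str} {timestamp_ns}"

line = to_line_protocol(
    "cpu",
    {"host": "server01", "region": "us-west"},
    {"usage_user": 23.5},
    1695081600000000000,
)
print(line)
# cpu,host=server01,region=us-west usage_user=23.5 1695081600000000000
```

In practice you would batch many such lines and POST them to the cluster's V1 or V2 write HTTP API; the ingester takes those batches and converts them to Parquet as described above.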
Gunnar Aasen: 00:40:53.989 I forgot to mention: there is a router as well. There's a router in front of the ingesters, which is how the traffic actually gets sent over; it makes sure writes are distributed and go to the right place on the ingester side of the house. So once the ingested data goes to the ingester, it eventually ends up in a Parquet file in object store. In this particular cluster, we have MinIO running in the cluster, operating as our object store. And once it lands in object store, there is another service called the compactor. Previous versions of InfluxDB, on the TSM storage engine (the storage engine before 3.0), had a similar type of setup for compacting data periodically. In 3.0, this compaction process, or set of processes, is extracted into its own separate setup, which is shown here. It runs in the background and combines the Parquet files created by the ingester into larger, more compact Parquet files over time.
Gunnar Aasen: 00:42:40.789 On the query side of the house, you get faster performance on your queries as you have fewer files that need to be referenced in any particular query, which leads me to the last of the workhorse pieces of Clustered: the Querier. The Querier is what it sounds like; it handles queries, going out to the object store and processing queries natively as SQL. Under the hood, we’re using a project from the Apache Arrow ecosystem called DataFusion. It’s a high-performance query engine built on top of Arrow as a data format, and it tries to reduce as much as possible the amount of transformation required for any particular lookup. So the Querier essentially handles parsing queries. And, as with the other 3.0 products, the Querier handles both SQL and, natively, InfluxQL as well, which I think is an exciting addition to the 3.0 set of features that we’ve found has been very useful. On that note, the Clustered setup is able to support the V1 InfluxQL query endpoint as well.
Gunnar Aasen: 00:44:29.824 And with support for both V1 write and query via InfluxQL, we’re able to offer pretty much parity, from a basic database client perspective, between InfluxDB 1.8 on the open source side and 1.10, soon to be 1.11, on the Enterprise side. You can seamlessly point those workloads over to Clustered and just pick up and run with what your clients are already writing and querying. We are also working on migration tooling as we speak to make the bulk movement of existing data much easier. So for anyone coming from Enterprise to Clustered, or even from open source to Clustered, there’s hopefully going to be a very simple migration on that front. That finishes the trifecta of workhorse components of Clustered: the ingester, the compactor, and the querier. There are then a bunch of other supporting components. In particular, the other supporting component I mentioned earlier, the replacement for the meta nodes in Clustered, is the catalog, which handles database metadata, including the canonical list of Parquet files that exist in the database and can be referenced for queries and other functionality. That’s called the catalog, and it uses Postgres underneath as the catalog DB.
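The V1 parity mentioned above means existing 1.x clients keep working unchanged. A minimal sketch of what those clients do, assuming a hypothetical host and database name:

```python
# Sketch of the V1-compatible InfluxQL query endpoint used by 1.x clients.
# Host and database names are placeholders.
from urllib.parse import urlencode

def v1_query_url(host, db, influxql):
    # 1.x clients send InfluxQL as the `q` parameter to /query,
    # which Clustered serves for backward compatibility.
    params = urlencode({"db": db, "q": influxql, "epoch": "ns"})
    return f"{host}/query?{params}"

url = v1_query_url(
    "https://cluster.example.com", "mydb",
    "SELECT mean(usage) FROM cpu WHERE time > now() - 1h GROUP BY time(5m)")
print(url)
```

Because the endpoint shape is unchanged, pointing a 1.8 or 1.10 client at a Clustered URL is typically just a host and credential swap.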
Gunnar Aasen: 00:46:27.848 We also have a few other components, including an auth component that handles authorization, an account component, and a debug service that we ship as well, which makes it easier for you to bundle up logs and other profiling information from your environment. When you do hit a bug and need our help on the InfluxData side to debug an issue and figure out whether it’s a code problem or a configuration problem, the debugging service is designed to make it easy to gather the artifacts we’ll need to effectively debug those pieces. I’ll also end with this: generally, for Clustered, we are using a lot of stock Kubernetes components underneath. A lot of stuff can be configured and overlaid on top of the set of CRDs that we provide, so you can do some pretty specific overlays on top of what you deploy, for various ingress setups and other setups, to be as flexible and as good a Kubernetes citizen as possible. And just as I was getting ready, I screwed up the ingress on this particular cluster while playing with the demo, so it’s not actually working right now, unfortunately.
Balaji Palani: 00:48:34.429 Nobody’s going to know. I was just going to say that we may be running short on time, and there are a lot of questions that folks have asked, so maybe we spend some time on those. Thank you. Let me re-share my screen again just to wrap it up. All right. So getting back on track here: InfluxDB 3.0. We launched Cloud Serverless earlier this year and followed it up with Cloud Dedicated, which is essentially Clustered, but operated and managed by us for a single customer and single workload. Serverless is a shared system, where you can just sign up for an account and then run your workload, but all the files and everything underneath are shared, just logically separated. And Clustered is of course very similar to Cloud Dedicated, except it’s managed on-premises by you. There are more coming down the line, but this is pretty much the 3.0 product portfolio, and we are super excited to be releasing so many things since we announced IOx last year. So this is my last slide. Again, leaving you with this: it’s all better, faster, and it reduces your TCO. If you want to spin up Clustered, talk to our sales team. It is not self-serve, so you do have to talk to a salesperson so we can give you access to Clustered and set up your trial or POC. So do click on that and go check out Clustered. There is plenty of documentation available, so check it out. Or if you want to talk to us, just ping us or contact us, and we will get back to you. That was my last slide, so I think we should get into Q&A. I’ll hand it back to Caitlin.
Caitlin Croft: 00:50:34.415 Awesome. Thank you so much. Thank you, Balaji and Gunnar. There’s a ton of questions. So we’ll start going through them. What I’ve read so far is that InfluxDB 3.0 will provide InfluxQL or SQL query language. Is there a plan to support PromQL, which is known as the other famous query language for time series?
Gunnar Aasen: 00:51:03.428 Yeah, I can take that one. So InfluxDB Clustered, and 3.0 generally, currently has full native support for SQL and InfluxQL. We have not rolled out support for PromQL, and it’s not currently on the near-term roadmap, so I’ll leave it at that.
Caitlin Croft: 00:51:22.620 All right. I’d say just also stay tuned. We always promote everything that we’re doing on our blog, so. With InfluxDB 3.0, is it still best practice to store device UUID as fields rather than tags, or has the unlimited cardinality changed those best practices?
Gunnar Aasen: 00:51:44.168 I think this one is —
Balaji Palani: 00:51:44.513 I think the question here is whether you store it as fields or tags; that used to be a concern in previous versions. With InfluxDB 3.0, in the back end, all fields and tags are stored as columns. And with unlimited cardinality, you can store as many columns as you want; we don’t stop you there. Obviously, if you use Serverless, there are some default restrictions you can’t go beyond, I believe 200 columns for a single measurement or table, but those limits are configurable on Clustered and Cloud Dedicated.
Gunnar Aasen: 00:52:29.133 Yeah, and I’ll add on to that: like I mentioned earlier, there are some additional knobs that 3.0, and Clustered in particular, expose to help with optimization, particularly on the query side. So yes, there’s unlimited cardinality, but just because you’re storing unlimited cardinality doesn’t necessarily mean you can query all of it and have everything be quite as fast as it otherwise would be with less cardinality. There are knobs we expose in Clustered that let you do some optimization on, say, a UUID, partitioning that data in a way that will get you much better query performance.
Caitlin Croft: 00:53:22.249 Are there plans for there to be a Kapacitor equivalent? If there are no plans for this, is there a commercial metric processing engine that will be able to provide this function?
Balaji Palani: 00:53:39.527 Gunnar, that’s yours.
Gunnar Aasen: 00:53:40.405 Yeah. [laughter] So there is not going to be a new Kapacitor equivalent. In the 3.0 paradigm, everything is built, at least on the query side, on Apache Arrow buffers underneath, passing that data around directly in Arrow format using a protocol called Flight, as well as Flight SQL, which is a derivative of it. That allows you to pass those Arrow buffers with minimal deserialization or marshalling, in comparison to, say, having to marshal things into JSON or another format. What that allows you to do is, one, fetch much more data in your query at a much faster clip than you otherwise would, and two, work with that data on the client side in a format that’s already optimized for analytics. And there’s significant support in the data analytics ecosystem for Arrow that’s only increasing.
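A rough sketch of what a Flight query looks like from a client. The ticket format shown here (a JSON payload naming the database and SQL text) is an assumption based on the IOx-style Flight interface, and the client call is shown only as a comment; host and database names are hypothetical.

```python
# Sketch of querying over Arrow Flight. The JSON ticket format is an
# assumption based on the IOx-style interface; names are placeholders.
import json

def flight_ticket(database, sql):
    # Flight DoGet takes an opaque ticket; here it carries the query
    # itself, so results stream back as Arrow record batches with no
    # JSON marshalling on the way out.
    return json.dumps({"database": database,
                       "sql_query": sql,
                       "query_type": "sql"}).encode()

ticket = flight_ticket("mydb", "SELECT time, usage FROM cpu LIMIT 10")

# With a Flight client (e.g. pyarrow.flight, not shown here), you would
# submit this ticket and receive columnar, analytics-ready batches:
#   reader = client.do_get(flight.Ticket(ticket))
#   table = reader.read_all()   # a pyarrow.Table, zero-copy from the wire
print(ticket.decode())
```

The point of the design is in the commented lines: the result arrives as Arrow batches that analytics tooling can consume directly, with no row-by-row decoding.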
Gunnar Aasen: 00:55:07.328 And so for 3.0 in general, our focus is shifting back onto that core database experience and letting users use whichever scheduler or analytics workload-management tooling best fits their needs. With any type of task scheduling, there are always different slices of requirements for those tools, and there are a lot of task schedulers out there built for very specific things, whether that’s data science workloads, metrics workloads, or other things like that. I will say that we do have plans to build a type of materialized views into the database itself. So some downsampling that leverages the built-in query engine is going to be possible within the database itself, but from a long-term viewpoint we’re going to stay focused on extensibility for general-purpose task scheduling.
Balaji Palani: 00:56:33.029 Let me add one more thing to make it really simple. If you are using Kapacitor today, the APIs that Kapacitor uses just work with InfluxQL, and we support those InfluxQL APIs. So you should be able to use Kapacitor with InfluxDB 3.0 as well, although only for batch tasks. Streaming, such as subscriptions, may not work, but basic Kapacitor tasks that work with InfluxQL should continue to work with 3.0.
Caitlin Croft: 00:57:06.105 Okay. There are a few questions around pricing and licensing. Will Clustered be openly available like InfluxDB Open Source 2.7, or will it be a licensed, paid system? And there are also questions around what the licensing cost is to run InfluxDB Clustered. With pricing questions, I would just say talk to our sales team, because they can best understand your workload and give you more accurate pricing than myself, Gunnar, or Balaji probably could. Is there anything else you guys would like to add to those questions?
Balaji Palani: 00:57:46.225 Yeah. With regards to pricing or licensing, it’s good to understand that we’re going to license InfluxDB Clustered very similarly to Cloud Dedicated. That is, it’s based on the total amount of CPU and RAM. Gunnar talked about the ingester, querier, and compactor; all of these are services that take compute and memory, and you can have 10 ingesters, 20 queriers, or something like that. It doesn’t matter how that configuration is laid out; the overall compute and memory you utilize determines your license, and the pricing is based on that. The other thing I saw that I wanted to answer: I think there was a question about open source. We are working on 3.0 Open Source. We will make an announcement pretty shortly on 3.0 Open Source and what our strategy is going to be, so please stay tuned for that, and we will answer all of your questions, including on the long-term viability of Flux support. I think there was a question on Flux: how is it going to be supported, and so on. Just to be clear, in 3.0 we are only supporting InfluxQL and SQL natively, although on Cloud Serverless we do support Flux, and there is no change on that. But we will clarify all of that in the announcement we’ll be making shortly.
Caitlin Croft: 00:59:26.872 Perfect. Kind of along those lines, Balaji, there’s a question here, with the removal of tasks in InfluxDB 3.0, how is InfluxDB thinking about aggregating and downsampling data? Is the idea that the query performance enhancements will make this less needed?
Gunnar Aasen: 00:59:46.319 Yeah. Oh, sorry.
Balaji Palani: 00:59:47.620 Go ahead.
Gunnar Aasen: 00:59:48.916 Yeah. Like I mentioned when responding to the question about Kapacitor, yes, we believe there are some use cases where it makes sense, and may be more performant, for us to do some processing in the database itself. In particular, a lot of downsampling and other summarization pieces are really core time series and analytics functions. So we will be adding, essentially, a type of materialized views (it is not there yet) to use the SQL engine in 3.0 and Clustered to do some amount of processing on your data. It’s not going to be fully extensible like Flux and tasks were, but it will be fairly powerful in terms of what’s available and performance.
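While materialized views are not available yet, the kind of downsampling they would automate can already be run as a plain aggregation. The SQL below is illustrative (table and column names are made up), and the Python sketch shows the same time-binning logic in plain code:

```python
# Downsampling sketch. In SQL (names illustrative), a 15-minute rollup
# looks roughly like:
#
#   SELECT date_bin(INTERVAL '15 minutes', time) AS bin,
#          avg(usage) AS avg_usage
#   FROM cpu
#   GROUP BY bin;
#
# The same binning logic in pure Python:
from collections import defaultdict

def date_bin(ts_ns, interval_ns):
    # Truncate a nanosecond timestamp down to its interval boundary.
    return ts_ns - (ts_ns % interval_ns)

def downsample(points, interval_ns):
    """points: iterable of (ts_ns, value); returns {bin_start: mean}."""
    bins = defaultdict(list)
    for ts, val in points:
        bins[date_bin(ts, interval_ns)].append(val)
    return {b: sum(vs) / len(vs) for b, vs in bins.items()}

minute = 60 * 10**9  # one minute in nanoseconds
data = [(1 * minute, 1.0), (2 * minute, 3.0), (16 * minute, 10.0)]
print(downsample(data, 15 * minute))
```

A materialized view would keep such a rollup continuously up to date inside the database instead of requiring an external scheduler.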
Caitlin Croft: 01:01:06.141 Perfect. Let’s see. I know we’ve run completely over time. If you guys can stay on, let’s just answer a couple more questions, and then we can always answer questions afterwards over email or Slack and all that sort of good stuff.
Balaji Palani: 01:01:20.461 I’m happy to stay on more. Yeah, we can answer. Yeah.
Caitlin Croft: 01:01:22.840 Perfect. How do you distribute sharded data across nodes? Based on a hash of the date/time, i.e., all nodes simply share the whole workload, or based on data labels?
Gunnar Aasen: 01:01:38.068 I apologize. I do have to hop off right now, but I’ll leave it in Balaji’s capable hands.
Balaji Palani: 01:01:42.715 Yeah, I can take care of it. Thanks, Gunnar.
Caitlin Croft: 01:01:45.952 Thanks, Gunnar.
Gunnar Aasen: 01:01:46.913 Thanks. Bye.
Balaji Palani: 01:01:49.020 So the question was — how do you distribute data across nodes?
Caitlin Croft: 01:01:52.589 Yes.
Balaji Palani: 01:01:53.708 The way we have architected InfluxDB 3.0, all of the components and servers internally understand Apache Arrow. Apache Arrow is an in-memory format, so all the different distributed nodes and components within the cluster can talk using Apache Arrow, held in memory, and those in-memory components are shared. If a query, for example, says, “Hey, I need this data,” it asks for it, and whatever data is available in memory is sent over, all within Apache Arrow. Data sharing across S3 or the object store is done using Parquet. And once that Parquet is pushed out, our DataFusion query engine has a lot of optimizations that apply to how it extracts that data and loads it back into memory. That’s why any data in Parquet, once it’s loaded into memory and you query it again immediately, is faster; the first time you access it, there may be a little bit of a delay. Again, we’re not talking seconds; it’s milliseconds. We have optimized it that much. Those are the architectural components which make it really fast and optimized for performance, scale, and so on. I don’t know whether this directly addresses your question, but if you want to talk to an architect or understand more about how data sharing occurs, please reach out to us over our community Slack, and I’m happy to set it up with our architects.
Caitlin Croft: 01:03:38.417 Perfect. Let’s see. I understand AWS support for cloud is limited to US East 1 and one availability zone in Europe. Does Clustered support multi-region deployments?
Balaji Palani: 01:03:53.667 So I think what you’re seeing about the regions of support on our current cloud website is for Cloud Serverless only. For Cloud Dedicated as well as Clustered, we can support any of those regions. In fact, for Dedicated, if you want a specific region, reach out to us, and we are happy to spin it up for you. With regards to multi-region support, I can take that as an action item; we’ll note it down and come back to you with an answer.
Caitlin Croft: 01:04:30.178 Are there benchmarks for how much of a performance hit querying time series data in cold storage imposes? Is the threshold for offloading hot data to cold storage configurable based on data lifecycle rules or just available memory?
Balaji Palani: 01:04:48.375 Great question. There are two questions there; I’ll answer the second one first. For offloading hot data into cold storage, there’s an internal algorithm that takes care of that. It is currently not configurable, although we might make it configurable in the future, for example on a time basis: anything within the last 15 minutes remains in the in-memory caches, while everything else is offloaded to cold storage. For now it’s an internal, proprietary algorithm, and it depends on how much memory you have allocated to the queriers and ingesters. It also depends on how much data is incoming; if there’s a lot of data, we will probably push most of it to Parquet and store very little of it in memory. Are there benchmarks? As I mentioned earlier in my slides, we use DataFusion. DataFusion is an open source project that we contribute to through the Apache Arrow ecosystem, and it has a lot of optimizations. We do have performance benchmarks, but I don’t believe we specifically benchmark how much time it takes to access cold storage. It’s pretty fast, though there are many factors. If you’re looking for data across time dimensions, there could be a bit of a performance hit. Data is split, or partitioned, by time, usually into 24-hour partitions, and the Parquet files are presorted by certain columns. But say you’re making a very complex query; that could take a longer time to access data across Parquet files and different partitions, though it’s something we have optimized to be quite fast.
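The time-partitioning just described is the main reason queries over narrow time ranges stay fast: the engine only opens files whose 24-hour partition overlaps the queried range. A toy sketch of that pruning, with hypothetical file names:

```python
# Illustrative sketch of time-partition pruning: only Parquet files
# whose 24-hour partition overlaps the queried range need to be read.
# The file layout and names here are hypothetical.
DAY_NS = 24 * 3600 * 10**9  # one day in nanoseconds

def partition_key(ts_ns):
    # Data is split by time, typically into 24-hour partitions.
    return ts_ns // DAY_NS

def prune(files, start_ns, end_ns):
    """files: {partition_key: [parquet file names]}; keep only the
    partitions overlapping [start_ns, end_ns)."""
    keep = range(partition_key(start_ns), partition_key(end_ns - 1) + 1)
    return [f for day in keep for f in files.get(day, [])]

files = {0: ["d0.parquet"], 1: ["d1.parquet"], 2: ["d2.parquet"]}
# A query over the first 36 hours touches only partitions 0 and 1:
print(prune(files, 0, 36 * 3600 * 10**9))  # ['d0.parquet', 'd1.parquet']
```

Presorting within each file plays a similar role one level down: the engine can skip row groups whose sorted-column ranges fall outside the predicate.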
Caitlin Croft: 01:06:45.489 Let’s see. Gunnar mentioned Kubernetes is a requirement. Is there any scope for running on plain Docker Compose-style infrastructure?
Balaji Palani: 01:06:56.600 For now, in the immediate near term, our plan is to support Kubernetes-based installations for Clustered, and that is not going to change. But there may be options in the future. If you have a requirement like that, I would advise talking to Gunnar or our product team; Gunnar, Gary, or Rick would be happy to talk to you and think through options.
Caitlin Croft: 01:07:22.545 Balaji, there’s been a couple of questions around if Edge Data Replication is available with InfluxDB Clustered. Can you talk about that?
Balaji Palani: 01:07:33.751 Yeah. Edge Data Replication is a feature that we enabled in open source, I believe in 2.3 or 2.5 and above. Basically, you can designate one or more buckets in open source and say, “Replicate this into the cloud,” and it uses the API. So Edge Data Replication, as it stands, is still supported; you can use it with Cloud Dedicated and it would work. But there are going to be changes in the future, and that is something we will address as part of the open source announcement we mentioned earlier.
Caitlin Croft: 01:08:19.949 Perfect. When does data compression happen? Is it immediately after the data is flushed to disk? And is all the data compressed?
Balaji Palani: 01:08:30.646 There are multiple levels of data compression happening. Apache Arrow, the in-memory format built into the core of InfluxDB 3.0, has compression happening there. Furthermore, when we write to Parquet, the Parquet itself is compressed, and that depends on several factors. I believe we try to compress in a manner that is not optimized purely for compression but optimized for queries. You can also partition by different columns, which could change the compression techniques, but there are multiple levels of compression happening. Also, Gunnar mentioned the Compactor. The Compactor is another service that runs iteratively every couple of hours. Once you have a lot of files built up, it could take a long time to access those files, so compaction optimizes them by combining several files into a single file while making sure that a single file is not too big. Compaction also handles late-arriving data, and for data marked for deletion, compaction helps with that as well. So compaction is another aspect which helps with compression.
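One reason presorted, columnar files compress well is that Parquet can apply run-length and dictionary encodings per column before any general-purpose codec. This is an illustration of the idea, not InfluxDB's actual codec:

```python
# Illustration (not InfluxDB's actual codec) of why sorted columnar
# data compresses well: a column sorted by tag value collapses into a
# handful of (value, run_length) pairs before byte-level compression.
def rle_encode(values):
    """Collapse runs of equal adjacent values into [value, count] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

# A tag column from a presorted Parquet file: long runs of one value.
host_column = ["server01"] * 500 + ["server02"] * 500
print(rle_encode(host_column))  # [['server01', 500], ['server02', 500]]
```

A row-oriented layout would interleave these values with fields from other columns, breaking up the runs and leaving much less for the encoder to exploit.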
Caitlin Croft: 01:09:53.640 How compatible is InfluxDB Clustered with industrial protocols such as Modbus, OPC UA, and all the others?
Balaji Palani: 01:10:05.230 So what you’re asking about is data ingestion. We continue to support, and promote, Telegraf as your data ingestion or data collection agent. Telegraf is open source and has a whole bunch of plugins, I think 300-plus. It supports Modbus, OPC UA, MQTT, collecting from AWS IoT, and so on; it has a whole number of input sources. And InfluxDB 3.0 supports, as I said, REST APIs for ingestion, and we support Flight for queries. Flight is extremely fast. I believe we can also query using the APIs.
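As a rough sketch of that pipeline, a Telegraf configuration wiring industrial inputs to a Clustered write endpoint looks something like the following. Addresses, token, and bucket are placeholders, and each input plugin needs its full register/node configuration, which is omitted here:

```toml
# Hypothetical, abbreviated Telegraf config: collect from industrial
# protocols and write to a Clustered endpoint via the V2 API.
[[inputs.modbus]]
  name = "plc"
  controller = "tcp://192.168.1.10:502"
  # ... register definitions omitted ...

[[inputs.opcua]]
  endpoint = "opc.tcp://192.168.1.11:4840"
  # ... node definitions omitted ...

[[outputs.influxdb_v2]]
  urls = ["https://cluster.example.com"]
  token = "MY_TOKEN"
  bucket = "mydb"
```

Telegraf batches the collected metrics and writes them over the same V2 write API discussed earlier, so no InfluxDB-side changes are needed per protocol.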
Caitlin Croft: 01:10:58.059 Awesome. Thank you so much, Balaji. I know there’s a ton of questions still. I think we should write some blogs, maybe some secondary webinars, because clearly there’s tons of questions around InfluxDB Clustered, which I’m very excited about. I think it shows that you guys are super excited about what we’ve been working on and are excited to try it. Once again, this webinar has been recorded. It will be made available for replay probably by tomorrow morning. So be sure to check it out. Everyone on this call should have my email address. Feel free to email me. I’m more than happy to connect you to Balaji and Gunnar and others on our product team or whoever else can help you answer your questions. Really appreciate everyone sticking on this webinar and staying on as we had so many questions. So really appreciate it. Thank you, everyone, and I hope you have a good day.
Balaji Palani: 01:11:54.971 Yeah, just really quickly before we wrap: I know there are a lot of questions about the cost of Clustered. Please do us a favor and hit Contact Us. If you go to influxdata.com, you can find Contact Us or Contact Me; just fill out a very short form, and we will get back to you. Somebody will give you a call and set up a meeting to understand your use case and then provide you that cost factor. That’s it. I’m super excited to be launching Clustered. Thank you for listening, and thank you, Caitlin, for hosting.
Caitlin Croft: 01:12:29.756 Of course. Thank you so much, Balaji. I hope everyone has a great day, and I’m once again more than happy to field questions as well as you guys have them. Thanks for joining and have a great day.
Balaji Palani: 01:12:41.810 Thanks, everyone.
Caitlin Croft: 01:12:42.659 Bye.
VP, Product Marketing, InfluxData
Balaji Palani is InfluxData’s Vice President of Product Marketing, overseeing the company’s product messaging and technical marketing. Balaji has an extensive background in monitoring, observability and developer technologies and bringing them to market. He previously served as InfluxData’s Senior Director of Product Management, and before that, held Product Management and Engineering positions at BMC, HP, and Mercury.
Senior Product Manager, InfluxData
Gunnar Aasen is a Senior Product Manager at InfluxData. Gunnar was an early employee and the first support engineer at InfluxData. He now enjoys applying his deep technical expertise toward building developer-oriented products. He is based in Berkeley, California.