In this webinar, you will learn how to install an InfluxEnterprise cluster in your own environment.
Here is an unedited transcript of the webinar “Installation of InfluxEnterprise (cluster).” This is provided for those who prefer to read rather than watch the webinar. Please note that the transcript is raw. We apologize for any transcription errors.
• Chris Churilo: Director Product Marketing, InfluxData
• Michael DeSa: Software Engineer, InfluxData
Chris Churilo 00:08.326 Okay, we’re going to go ahead and get started. Might as well kick off our Thursday right with another fabulous training from Michael DeSa. Just for your information, everybody on the call, Michael DeSa is one of our key developers on the Kapacitor project, but he’s actually going to be reviewing how to install the entire InfluxEnterprise cluster with you today. As always, these sessions are recorded so we will take the recording and post it to our website later on so you can take another listen, and I’ve already given you the links to be able to take a listen to all the other recordings. If you have any questions during any point of the session, go ahead put them in the Q&A or the chat panel. We’ll make sure that we get those answered before the end of the training, and then as always, please also post your questions in community. Our engineers are doing their best to basically live there and get all your questions answered. And I think the rest of the community would also appreciate being able to see those Q&A on various topics that they may also have an interest in. So with that, I’m going to pass the ball to Michael and we’ll get started.
Michael DeSa 01:21.540 All right, thank you very much, Chris. So as Chris mentioned, today we’ll be installing an InfluxEnterprise cluster. So we’ll talk a little bit about the pieces of InfluxEnterprise and then kind of move into the installation procedure. So the agenda for today is: What is InfluxEnterprise and what is InfluxCloud? Just talk a little bit about those things. We’ll talk about what an InfluxDB cluster is. What is a meta node? What is a data node? What are replication factors, and how do these things all kind of come into play? Talk a little bit about the types of hardware and maybe a little bit about some ways you can set up your cluster: putting a load balancer in front, and what changing a replication factor does to your query latency versus your write latency. And then we’ll talk very briefly about what HA Kapacitor or Enterprise Kapacitor is, and then we’re going to close it all out and we’ll do a brief demo of an installation of an Influx cluster. I’ll be doing it locally, so in production it will be slightly different, but the idea is mostly the same.
Michael DeSa 02:35.713 So for those of you who aren’t familiar with who InfluxData is or who we are at InfluxData is we’re a time series company, database company or platform company that’s building a modern engine for metrics and events. So there’s a few things that we see kind of going on and we see a higher volume of real-time writes. We’re seeing both regular data and irregular data. Regular meaning it comes in at a regular fixed interval. Irregular meaning it comes in kind of sporadically and so we kind of coined those things metrics and events. We’ve seen this workload where people often expire data based off of some kind of tight schedule. I don’t want to keep around all this data forever. I want to keep it around for a month or a year or any number of things like this. And so that’s kind of the way we see the world trending is to these real-time, time-based kind of workloads and so we’re trying to build towards that future.
Michael DeSa 03:45.355 So here’s a list of our product offerings. There are three main categories. We have all the way over to the right here the open source TICK Stack. So this is all of the components: Telegraf, InfluxDB, Chronograf, and Kapacitor. They’re all open source, they are all free to use, and they’re all extensible. They all have the scale that we talk about in this world of real-time, time-based queries, and they’re completely open source, so they’re open to anybody that would like to use them. Moving on into the center, we have InfluxEnterprise, and this is where we start to define features that we see as useful for enterprises or organizations. So if you need something like scaling out to millions or hundreds of millions of series and you have extremely high write workloads, or you need high availability, or you need advanced backup and restore features, we see those things as a part of InfluxEnterprise. And that can run in any cloud, it runs on premises, however you really want to set it up. But the main key piece about it is that you have to manage it yourself, but you do get this high availability.
Michael DeSa 05:09.453 And then all the way over to the left here, we have InfluxCloud, which is a great place to start if you don’t want to host it yourself. You want to just get something up and running and start going forward. InfluxCloud is a really good place to start. It’s everything you get in InfluxEnterprise, and on top of that we manage it for you and we offer some support for anything that may happen during your experience. So if you’re having a particular issue or you want some help doing a little bit of design, then you have access to our support team. So just to highlight that again, we have three product offerings. The TICK Stack, which is the open source version of each of the components. Then we have Enterprise, which is the Enterprise feature set associated with each of the components. So things like high availability, scale-out, and advanced backup and restore procedures. And then we have InfluxCloud, which is everything you get in InfluxEnterprise and then it’s just hosted for you. We do the hosting and keep things alive for you. Just to separate those out.
Michael DeSa 06:27.214 So to go into a little bit more detail about what InfluxEnterprise is: it is the full set of components of the TICK Stack, Telegraf, InfluxDB, Chronograf, and Kapacitor. And there are some additional features that come with that. You get the clustering associated with InfluxDB, which can be used for high availability or scale-out performance. And then you get the clustering capabilities of Kapacitor, which are currently just for high availability. You get enhanced backup and restore procedures. You get more features in Chronograf. You have fine-grained auth so you can specify certain measurements or tags that certain users can see. You have enhanced security and we’ve done a little bit more battle testing and hardening of the clustering components of it. And you get enhanced production deployment capabilities. So, as I said before, all of these things that we’ve just talked about here in Enterprise, you also have in InfluxCloud. So if you want all of these InfluxEnterprise features but you don’t want to host it yourself, InfluxCloud is a great place to sort of move into.
Michael DeSa 07:41.605 So one thing I want to bring up here is this thing about the time series index. So just recently, we had an addition to the database, which is called TSI. It is a disk-based index for your series. And essentially, what this has done is it has allowed you to scale out to very high cardinality workloads. So previously, InfluxDB really assumed that you were in the low millions of series, not more than that, and we’d start to see various kinds of issues when somebody got larger than that. And so we’ve designed this to be something that scales not to the millions but into the billions. So if you’re doing a hundred million series or a billion series, we really want to work towards that use case because that’s the way we see things progressing. And so what this has allowed people to do is scale out the number of series with TSI, so you can get millions and millions of series into your InfluxDB. Whereas previously, you would have had to scale out the cluster to increase the number of series that we would allow for. So just to give a little bit of a better explanation of that, it used to be the case that you had about 1.5 million series per 16 gigs of RAM per replication factor, and that is no longer the case. You can have hundreds of millions or billions of series on a single node and then across clusters. So it used to be that you needed to scale out to many nodes to be able to accommodate that kind of workload, and you can now do that with not as many nodes.
Michael DeSa 09:32.511 There are some downsides to enabling TSI: as soon as you do that you’re going to see increased CPU load, and that applies to both the open source and the closed source offerings, and you’re going to see increased disk I/O. Not by much, but a little bit. So just things to be aware of. And again, just to give you a final kind of picture of how all of these components come together, just in case you’re not familiar with the TICK Stack. In our offerings, we have Telegraf, which is the agent for collecting and recording metrics and events, and it can feed data on into InfluxDB or Kapacitor. InfluxDB is the database. It’s the durable storage for the data that comes through the TICK Stack. Kapacitor is a real-time stream processing engine, so it processes data that’s coming into InfluxDB. And then Chronograf is the way that you visualize the data that is in your instance, or manage and create alerts that are running in Kapacitor. I believe I forgot to mention that Kapacitor can be used to trigger alerts, so probably the most common way people use it is doing alerting on certain criteria.
Michael DeSa 10:50.366 So we’re going to talk a little bit about the InfluxEnterprise architecture and then we’ll go kind of through the installation process. So the installation process has three components. The first component is the data nodes, the second component is the meta nodes, and then you have Chronograf, which you use to kind of interact with these things. So to run an InfluxDB cluster, you only need the meta and data nodes. You do not need Chronograf. Chronograf is just kind of a little bit of icing on the cake, a way you can administer the rest of the cluster. So as I mentioned, there are two types of nodes: meta nodes and data nodes. Meta nodes keep state consistent across the cluster. So what type of state am I talking about here? The state that it keeps consistent is users, databases, continuous queries, retention policies, and shard locations and servers. So the one thing that you should note here is that the actual time series data is not held in a consistent state. So that’s why we’ve kind of split things out into these two independent components. There’s a consistent part of the cluster and that’s the things that I’ve listed here: users, databases, continuous queries, and so forth. And then there are the data nodes, where the data is held in an eventually consistent state. And so that’s kind of the way you should think about it. In order to achieve high availability for a meta node cluster, you need to have at least three meta nodes, and then it is very important that you always have an odd number of meta nodes in your cluster. So in the case of a three meta node cluster, you are resilient to the failure of one meta node, and if you’d like to be more resilient than that, you can increase to five to be resilient to two meta nodes going down. So something to be aware of whenever you’re designing a cluster.
Typically, we see users that have three, maybe five, meta nodes in a cluster, but typically three. And the important thing to know about the meta nodes is they don’t need a ton of resources. They are mostly just keeping state consistent across the cluster and there are very few requests coming through them. So, for example, in our cloud service, I believe we run these on T2 smalls.
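The three-versus-five arithmetic above follows from majority voting. The snippet below is a minimal sketch of that calculation, assuming the usual Raft rule that a majority of meta nodes must stay reachable; the function names are illustrative and not part of any InfluxData tool.

```python
def quorum_size(meta_nodes: int) -> int:
    """Votes needed for a majority among the meta nodes."""
    return meta_nodes // 2 + 1


def tolerated_failures(meta_nodes: int) -> int:
    """Meta nodes that can go down while a majority is still reachable."""
    return meta_nodes - quorum_size(meta_nodes)


if __name__ == "__main__":
    for n in (1, 2, 3, 4, 5):
        print(f"{n} meta nodes: quorum {quorum_size(n)}, "
              f"tolerates {tolerated_failures(n)} failure(s)")
```

Under this rule a fourth meta node tolerates no more failures than three do, which is one way to see why an odd count is recommended.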
Michael DeSa 13:35.741 So then we get to data nodes. Data nodes store the actual time series data. They respond to queries. They do not participate in consensus, so their data is not held in a strongly consistent state, and in order to have a highly available cluster, you need at least two data nodes. So where you would need three, or some other odd number of, meta nodes, you can have any number of data nodes. So you can have two, three, four; even or odd is perfectly fine. And data nodes are the nodes that need a large amount of resources, and what do I mean by a large amount of resources? It can be really anything. It kind of depends on your workload, but typically we recommend at least 8 cores and at least 16 gigs of RAM for your data nodes, unless maybe your workload is a bit smaller and you just need the high availability. So in that case, you may be able to get away with something a bit smaller.
Michael DeSa 14:36.763 And then finally, we have Chronograf, which is the way that you can do user management. We got some prebuilt dashboards. We got custom dashboards, database management, and retention policy management all for the Enterprise part of things, which is nice to have. In order to have a complete InfluxEnterprise installation, you should have a dedicated InfluxDB instance that monitors the InfluxDB cluster. So this is how we—the InfluxDB cluster is going to be the thing that’s monitoring the rest of your stack, and we recommend that you have a separate InfluxDB node that is used to monitor the monitor—or monitor your cluster, right? So just in case anything goes wrong, we want to be able to know what’s happening on each of those nodes. We want to see if there’s memory utilization problems or throughput problems or network problems, and the only way we can do that is to have a separate InfluxDB cluster that is monitoring your—or separate InfluxDB node instance that is monitoring your cluster. So separate InfluxDB instance, we put a Telegraf on each of the various nodes in the cluster and then we have Chronograf for doing visualization of that data.
Michael DeSa 16:00.334 Some general cluster advice. Whenever you’re building a cluster you’re going to want to put a load balancer in front of all of your data nodes so that queries can be spread across each of the nodes in the cluster. So one thing that we see sometimes is people will pick one node and then do all their writes to it and just have the data be spread across the cluster through the node doing its replication. You can do this. This is not something that we recommend. The main reason for that is you’re going to see one of your nodes extremely over-utilized and your other nodes underutilized, and for that reason, we recommend you put an LB in front of all of your data nodes so that requests can be spread across them. The next thing to be aware of whenever you’re designing a cluster is that having a higher replication factor, so more copies of your data, will result in lower query latency, so your queries will be faster, but higher write latency, so your writes will be slower. And that kind of depends on the consistency that you decide to write your data with. But typically, if you don’t specify anything it does consistency any, which will be just as fast but you’ll see your general cluster slow down a bit more. And so for that reason, we recommend that if you want to have very fast queries and you’re okay with slightly slower writes, you have a higher replication factor, and if the opposite is true then you want to have a lower replication factor. So as I mentioned here, the minimum requirements for a data node are 4 CPU cores and 16 gigs of RAM, and then for meta nodes we really only need 1 CPU core and 2 gigs of RAM, but you do need to have an odd number of meta nodes in your cluster and that is a very key point. So data nodes, you can have as many as you want. Meta nodes must have an odd number.
Michael DeSa 18:10.107 So the next thing that we get to here are shards. So the data in the database is sharded by time. And so in a cluster, all of the meta nodes have their state and then the data nodes have the particular shards. And a shard defines a particular time block, and the shards are the things that we distribute across the cluster. So when I’m talking about a replication factor of 2, or a replication factor of 3, I’m saying that a particular shard exists in two or three places, as it mentions here. So if I have a replication factor of 2, it means that shard 1 will be on both data node 1 and data node 2. As for how you work with replication factors, you can do it in one of two ways. If I’m making a brand-new database, I can say create database mydb with replication and I give the replication that I’d like it to have. And that will set the replication factor on the autogen retention policy. And if I already have a database and I want to create a new retention policy that has a particular replication factor, I can say create retention policy myrp on mydb, duration one hour, replication, and then the size of the replication. In this case, we’re choosing one.
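Written out as InfluxQL, the two statements described above might look like the strings built below. This is a small sketch in Python: the helper names are hypothetical, and `shard_owners` is a toy round-robin placement used only to illustrate what a replication factor means, not the cluster’s actual shard assignment logic.

```python
def create_database(db: str, replication: int) -> str:
    """InfluxQL for a new database whose default policy has this replication."""
    return f'CREATE DATABASE "{db}" WITH REPLICATION {replication}'


def create_retention_policy(rp: str, db: str, duration: str, replication: int) -> str:
    """InfluxQL for a new retention policy on an existing database."""
    return (f'CREATE RETENTION POLICY "{rp}" ON "{db}" '
            f'DURATION {duration} REPLICATION {replication}')


def shard_owners(data_nodes: list, replication: int, shard_id: int) -> list:
    """Toy placement (an assumption): pick `replication` distinct nodes round-robin."""
    if not 1 <= replication <= len(data_nodes):
        raise ValueError("replication must be between 1 and the number of data nodes")
    start = shard_id % len(data_nodes)
    return [data_nodes[(start + i) % len(data_nodes)] for i in range(replication)]


if __name__ == "__main__":
    print(create_database("mydb", 2))
    print(create_retention_policy("myrp", "mydb", "1h", 1))
    print(shard_owners(["data1", "data2"], replication=2, shard_id=1))
```

With two data nodes and a replication factor of 2, every shard ends up on both nodes, which matches the example in the talk.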
Michael DeSa 19:47.371 So when you’re setting up a two-node data cluster, there’s a few things that you’re going to want to do. The first thing you’re going to do is you’re going to get a license key. You’re going to start five machines, three meta node instances, two data instances. And then for the meta instances, what you’re going to do is download the package, configure the nodes, start the nodes, join the nodes. And then for the data nodes, what you’re going to do is download the package, configure the nodes, start the nodes, add the nodes. So the two pieces that are a little bit different there, right, are the—in the meta instances, you’re joining them to the cluster, and the data instances, you’re adding the nodes. So here is just a little bit of what InfluxEnterprise Kapacitor is. Enterprise Kapacitor is just a highly available version of Kapacitor, and what that means is your data will split into two different Kapacitors. Each Kapacitor will generate its own alert, but since they’re joined together, we have a way to deduplicate that alert. So only one alert will end up coming out of the cluster. So the same data comes into each of the nodes, each of the nodes generates an alert, but only one of the alerts ever ends up triggering any kind of action or reaching any kind of user. So, currently, Enterprise Kapacitor is only supported for high availability. We’re working on building a scale-out version of Kapacitor. So if you want to have many, many different Kapacitor nodes that are working kind of independently of one another, doing different kind of workloads, you can scale out your cluster that way. So that is on the roadmap. So now I’m going to move into a brief demo—
Chris Churilo 21:43.103 Why don’t we do some of the Q&A right before?
Michael DeSa 21:46.400 Sure, I think that’s a good idea. So there’s one question that I have here, which is: Can you just use the InfluxDB for storage and use other streaming technology, so like Kafka, RabbitMQ, etc.? And same for Kapacitor, say Spark for processing the stream of data. Yeah. You entirely could. That is something that we see, and that’s why we have the components independent of one another. So the reason why—or where I would kind of start drawing the line is even if you’re already using, say, Spark or Samza and Kafka and all of these things, you can still use the rest of the components of the TICK Stack to kind of just get the ball rolling, right? So one thing that we see a lot of the time is maybe the integration with Spark is a little bit hard to set up or the one with Storm takes a little bit or a user isn’t using Spark or they’re not using Storm and so, for that reason, we don’t want users to have to think about, “Which stream processing system do I have to use?” We have an answer for you. You can just use Kapacitor, right? So it’s not that we don’t intend for these products to be—InfluxData or Kapacitor to be used with other sort of technologies. We are more than happy to integrate and work with them. Well, the way we think about it is if you haven’t made a decision or you don’t want to start using Spark, we want to have something that’s super lightweight that you can just start running with. So you can get up and running and focus on your problem, rather than having to think about, “Which technology I need to choose.” You know that there’s these four components and they all will work well together basically out of the box. And so that’s the way that I kind of think about the TICK Stack and what not. So to answer the question kind of in a simple line, yes, you could entirely just use InfluxDB for storage and then if you already have other components that you use, you can use those for their associated roles. 
And that’s something that we see many people do. Any other questions?
Michael DeSa 24:33.990 All right. With that, I will assume there are no more questions and I’ll start sharing my terminal. So we’re just going to do a smaller version of the installation of InfluxEnterprise, and so I’m going to be running off of the local versions of the software that I have, but they’re exactly the same as the ones you can download in various packages. So one thing to take a look at is this meta configuration file. So a few things to be aware of are the registration URL, the license key, and the license path. So as we mentioned, the first thing you’re going to want to do is to get a license key to use. In this case, I have a special version of the software that does not require a license key, but you’ll want to put your license key here. So your license key. Alternatively, if you would like to specify a specific license file, you can do that as well. And then if you plan on using any kind of authentication across your nodes, you’re going to want to have this shared-secret and internal-shared-secret. Internal-shared-secret is only applied across meta nodes, and then the shared-secret is applied across the cluster and with Kapacitor as well.
Michael DeSa 26:16.556 So, as mentioned, the first thing that we’re going to want to do is start a meta node. So here I have started a meta node. Here I have started it in single-server mode; in a production environment, you’ll want to start three of these. And then the process from that is you will do influxd-ctl, and then we can just do this here to see all the commands that we have available. We’re going to want to use this, add-meta. So you’ll say, influxd-ctl add-meta, and then you add the meta node address, which in this case, I believe, is localhost:8091. So you would do this for each of your meta nodes. If I then do influxd-ctl show, I can see the two meta nodes. In this case, I already have a couple of data nodes that I’ve joined here. Let’s see. The question is: “Do you have a Docker image that can be used for deployment?” Yes, we do. They should be listed on the website or once you sign up for a license key. So the next thing that we’re going to do is we’re going to start our two data nodes. And so now we can see we have our meta nodes, which in this case, we just have one. And we’re going to add our two data nodes. So to do that, we’re going to do add-data, and then localhost. And I’ve just played with the ports here a little bit since I’m running locally. It’s 7088. And then I’m going to do my show command here and I see that I’ve added data node two. And then I’m going to do my final one, show. And what I’m doing each time I issue the show command is I’m printing out the current state of the InfluxData cluster. And so, in this case, I’ve got my single meta node with my two data nodes here. I can then say, influx -host localhost, and we’ll do 7086. Port 7086.
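The join sequence from the demo can be summarized as an ordered list of `influxd-ctl` commands. The sketch below only builds the command strings and does not run anything. The host names and the 8088 data node port are placeholders chosen for illustration (the demo used altered local ports), and `join_plan` is a hypothetical helper, not part of the InfluxEnterprise tooling.

```python
def join_plan(meta_addrs: list, data_addrs: list) -> list:
    """Return the influxd-ctl commands, in order, to assemble a cluster."""
    if len(meta_addrs) % 2 == 0:
        raise ValueError("use an odd number of meta nodes")
    commands = [f"influxd-ctl add-meta {addr}" for addr in meta_addrs]
    commands += [f"influxd-ctl add-data {addr}" for addr in data_addrs]
    # Print the cluster membership once everything has been added.
    commands.append("influxd-ctl show")
    return commands


if __name__ == "__main__":
    # Placeholder addresses for a three meta node, two data node layout.
    metas = ["meta-01:8091", "meta-02:8091", "meta-03:8091"]
    datas = ["data-01:8088", "data-02:8088"]
    for command in join_plan(metas, datas):
        print(command)
```

Meta nodes are joined before data nodes, mirroring the order used in the demo.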
Michael DeSa 29:40.738 All right. So what I can do here is, all right, let’s do create database mydb. And then we’re going to use mydb and we’re going to do insert x mymeasure. Say, x equals 1. And I’m going to do show shards, and here we can see that the data has been replicated. We can see that we have my database mydb that has owners two and three for that data. There’s a couple of other commands that are useful to know about in the influxd-ctl world. If I want to do something like, say, show the shards, I can show which ones are available, and then if I want to truncate the shards, say if I want to do something like a rebalance or I want to create cold shards, what I can do is an influxd-ctl truncate-shards, and it will truncate the shard. So then when I reissue that show shards command, I get my two copies of the shard. There you go. All right. That’s basically what I’ve got in terms of the setup there. Whoops. Let me stop sharing.
Michael DeSa 31:44.393 So the next question that I have here is: Is any one of the meta nodes a master? Where is the load balancer distributing the query to data nodes? So the consensus protocol that the meta nodes use is called Raft, and there’s a concept of a Raft leader. But who the leader is can change over time. So suppose, for some reason, the leader goes down: there’s an election that takes place and one of the other nodes will be nominated to become the leader node. And as for the second part, where is the load balancer distributing the query to the data nodes? In this example I’ve done here, there was no load balancer. And as I write data, it will just kind of be split across the various nodes. So a write comes into data node A, and what will happen is it’ll attempt to write the data to data node B if it sees that data node B has a copy of the data. If that fails, it’ll go into a hinted handoff, or if you had consistency, say, all, that request would fail and come back to the client. Does that answer your question, Raj? Awesome. Let’s see. Are there any more questions? Happy to answer anything else. Doesn’t even need to be cluster-related. I’ll answer anything at all.
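The write path described in that answer can be modeled roughly as follows. This is a simplified sketch, assuming the `any`, `one`, `quorum`, and `all` write consistency levels and treating a hinted handoff as a queue of owners that could not be reached. The function names and acknowledgement counts are illustrative assumptions, not the cluster’s real implementation.

```python
def required_acks(consistency: str, replication: int) -> int:
    """Owner acknowledgements needed before the client sees success."""
    levels = {
        "any": 0,  # a queued hinted handoff is enough
        "one": 1,
        "quorum": replication // 2 + 1,
        "all": replication,
    }
    return levels[consistency]


def write_point(owners_up: dict, consistency: str):
    """owners_up maps each shard owner to whether it is reachable.

    Returns (success, owners whose copy is queued for hinted handoff).
    """
    acked = [node for node, up in owners_up.items() if up]
    hinted = [node for node, up in owners_up.items() if not up]
    success = len(acked) >= required_acks(consistency, len(owners_up))
    return success, hinted


if __name__ == "__main__":
    owners = {"A": True, "B": False}  # data node B is unreachable
    print(write_point(owners, "one"))
    print(write_point(owners, "all"))
```

With data node B down, the write at consistency `one` succeeds and B’s copy waits in the handoff queue, while the same write at consistency `all` is reported back to the client as failed.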
Chris Churilo 33:44.777 So I think you guys should definitely take advantage of the next few minutes. We’re going to keep the lines open. As Michael said, he’ll answer any question related to any components of the TICK Stack. This guy has done a lot of training, so he’s seen a lot of different customer configurations. So I would take advantage of his brain right now.
Michael DeSa 34:09.247 Can you hear me? Hello? Can you hear me? Okay. Good. Cool. For some reason, my mic’s not showing up as working, but—when will Telegraf 1.5 be released? I believe 1.4.1 was just released. If I had to take a guess, I’d say somewhere in the next one or two months. Is there any particular reason why you need Telegraf 1.5? Is there a specific feature in there? I’m not sure what’s on the roadmap for 1.5 actually of Telegraf. But if I had to take a guess, I’d say some time in the next two months. I see. I don’t know the timeline for that making it into it. You should definitely ask in our community site, and we can give a lot better kind of updates around sort of when we can imagine something will land.
Michael DeSa 35:41.109 Any more questions?
Chris Churilo 35:50.660 Okay then with that, I think we will conclude our session today. Thanks everybody for joining us. I will post both the slides and the recording for everyone. Look, if you have any other questions, please do post it in the community site. I will also make sure that I post that 1.5 question in the community site. And thanks again for attending, and we hope you do really successful things with InfluxEnterprise. Bye, everyone.
Michael DeSa 36:27.210 Bye.