In this webinar, Michael DeSa will provide an introduction to the components of the TICK Stack and a review of the features of InfluxEnterprise and InfluxCloud. He will also demo how to install the TICK Stack.
Watch the webinar “Introduction to the TICK Stack and InfluxEnterprise” by filling out the form and clicking on the download button on the right. This will open the recording.
Here is a transcript of the webinar “Introduction to the TICK Stack and InfluxEnterprise.” This is provided for those who prefer to read rather than watch the webinar. Please note that the transcript is raw; we apologize for any transcription errors.
• Chris Churilo: Director Product Marketing, InfluxData
• Michael DeSa: Software Engineer, InfluxData
Michael DeSa 00:00.705 Perfect. Awesome. So as Chris mentioned, today is the introduction to the TICK Stack and InfluxEnterprise. And so what we’re going to be covering today is: What is the TICK Stack? So we’ll talk about each of the components, the things they do, their roles and responsibilities. Then we’ll move into what InfluxEnterprise and InfluxCloud are, and what the differences between them are. And then the final thing we’ll close it all out with is a demo of the whole TICK Stack in use. So that’ll be each of the components, which we’ll go over in just a moment. So just to give you a little bit of an idea of who InfluxData is: we’re a modern open source platform, built from the ground up for metrics and events. This is where we see the world going, and how we see the space growing: there’s a lot of high-volume real-time traffic. And this can come in as irregular data, which are events—random things happening at different intervals—as well as regular metrics, things that come in at a fixed, known interval. And then with this kind of new workload, it’s very common that you maybe only care about the last month’s worth of data, or the last year’s worth of data. It’s not something you want to keep indefinitely. And so we’ve designed our system to make dropping large amounts of data from a certain period very easy to do. We’ve started moving into the space where time-based queries are becoming more and more popular, so you want to ask questions based off of some aspect of time. How many events did I see in the last hour, or the last minute? And this is different from the types of relational workloads that we had seen previously. And then we also see a world where we need things to be able to scale up, and scale to millions and millions of writes, and billions and billions of individual series or elements that we’re tracking.
That’s kind of just the way that we see the world, and that’s what we’re kind of building towards. It’s the space where we have this new workload, we have time-based queries, and we need to scale out performance.
Michael DeSa 02:26.735 So to start, we’re going to cover each of the components of the TICK Stack. And we’ll start with this picture here. Just to give you a little bit of an idea, TICK is Telegraf, InfluxDB, Chronograf, and Kapacitor. So T, I, C, and K. So just to give you an idea of how all the pieces come together: off to the left here we have Telegraf, which is collecting metrics—things like database statistics, system statistics, things about message queues, various application statistics—and then it feeds that data into the rest of the TICK Stack. Some of that data can go on to Kapacitor, or it can go into InfluxDB. If it’s going into InfluxDB, we’re putting it into durable, long-term storage. Whereas if we’re moving it into Kapacitor, we’re going to do some kind of processing of that data in real time.
Michael DeSa 03:27.192 So then we have this interaction between InfluxDB, which will actually store the raw time series data, and Kapacitor, which will be processing that data. Kapacitor can process this data in one of two ways, which we’ll see more and more about as we continue on here. It can process that data in batches, which is querying data that is in InfluxDB, and it can also subscribe to any writes that may be coming into InfluxDB. So a write from Telegraf to InfluxDB will be mirrored into Kapacitor. And Kapacitor is the place where you can think about doing anomaly detection, pulling in any kind of Prometheus-style metrics. You can do machine learning things, have user-defined functions, kick off alerts to a number of services. It’s really the processing hub of the TICK Stack. And then the final piece off to the top right here is Chronograf, which is the UI layer for the rest of the TICK Stack. So you can make dashboards, you can create alerts, you can manage users, you can configure and manage your InfluxDB instances or Kapacitor instances, all through Chronograf. So just to give you a high-level overview of this again: Telegraf is the way you collect data, InfluxDB is the way you store data, Kapacitor is the way that we process or downsample the data, and Chronograf is the way that we visualize and manage the rest of the components of the TICK Stack.
Michael DeSa 05:14.121 So why use TICK? Why would somebody want to use TICK? I was actually just talking to a friend about this yesterday—it’s very, very easy to use. If you’ve ever used other systems, setting them up, managing them, configuring them is a little bit painful to do. With Influx, or with the TICK Stack, it is very easy. Each of the components works very well together, and it makes the setup and management a breeze. Kind of in line with that, we have no external database dependencies, so you don’t have to manage a separate Cassandra cluster or a separate HBase cluster, and have maybe a layer on top of that where you write your data into it. It’s just the TICK Stack. There’s nothing external. It’s just InfluxDB, it’s just Kapacitor, it’s just Chronograf. They’re all single binaries that you use in conjunction with one another. It allows for both metrics and events, being regular and irregular time series. Some systems are specifically tuned to work for metrics or for events. We take the middle ground: we allow for both. We are horizontally scalable, so if you need to scale out performance you can do so. We’re a full platform, not just a single tool, and I think that’s an important thing to keep in mind there. That goes in line with that ease of use. All the tools together, used in conjunction, really solve the whole problem, rather than just parts of the problem that are loosely stitched together. Each of these tools was built to work with the others, and therefore it’s a very versatile and very robust platform.
Michael DeSa 07:00.963 And finally, within the metrics and monitoring space, there’s a big debate of pull versus push as the style for collecting metrics. Pull means the InfluxDB instance or the cluster is aware of all of the metrics that it needs to collect, and it goes out and collects them from each of those services on some interval. Push means they all just write to a central location. So in one, the cluster pulls data from all of the things it’s monitoring; in the other, it has that data pushed to it. There’s a big debate around this in the metrics and monitoring space, and there are pros and cons to each of them. The TICK Stack allows you to do both. So if you want to do push, you’re more than welcome to. If you want to do pull, you can do that as well.
Michael DeSa 07:58.819 So revisiting in a bit more detail, that left component that we had talked about there is Telegraf. Telegraf is an agent for collecting and reporting metrics and events. It’s plugin-driven, so there are 80-plus plugins where you can pull data from or write data to. It can pull metrics from third-party APIs, or listen for metrics on a particular port, or pull data from Kafka. It has a number of different output plugins, so it can send data to InfluxDB, or Graphite, or OpenTSDB, Datadog, Librato, Kafka, MQTT, NSQ, and there are many, many more. It has a very minimal memory footprint, so it doesn’t consume a lot of memory or CPU. And it’s entirely written in Go, which means that it compiles to a single binary with no external dependencies. So the way you can think about Telegraf is it’s a nice binary you just deploy out on the edge of your network and have it report your data back to whatever service you’re using as your datastore.
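To make that concrete, here is a minimal sketch of what a Telegraf configuration file might look like, assuming a local InfluxDB on its default port; the specific plugins and database name are illustrative, not taken from the webinar:

```toml
# Agent settings: collect metrics every 10 seconds
[agent]
  interval = "10s"

# Input plugins: system-level CPU, memory, and disk stats
[[inputs.cpu]]
  percpu = true
  totalcpu = true

[[inputs.mem]]

[[inputs.disk]]

# Output plugin: write collected metrics to a local InfluxDB
[[outputs.influxdb]]
  urls = ["http://localhost:8086"]
  database = "telegraf"
```

With a file like this in place, running `telegraf --config telegraf.conf` starts the agent collecting and reporting on the configured interval.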
Michael DeSa 09:13.453 One thing I should mention here is if you’re looking—you’d like to start contributing to the TICK Stack, or if you want to get a little bit more familiar with the pieces of the TICK Stack, Telegraf is very often a great way to do this. We accept many, many pull requests. I actually believe that the majority of the plugins that exist for Telegraf were not written by us. We maintain the code, but the plugins themselves were not written by us. So it’s a great place where—say you’ve got a service that you’d like to use and it’s not out there, you can contribute a plugin and sort of get integrated into the rest of the TICK Stack.
Michael DeSa 09:58.427 So now we’re going to move on to InfluxDB. InfluxDB is the purpose-built time series database. It’s designed specifically to work with time series data. With time series workloads, you usually have very large amounts of individual inserts as well as bulk reads. So you’re usually not reading out one or two points. You’re usually pulling out many, many points or doing aggregations across those points. Kind of in line with that, we built our own storage engine that allows for this high ingest rate and data compression. It’s entirely written in Go and it compiles to a single binary with no external dependencies. We have an expressive, SQL-like query language that’s tailored for querying and aggregating data. So whether that’s grouping things into 10-minute buckets for the last hour and taking averages over them, or, say, grabbing the last point from a particular series—all these things are very efficient. Its data model has a way to index data, and that’s through what we call tags, which allows for very fast and efficient queries; and it has retention policies, so you can auto-expire data as it becomes stale. So if, say, every six months I want to expire five days’ worth of data, I can do so easily. And with this new development that we’ve had, we have a disk-based index that allows us to store billions and billions of series. This is particularly important for the container world, where there are usually containers coming up and down, and you want to monitor each individual container. In older systems, doing so was not particularly great, because those short-lived containers would live in the memory index indefinitely, and so you would get a lot of memory bloat over time. And so we’ve developed a disk-based index that allows for these sorts of ephemeral time series. By ephemeral, we mean short-lived series, coming up and down, that we would monitor.
So we’ve just recently developed TSI, the Time Series Index, which is a configurable option in InfluxDB. If this is the kind of workload that you’re seeing, we would encourage you to try it out. That being said, it is still currently in a beta stage, and all of the bugs have yet to be completely worked out.
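To sketch what that looks like in practice, here are a couple of InfluxQL statements along the lines of what Michael describes—a retention policy that auto-expires stale data, and a query that averages over 10-minute buckets for the last hour. The database, policy, field, and tag names are illustrative assumptions:

```sql
-- Auto-expire data in this policy once it is more than 30 days old
CREATE RETENTION POLICY "thirty_days" ON "telegraf" DURATION 30d REPLICATION 1

-- Average user CPU usage in 10-minute buckets over the last hour
SELECT MEAN("usage_user")
FROM "cpu"
WHERE "cpu" = 'cpu-total' AND time > now() - 1h
GROUP BY time(10m)
```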
Michael DeSa 12:47.227 So one of the questions that I get all the time is, “Well, what kind of performance can I see?” And it’s a little bit hard to give strict numbers around that, but just to give you an idea of what we consider a low, moderate, and high load: if you’re doing less than 250,000 writes per second, maybe 25 queries per second, and you’ve got a cardinality of less than 100,000 without TSI, we would consider that in the low category. That’s not a particularly high throughput. That’s a baseline of performance. Where we start to creep up is if we’re doing more than 500,000 writes a second, more than 25 queries a second, and more than 100,000 series—that’s starting to get into a moderate load. And then if you’re doing anything above 750,000 writes a second, more than 50 queries a second, and you have a cardinality of greater than 1 million, then we’re starting to creep into high territory. So that gives you a little bit of an idea of where we place the performance of the product. You may see this column off to the right here, which is cardinality with TSI. With TSI, this cardinality question is really removed. It’s been designed to work for 1 billion-plus series, and that should be universal across low, moderate, and high workloads.
Michael DeSa 14:24.263 The next question I get after talking about performance is, “What kind of hardware should I be using for each of these?” For low throughput, low load, I would consider maybe 4-core CPUs, 4 to 8 gigs of RAM, and about 500 IOPS. And I should note here that these disks must be SSDs. Moderate: 6 cores, 8 to 16 gigs of RAM, and 500 to 1,000 IOPS. For a high load you’re going to want 8-plus cores, 16 or more gigs of RAM, and 1,000-plus IOPS on your SSD.
Michael DeSa 15:03.078 And then next we have Chronograf. Chronograf is an integrated experience for the TICK Stack. It has administrative capabilities—creating databases, modifying retention policies, user management, alert configuration, Kapacitor management—as well as data exploration. So you can make visualizations of queries and then create custom dashboards off of those visualizations. And we have a number of pre-built dashboards based on metrics coming from Telegraf. So one of the ways these things are all very tightly coupled together—or maybe, more loosely, integrated well would be a better way to say that—is that if you’re using Telegraf and we can see that you’re using the CPU plugin and the memory plugin, we know what those graphs should probably look like. And so we can just provide you with a dashboard that shows you that data in a reasonable way, for free. And that’s one of the things that we always keep in mind: how can we get you to value as quickly as possible? I think we have a little bit of an internal motto that we want to have you go from zero to monitoring in about 20 seconds. And we’re fairly close to that. And then the final thing is down here—you can create Kapacitor tasks. Tasks are our unit of work; we’ll talk about that in just a moment. And you can do this in one of two ways. You can set up what we call rules, which are like pre-built templates for particular things. So if something crosses this threshold, or if there’s a relative difference of this value, we have ways to rapidly create those types of tasks. And on top of that, in the new versions of Chronograf we just recently released a TICKscript editor—TICKscript being the DSL for writing the tasks that Kapacitor uses.
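For a sense of what those threshold rules amount to, a rule like the ones Chronograf’s builder generates might look roughly like this in TICKscript; the measurement, field, threshold, and alert target here are illustrative assumptions, not the exact output from the webinar:

```tickscript
// Stream task: watch incoming CPU points as they are written
stream
    |from()
        .measurement('cpu')
    |alert()
        // Fire a critical alert when user CPU usage crosses 90%
        .crit(lambda: "usage_user" > 90)
        // Send the alert to Slack
        .slack()
```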
Michael DeSa 17:19.889 Next, we’ll talk about Kapacitor. Kapacitor is an open source data processing framework that makes it easy to create alerts, run ETL jobs, and detect anomalies. It’s used to downsample data. Previously we had a concept called Continuous Queries in the database. That concept still exists, but we’re recommending that people no longer use it, and that they switch to using Kapacitor for any of their downsampling. It can query data from InfluxDB on a schedule and then receive that data and process it in a number of different ways. It can do any kind of data transformation—I believe it’s Turing complete—so you can add a couple of tags or modify values, anything you really want to do. And then once you’ve done all that, you can store that data back into InfluxDB. It has a way to define user-defined functions. So this could be anything from running your own machine-learning algorithms to doing some basic statistics to detect anomalies. It’s compatible with Prometheus scrapers, so if you want to start scraping Prometheus targets you can do so with Kapacitor. And it integrates with a number of different alerting and chat services like HipChat, OpsGenie, Alerta, Slack, Sensu, PagerDuty, and many, many more, or you can generate arbitrary HTTP requests.
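As a sketch of that downsampling workflow—querying InfluxDB on a schedule and writing the result back—a Kapacitor batch task might look something like the following; the database, retention policy, and field names are illustrative assumptions:

```tickscript
// Batch task: every 5 minutes, query the last 5 minutes of raw CPU data,
// average it, and write the result back into a downsampled retention policy
batch
    |query('SELECT mean("usage_user") AS usage_user FROM "telegraf"."autogen"."cpu"')
        .period(5m)
        .every(5m)
        .groupBy(*)
    |influxDBOut()
        .database('telegraf')
        .retentionPolicy('downsampled')
        .measurement('cpu')
```

Replacing a Continuous Query typically means moving the CQ’s SELECT into the `query` node like this and letting Kapacitor handle the scheduling.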
Michael DeSa 18:52.009 So the next question I get asked all the time about Kapacitor is, “What kind of performance can I get out of Kapacitor?” We consider anything that’s less than 10 tasks and less than 100,000 writes per second, with a cardinality of less than 100,000, to be low Kapacitor usage. Moderate is less than 100 tasks, greater than 100,000 writes per second, and greater than 100,000 cardinality. And high would be greater than 100 tasks, greater than 300,000 writes per second, and greater than one million in cardinality. And what kind of hardware does this translate to? Kapacitor is actually 100% in-memory, so it doesn’t really need a disk. You should have a disk associated with it, but it doesn’t write a massive amount of things to disk. So you don’t have to worry about disk too much here. But you do want to have a sufficient amount of RAM. If you’re doing a low workload, 4 to 8 gigs of RAM is probably enough. For a moderate workload you probably need about 8 to 16, and for a high workload you can need 16-plus. How these all end up working out comes down to the types of tasks you’re doing, how much data you end up buffering on the server, and a number of other factors. So these numbers can vary based off of the types of tasks that you’re doing.
Michael DeSa 20:17.973 So now that we’ve covered each of the components of the TICK Stack—we have Telegraf, InfluxDB, Chronograf, and Kapacitor—we’ll talk a little bit about what InfluxEnterprise is. This graph here makes it a little bit easier to see. InfluxData, the platform, is the core components that we’ve talked about: the T, I, C, and K. Where we start to move into the Enterprise section of things is when you start to think about things like high availability, or building a cluster for scaled-up performance, or you need access control, or you want extra support in what you’re doing, or more manageability of your entire cluster, and incremental backups. The way that breaks down is: in InfluxDB there’s clustering. The clustering can be used to do one of two things, which is high availability or scalability. So if you need to scale out a cluster for performance, you can do so. Or if you just need to have duplicate copies of data, we can do that as well. The next point beneath that is Kapacitor clustering. Kapacitor clustering is separate from the InfluxDB clustering. Currently, Kapacitor clustering is predominately based around high availability. So if you want your alerting and processing system to be durable to loss of data, that’s going to be done through Kapacitor clustering. Scalability is in the works, though it has yet to be released for Kapacitor clustering. And we have enhanced backup and restore procedures for Enterprise, as well as fine-grained auth, enhanced security, a battle-tested and hardened codebase, and enhanced production deployment capabilities.
Michael DeSa 22:13.777 So what is InfluxCloud? InfluxCloud is everything that you get with InfluxEnterprise, plus a built-in Chronograf that has some statistics about the running cluster. It’s got basic alerting. It has a Kapacitor add-on, so if you would like higher throughput you can get that as an add-on. It’s fully managed by InfluxData. It’s monitored by the InfluxData team 24/7. It’s fully optimized for the TICK Stack, so we pay a lot of attention to the specific configuration options and set those appropriately. And we currently run our cloud services out of AWS. So if you’re also running in AWS, you get very low network latency.
Michael DeSa 23:06.531 So just to give you an idea of the product offerings that we have, we have these three points up at the top for all three pieces—those three pieces being InfluxCloud, InfluxEnterprise, and the TICK Stack. All of them have an open source core, they’re all extensible, and they all support regular and irregular data. Where they start to differ a little bit is that with the open source TICK Stack, you don’t get high availability, you don’t get scalability, you don’t get advanced backup and restore functionality, and you don’t get complete platform support. So if you want support in what you’re doing, you’re definitely going to have to move into the InfluxEnterprise section.
Michael DeSa 23:50.972 InfluxCloud, as we mentioned, is fully managed by the InfluxData team. Enterprise and Open Source are not. And InfluxCloud runs on AWS. Enterprise runs really anywhere you’d like it, on premises or any other cloud provider, and the TICK Stack itself is the same way. You can run it wherever you please.
Michael DeSa 24:15.989 So in that light, I’m going to switch to just a short demo, using a tool that we have called Sandbox.
Chris Churilo 24:25.003 Hey, Michael. So as you’re getting that set up, maybe we can just answer Sunil’s question?
Michael DeSa 24:30.714 Sure. Yeah. That’d be great. So the question is, “I’m using InfluxDB to store my monitoring time series data. I’m using a separate measurement based on use case, not merging all use cases into one measurement. I have around 20 measurements, and I’m getting time-outs, exceptions during read operations after four days—[inaudible] three days equals [inaudible] data.” So it’s a bit hard to tell. How many fields are associated with each measurement, I think, would be the question. Yeah. Five fields? And what kind of queries are you doing? Are you just querying all of the data at once? Or how much of the data are you querying? So you’re doing a SELECT * FROM *, pulling all of the data? Or is it something different? So what would probably be the best way to go about this is if we could maybe open up a community issue, and go through and diagnose it a bit more. Without seeing the exact query that’s being run and knowing a bit more about the schema, it’s a little bit hard to tell. But, yeah. Maybe we should proceed by having a community question where we could get this answered.
Chris Churilo 26:52.960 Okay. Are you going to do your demo now?
Michael DeSa 26:56.591 Yeah. I think there was one more question, which was specifically how aggregations handle irregular time series—specifically, aggregations where the GROUP BY time interval is smaller than the interval between two data points. Also, can I subtract two irregular time series with missing values from each other? So again, I think this would be a great question for our community. So if we can get those set up there—you need a little bit more information to express it, and I think having some textual examples would go a long way.
Michael DeSa 27:36.349 So on that note, I’m going to proceed with my demo. And for this, we’ll just share the whole desktop. Can everybody see my desktop? I hope so. All right. So I’ll make my font bigger. What we’re going to do here is—we basically have a few Docker containers and a Docker Compose file where we can spin up all of the services at once. So I’m just going to run sandbox up, and it’s going to launch each of these services, and then it’s going to open up a browser for me. Oops. It’s going to open up a browser for me.
And then I can see this is Chronograf here, and I want to go through—and I can see a number of the hosts that I have set up. As I mentioned, this is a prebuilt dashboard. Since InfluxDB and Chronograf can see I’m using Telegraf, and I have the InfluxDB service and the CPU and system services set up, it can just start immediately graphing and creating a dashboard for me of these services.
Michael DeSa 29:18.885 I can make a custom query. So here, I’m selecting the usage user from Telegraf for the last 30 days, where CPU is CPU total, so it’s just creating a nice little graph here. I can look at a number of other metrics. So if I want to, say, look at the disk, and I want to look at the free memory—as I can see, my free memory is gradually going down. Sorry. Free disk is going down. As I mentioned, I can have a CPU dashboard. If I want to add a new cell, I can do so, and I can shrink this one here, move it over there. Oops. I can rename it—let’s do CPU usage. We’ll say user. And then let’s edit the graph. So let’s make this one—we’ll do CPU total, and we’ll do usage user. And I can pick between a number of different graph types. So I can do a line, stacked, step plot, a single stat, or a line and a single stat. Let’s do that one. I can also create rules. So if I go over here I can create a rule. And we’ll click on this Create Rule button, and then I select the specific series that I’d like to create a rule for. And let’s do path is slash, and let’s do free. Let’s do when free is less than, we’ll say 50.
Michael DeSa 31:50.229 We’ll also do a deadman. So if data stops showing up for one minute, we’re going to want to make an HTTP request to http://localhost:3000. And once that’s done I can save the rule. And if I want—so those are all the rules. I can click here to enable or disable the rules. I can do database management from here. So if I want to create a new database I can say mydb2, and then it creates a database with the associated retention policies. I can create users—say, a new user—and then I can apply permissions to that specific user. And then I can see what queries are currently running, and I can kill any queries that I’d like. And if I’d like to configure a new InfluxDB instance, I can do so here. And I can set the default Telegraf database, and I can make it the default source if I’d like to.
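For reference, the deadman rule set up in the UI here corresponds to a TICKscript along these lines—a sketch assuming the disk measurement and the localhost endpoint from the demo:

```tickscript
// Deadman: if no data arrives for this measurement within 1 minute,
// post the alert to a local HTTP endpoint
stream
    |from()
        .measurement('disk')
    |deadman(0.0, 1m)
        .post('http://localhost:3000')
```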
Michael DeSa 33:39.423 All right. I can also configure a Kapacitor node if I’d like to. Currently, we don’t have anything running at localhost and so that’s why we are getting these errors. And I think that brings me to the end of my chat here.
Chris Churilo 34:04.190 Excellent. So we still have a few more minutes. So if anybody on our training today has any questions, feel free to put them in the chat or the QA panel and we’ll get those answered for you. We will send out an email later on with a link to this webinar so you can take another listen to it at your convenience. And then if you do have any other questions, we do encourage you to go to community.influxdata.com, and you may see your question there already answered. Or if not, just go ahead and start a new thread and our engineers will be right on top of that. So we’ll stay here for just a few minutes more and start answering some of the questions that I see coming in.
Michael DeSa 34:49.998 So we have a question here—where do I get Sandbox? And I will send that to you right now.
Michael DeSa 35:37.844 Just saw now, Mark, that my Chronograf was very small. I apologize for that. I didn’t see that comment during the runtime, so I’m sorry for that. Any more questions?
Chris Churilo 36:01.600 All right. Looks like we’ve got kind of a shy group today. In any event, I will send the email with the link to this webinar, so you can take another listen to it. And if you do have any questions, please post them on community.influxdata.com. And if you have any feedback on our training, or if there are some topics that you’d like us to cover, just reach out to us and we’ll be more than happy to create those trainings to get your questions answered. All right. Well, with that, thank you, Michael. It was fabulous, like always. And I hope everybody has a great time playing with the TICK Stack or InfluxEnterprise. Thanks, all.