Migrating Graphite to InfluxDB
Recorded: January 2017
In this webinar, Jack Zampolin will provide you with a step-by-step process to migrate from Graphite to InfluxDB.
Watch the Webinar
Watch the webinar “Migrating Graphite to InfluxDB” by clicking on the download button on the right. This will open the recording.
Here is an unedited transcript of the webinar “Migrating Graphite to InfluxDB.” This is provided for those who prefer to read than watch the webinar. Please note that the transcript is raw. We apologize for any transcribing errors.
• Chris Churilo: Director Product Marketing, InfluxData
• Jack Zampolin: Developer Evangelist, InfluxData
Chris Churilo 00:03 And, Jack, as we’re getting prepared, here, I just thought I’d share with you a little note from Sean Whitney, who is one of the participants, one of our attendees today. And he says, “Thanks for the invite. The timing couldn’t be any better.” So I’m really pleased for you, Sean. And just to let you know before we get started that Jack has done many of these migrations. So feel free to ask a lot of questions in the Q&A panel so everyone else can see. You can also continue to use the chat panel. We’ll make sure that we get all these questions out and answered. But we want to make sure that these webinars are as useful as possible for everybody. And, like I mentioned, we have them recorded, so you can review them again if needed. Okay. We can probably get settled in here and kick off our webinar. So today, our webinar is all about migrating your Graphite database to an InfluxDB solution. And we have Jack today that will be presenting. And Jack has the ball, so I’ll go ahead and let you introduce yourself one more time and get started.
Jack Zampolin 01:13 Awesome. Good morning, everyone. My name is Jack Zampolin. I’m a developer evangelist over here at InfluxData, and today we’re going to be talking about migrating to Influx from Graphite. So if you click on the Graphite migration tab in the WebEx Event Center, you should be able to see the slideshow, and that’s going to be the best spot to follow along. I will also be manning the Q&A as well as the chat channel. So if you have any questions, please feel free to drop them in the Q&A or the chat, and I’ll try and answer them as I get to them. So let’s get started. So today we’re going to talk about migrating from Graphite to Influx. There’s a lot involved in this topic, but first, we might want to talk about why we would make that switch. And then we’re going to talk about different migration strategies. I’ve done probably 20 of these with different companies that we’ve worked with. I’ve seen some patterns here and different ways to do it, so we’re going to talk through those. And then I’m happy to answer any questions. This topic generally brings out people who are currently undergoing Graphite migrations or considering one, so if you have questions that are specific to your use case, please save those till the end, and I’d be happy to answer them. Okay.
Jack Zampolin 02:53 So starting off, what is Graphite? A lot of people think Graphite is a database (that part is actually called Whisper) or any number of other things, like a web dashboard, and those things are all part of the Graphite ecosystem. But Graphite itself, I like to think of as a protocol that describes time series data. So it was designed at Orbitz in 2006, open sourced in 2008, and it’s kind of all over the place now. And that protocol is period-delineated series keys with individual values and timestamps. Now, that’s sort of the architecture that RRDtool spearheaded. And then, it was also an easy way to design time series databases in a relational database. If you want to do time series in a relational database, you’re going to want to index on time, and that’s very expensive. So you want to make your table as small as possible. And the smallest possible table is a series key that we’re also indexing, a timestamp that we’re indexing, and then a value that we’re not indexing. So that’s kind of how you’d go about building a Time Series Database in a relational database, and Graphite mirrors that structure. So what is it? What is it primarily used for? Host-level monitoring and application monitoring. Because it is a very simple data format, it’s pretty easy to store whatever kind of metadata you need to generate a time series and then graph the resulting points. So you can store application performance metrics in there. You can store host-level metrics in there, sensor data. People use Graphite for all kinds of things. So in addition to the protocol and the database, it’s also an ecosystem of tools. And I would include things like StatsD in that too.
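As a concrete sketch of that protocol (the series key, value, and timestamp here are made up for illustration), a Graphite plaintext message is just one metric per line: a period-delineated series key, a single value, and a Unix timestamp.

```python
import time

def graphite_line(path, value, timestamp=None):
    """Format one metric in Graphite's plaintext protocol:
    a period-delineated series key, a value, and a Unix timestamp."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {timestamp}\n"

# One line per value: the whole series key is retransmitted every time.
print(graphite_line("prod.web01.cpu.user", 42.5, 1485907200))
# prod.web01.cpu.user 42.5 1485907200
```

Note how all the metadata (environment, host, metric name) is packed into the one dotted string, which is exactly what makes querying by a single dimension require regular expressions.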
Jack Zampolin 04:56 So what is Influx? It’s a Time Series Database, but Influx is also part of a full stack of tools developed by the company InfluxData to work with time series. So visualization, we have a tool called Chronograf. Most of our users, and I would say probably most Graphite users, use Grafana as their visualization tool, but we produce one as well. Collection, Telegraf. The comparable tooling in the Graphite ecosystem would be like the Diamond collector or CollectD. Those are common collectors. And then alerting or ETL, and we’ve got Kapacitor for that. In the Graphite ecosystem, you would be looking at things like Nagios or the new Grafana alerting that they just released. And we’re built as a monitoring tool for any time series, so DevOps, finance. You can think of the massive amounts of time series data that are produced by the stock market every day. Internet of Things applications, so if you’ve got a bunch of sensors, and they’re feeding data back, that’s all time series. And then application performance monitoring.
Jack Zampolin 06:09 So what are some common questions that lead people to a Graphite migration? Should I buy or provision another really big machine to run a Graphite cluster on and properly scale it? Is this strategy sustainable or cost-effective with my projected scaling needs? This is one that I see a lot of people run into. Graphite is very hardware expensive, and we’ll dig into some specifics on that in a minute. But people will generally reach a point where their infrastructure can no longer support their growing application, and they need to scale that. And at that point, they’ll step back and say, “Is Graphite the right tool for us now?” If I decide to change datastores, how much effort will go into making this migration? So how much effort is it to migrate backends? I mean, if you’re migrating from something like a JSON-based storage system to Oracle or vice versa, there’s a lot of friction there, so how much friction is there between different Time Series Databases and, specifically, Influx and Graphite?
Jack Zampolin 07:17 Can I migrate the data from my current system into a new database? So would I be able to bring my data with me, or would I have to throw it out? Can I offer an API-compatible solution to clients? So will I have to rewrite my applications to emit different types of metrics? And will switching databases save me money on hardware? If you’re using Graphite now, you’re currently spending a lot of money on hardware. And then, can I continue to use Grafana or the Graphite dashboard for visualization? And these are kind of the common questions that people ask around a Graphite migration. And we’re going to get into specifics on each of these. Also, and I just want to reiterate, if anyone has any questions, please drop them in the Q&A or the chat, and I’ll answer them in the course of the presentation.
Jack Zampolin 08:14 So what are the major Graphite pain points that drive people to migrating? One is Graphite’s clustering architecture. So this picture, if you Google Graphite clustering— and I’ll drop this link in the chat—you’ll see this. There’s one, two, three, four, five, six, seven different processes that go into running a standard Graphite installation. A single Whisper daemon can only really handle around 70,000 series and a very small amount of write metrics. So if you’re running Graphite at any kind of scale, you’re going to need to learn how to do the clustering, build some sort of automation around that so that you can scale up and down the cluster. Graphite splits data with consistent hashing, so every time you change cluster configuration, you’re going to have to rehash. So as you can see here, there’s a number of issues with the Graphite clustering architecture, and this is a major pain point for many people.
Jack Zampolin 09:25 Another is hardware costs. Graphite requires expensive hardware to run. You’re going to need to run on SSDs with very high IOPS, so generally around 20,000. Influx uses considerably less IOPS, and while we do recommend SSDs, you can run the database on spinning disk. And there’s about a 50% performance decrease there for most workloads. But that’s a price a lot of people are willing to pay with Influx because we do have such great performance. So if we look at these recent benchmarks, Jason Dixon of the Graphite project released a benchmark in September of 2016. I’m going to toss the link in the chat here that they did on AWS. The machine that they used was an i2.4xlarge. That’s a 16 CPU, 122-gigabyte RAM machine, and they had some EBS volumes with provisioned IOPS. And they requested the maximum, so that was 20,000 IOPS. Monthly, that machine costs $2,500 roughly, and the performance they were getting was a write throughput of around 60,000 metrics a second across 600,000 different series. And you can go peek at those benchmarks there.
Jack Zampolin 10:53 Influx, the most recent benchmark we’ve released was a benchmark against OpenTSDB. I’ll also drop that link in the chat here. And the write throughput with Influx, you would expect it to be higher, and it was. We were seeing a throughput of around 180,000 metrics a second, but that was across a smaller number of series. So I’d say the performance on both of these tests is roughly comparable, but you see the huge difference in hardware there. We were using two 4xlarge machines, which are four-CPU, 16-gigabyte RAM machines. And then the test was run on the instance SSDs. So we didn’t even need to get provisioned IOPS, none of that. And the monthly machine cost there is around $350. So you can see there’s a tremendous difference in cost to run your infrastructure there, and that is a major pain point for most people running Graphite.
Jack Zampolin 12:02 And then another one is the data model. So we talked earlier about why Graphite’s data model is the way it is. That’s the way Time Series Databases have been designed for a long time. So what are the weaknesses of that data model? There’s no tagging. The period-delineated string that you have has metadata in it, and you can think of those as tags. But that requires regular expressions to query, which is going to be expensive no matter which backend you’re using. In addition to no tagging, you only really get one value per measurement or series key that you’re using, and that’s not great. That leads to excessive retransmission of repeated data. So if we look at the example there, there are three Graphite points. We’ve got an environment, and a host name that spans three period-delineated sections there. And then we’ve got CPU, which is what we’re measuring, and then essentially what we would call field values in Influx: user, nice, and system. Another bad part about that is that in Whisper, each key is stored in a different file, so you get to a point where you have to scale your cluster out to get more open file handles. And that’s not a situation most people want to be in. And that brings you back to the clustering architecture that we talked about earlier.
Jack Zampolin 13:33 And then if we contrast that data model with line protocol, the line protocol data model is measurement comma tagset space fieldset space timestamp. And if you’ll notice tagset and fieldset, you can have as many tags and fields as you want on any given write. The difference between tags and fields is tags are indexed, fields are stored on disk. And Influx allows for multiple fields per point. And if you see that example point down there, it’s the equivalent of the three points in the last slide here. I’ve just translated this into Influx. And you can see that it’s fewer characters, and if we had maybe 10 of these, which would be the case if you’re running the Sensu system check plugin, they would all crush down to one metric. And that’s pretty significant savings over the wire, as well as storage, so pretty nice there. And you see here that’s measurement CPU and then a sequence of tags there, and then we’ve got our three fields with all the data in that. This leads to much more efficient data transmission.
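Here is a small sketch of that consolidation (the host name and values are made up): a helper that builds one line protocol point carrying all three CPU fields that Graphite would have sent as three separate points.

```python
def line_protocol(measurement, tags, fields, timestamp):
    """Build an InfluxDB line protocol point:
    measurement,tagset fieldset timestamp."""
    tagset = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    fieldset = ",".join(f"{k}={v}" for k, v in sorted(fields.items()))
    return f"{measurement},{tagset} {fieldset} {timestamp}"

# Three Graphite points...
#   prod.web01.cpu.user   42.1 1485907200
#   prod.web01.cpu.nice    0.3 1485907200
#   prod.web01.cpu.system  8.2 1485907200
# ...become a single point with three fields:
point = line_protocol("cpu",
                      {"env": "prod", "host": "web01"},
                      {"user": 42.1, "nice": 0.3, "system": 8.2},
                      1485907200)
print(point)
# cpu,env=prod,host=web01 nice=0.3,system=8.2,user=42.1 1485907200
```

The series key metadata (env, host) is sent once as tags, and the three values ride along as fields under one timestamp, which is the wire and storage saving Jack describes.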
Jack Zampolin 14:48 So now that we’ve seen why people would want to switch from Graphite and sort of undergo that painful migration process, what are the different paths people take? And I see three different paths that people normally walk down. One is just replace your Graphite backend with InfluxDB. So this is sort of the lowest impact—this is sort of the lowest impact of the migrations. So what are the advantages there? Doesn’t require any infrastructure changes. You don’t have to change where you’re collecting your data. Any applications that you have instrumented in StatsD or Graphite protocol don’t need to be refactored. There’s no new protocols or code for your developers to learn. It’s just literally changing your datastore.
Jack Zampolin 15:44 The disadvantages there is you’re not taking advantage of any higher-level InfluxDB features. Tagging is a pretty key feature in Influx, and querying with our query language, you’re going to end up using tags a lot. If you’re just using Influx as a Graphite backend, you’re not going to be able to take advantage of that. Another main disadvantage, tooling for this path is mainly community-supported, and maintenance is spotty. For example—and we’ll see this in the next slide—there’s a shim that allows you to write Graphite queries for Influx. This is fighting the framework, essentially, and if you’ve ever worked in Rails or another large framework, you know what fighting the framework feels like. It is not fun. Also, this writes a very nonoptimal schema to Influx. It requires regular expressions for effective querying within the environment. This leads to slower queries, higher memory usage, and slower writes.
Jack Zampolin 16:44 So if you were taking this migration path, what tools would you want? InfluxDB offers a Graphite service. So this just takes in Graphite points and writes them directly to the database. So this would allow you to use Influx as a direct one-to-one replacement for Whisper. You wouldn’t even have to change the way your clients are emitting data. Telegraf, our collection agent, also has a Graphite service as well. So if you need to forward data from, let’s say, one VPC to another, you could use that Telegraf Graphite service to take care of that or any of your other Graphite infrastructure there. And then I mentioned that query shim earlier, and there’s the link to that. I’ve seen a couple of clients use that. It’s not very well-maintained and does lead to writing some really kind of sketchy schema. So just a question, a quick poll here, is anyone here considering that type of migration where you’re just replacing the backend? I’ll pause for a minute to allow you to respond. So Sean said he’s considering this type of migration for some services.
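As a sketch of how little the client side changes (the host name here is hypothetical; 2003 is the default port InfluxDB's Graphite service listens on), an existing Graphite emitter just points at InfluxDB instead of carbon, speaking the exact same plaintext protocol:

```python
import socket

def graphite_payload(lines):
    """Join plaintext metrics, one per line, exactly as carbon
    (and InfluxDB's Graphite service) expect to receive them."""
    return "".join(l if l.endswith("\n") else l + "\n" for l in lines)

def send_to_influx(lines, host="influxdb.example.com", port=2003):
    """Ship unmodified Graphite metrics to InfluxDB's Graphite
    listener over TCP. Only the destination host changes."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall(graphite_payload(lines).encode("utf-8"))

print(graphite_payload(["prod.web01.cpu.user 42.5 1485907200"]))
```

In other words, the migration in this path is a configuration change on the emitters, not a code change.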
Jack Zampolin 18:06 And that gets to the next type of migration we’re talking about: a staged migration. That first case is sort of something that a lot of people don’t do. I’ve seen one or two clients who implemented that, and it really didn’t go well for them. And they ended up having to do sort of a staged migration later and get rid of a lot of this tooling that they had built up. So how would you accomplish a staged migration? The advantages here are, it doesn’t require any immediate changes to your existing infrastructure. Any applications that you have instrumented in StatsD or Graphite don’t need to be refactored. Developers can learn new protocols and code at their convenience. There are many well-maintained tools, including tools that we’ve written and maintained, available to help with this transition. You get the advantages of the cost savings. There’s a considerable decrease in the amount of system upkeep you have to have at the database layer. We were talking about the clustering earlier. And this allows you to leverage the tagging and multidimensional data structure in Influx immediately for your new use cases where you require it. And it allows you to slowly transition your older data over.
Jack Zampolin 19:27 So what are the disadvantages of this approach? Legacy applications may need to be refactored. The way that they emit metrics in Graphite protocol, if you’re switching fully over to line protocol, you’re probably going to want to—you’re probably going to want to use line protocol eventually for all of your applications. And if you have legacy applications that are emitting in Graphite, it is going to require a refactor of those. One of the first things you’re going to do is you’re going to need to update infrastructure with new collectors. If you’re using CollectD, the biggest gain moving from Graphite to Influx for host-level metrics is switching from CollectD to Telegraf, and we’ve got a number of canned dashboards for Telegraf. It’s got a lower memory and CPU footprint than CollectD. There’s a number of advantages there, but if you don’t have a fully-instrumented deployment system, it might be difficult switching that infrastructure. Another disadvantage is you’re going to need to learn the Influx query language. There’s many tools out there to do that, and if you use Grafana already, Grafana has a query builder for Influx that allows you to do that. So it’s not a huge problem, but those are the disadvantages to this migration style.
Jack Zampolin 20:49 So what tools would you use to accomplish this migration? Telegraf has a lot of utilities for accomplishing this type of migration. Telegraf is our data collector, and I think of it kind of like a Swiss Army knife or a multi-tool. If you have a piece of infrastructure that you need monitored or some metrics coming out of somewhere random, Telegraf probably has something to help you out. It seamlessly collects system- and application-level metrics from any part of your stack. So as far as system metrics, we’ve got the standard CPU, memory, and a number of other things. And then for application-level metrics, we have plugins for polling common pieces of infrastructure such as databases and queues. There’s a StatsD server; it can act as a metric forwarder for Influx metrics. It’s able to parse Graphite and output line protocol, which you can also do with the Graphite service on InfluxDB. And existing Telegraf plugins write efficient and performant InfluxDB schema. So that’s one of the biggest wins when you’re migrating: switching over to Telegraf. Because if you’re using Sensu, or CollectD, or another Graphite collector, all of your metrics are in that Graphite style there. And switching all of your system-level metrics from that Graphite style over to Telegraf allows you to instantly transition all of your system metrics into the Influx style. And you’re going to see increased queryability, increased visibility into your infrastructure, and ease of use, really, is another huge one.
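For example (the host name is hypothetical; 8125 is the conventional StatsD UDP port, which Telegraf's statsd input also uses by default), an application can keep emitting plain StatsD packets during the staged migration and simply aim them at a Telegraf listener instead of a StatsD server:

```python
import socket

def statsd_packet(name, value, metric_type="c"):
    """Format a StatsD metric: <name>:<value>|<type>
    ('c' counter, 'g' gauge, 'ms' timer)."""
    return f"{name}:{value}|{metric_type}"

def send_statsd(name, value, metric_type="c",
                host="telegraf.example.com", port=8125):
    """Fire-and-forget a StatsD metric over UDP at a Telegraf
    statsd listener; the application code is otherwise unchanged."""
    packet = statsd_packet(name, value, metric_type)
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(packet.encode("utf-8"), (host, port))
    sock.close()

print(statsd_packet("deploys.prod.count", 1))  # deploys.prod.count:1|c
```

Because UDP is fire-and-forget, swapping the destination from StatsD to Telegraf carries essentially no risk for the emitting application.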
Jack Zampolin 22:38 And down below that Telegraf section, I’ve got how Graphite parsing works. Now, it’s kind of out of scope for this presentation to dig deeply into this, and if anyone has questions about this Graphite parsing at the end, I would encourage you to ask. But the long and the short of it is it takes Graphite points. So as we see there, prod.sequencer142.ingen.com.cpu.user, and then one value and a timestamp there. It’s the points that we were looking at earlier, and it outputs line protocol. And to do that, it passes it through a template. And that quoted string there is the filter and the template. The first part, prod.*, is the filter. So any Graphite metrics that match prod.*, we want to pass to the following template. And it takes each of those little period-delineated sections there, and you can name them as tags. So for prod and those three host strings there, we’re adding those as tags. So prod is our environment. And then sequencer142.ingen.com is our host name. And then we’re also pulling the measurement name out of here. So we see cpu is our measurement. And then everything after cpu we’re going to consider our field name. So if we had three points, user, nice, and system, that all look like this, and we passed them through that template, what it would output in line protocol is the following. And we see cpu,env=prod,host=sequencer142.ingen.com, with our user value, our nice value, and our system value all under the one timestamp.
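To make the template mechanics concrete, here is a simplified Python mimic of that mapping (a sketch of the idea, not Telegraf's actual implementation; the metric and template strings are reconstructions of the slide's example):

```python
def parse_graphite(line, template):
    """Simplified mimic of a Telegraf Graphite template: map each
    period-delineated segment of the series key to a tag name,
    'measurement', or 'field', then emit line protocol."""
    key, value, ts = line.split()
    segments = key.split(".")
    labels = template.split(".")
    tags, measurement, field = {}, [], []
    for label, seg in zip(labels, segments):
        if label == "measurement":
            measurement.append(seg)
        elif label == "field":
            field.append(seg)
        else:
            # A tag name; repeated names (e.g. host.host.host)
            # are joined back together with periods.
            prev = tags.get(label)
            tags[label] = seg if prev is None else prev + "." + seg
    tagset = ",".join(f"{k}={v}" for k, v in tags.items())
    return f"{'.'.join(measurement)},{tagset} {'.'.join(field)}={value} {ts}"

print(parse_graphite("prod.sequencer142.ingen.com.cpu.user 1 1485907200",
                     "env.host.host.host.measurement.field"))
# cpu,env=prod,host=sequencer142.ingen.com user=1 1485907200
```

The real parser also merges the user, nice, and system points that share a timestamp into one multi-field write, which this sketch leaves out for brevity.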
Jack Zampolin 24:36 You can also use this same templating format on StatsD points, so if you have some StatsD period-delineated strings that you want to kind of break down, we can do that, too. If you’re using Datadog, Datadog has tagging, so you might want to import those tags that you’re using on Datadog and then also maybe parse some additional data out of your period-delineated metric names. So whatever form of Graphite or StatsD you’re migrating from, we can help you with this tool here. So this is excellent for, as in Sean’s case, some of those staged migrations, where your net new projects, you’re onlining with Influx, but you’re slowly migrating your existing environment. So before we talk about replacing all Graphite metrics in your environment, can we talk about anyone who’s currently undergoing or interested in sort of that staged approach to migrating? Also, if you have any questions, now would be a great time to ask.
Jack Zampolin 25:47 So the final type of migration we’re going to talk about is a rip and replace, so a full replacement of all Graphite metrics in your ecosystem. So what would we need to do that? What are the advantages and disadvantages? The advantages there are you can use the cutting-edge features of the InfluxData TICK stack immediately, adding something like Kapacitor for alerting, or using the new version of Chronograf to write Kapacitor alerts, or use some canned dashboards for Telegraf plugins that you have running. That just works out of the box without any additional configuration. There are many well-maintained tools available to help with this transition. All of the things that we’ve talked about earlier, as well as some other ones. So you get the cost savings. You get the decrease in the amount of upkeep that you have to do on the system. You can leverage tagging and the multidimensional data structure in Influx. And you can increase your data resolution, so write more data into the system very easily. A lot of people running Graphite, if their servers come under load, as opposed to having to rehash or grow the cluster by adding more hardware to it, will decrease the data resolution: I don’t need 10-second resolution data. All I need is one-minute or five-minute resolution data. Switching to Influx will allow you to get that resolution back and handle increased load on your system very easily. A single server, so the open source version of Influx, can handle hundreds of thousands, up to almost a million writes a second. So it is very performant.
Jack Zampolin 27:43 So what are the disadvantages of this? It may require substantial changes to client metrics. So if you don’t have control over the clients or maybe some of your clients are controlled by different departments, it might be difficult to ask them to change. In that case, you’re going to want to look back at the staged migration and sort of look at some of those tools. It requires developers to learn new tooling immediately. So in the sort of staged migration case, people can kind of move slowly into the new Influx thing and work a couple projects with it before they commit to a switch. It also requires immediate changes to infrastructure. So if you’ll notice here, for smaller, more agile software teams, a lot of these requirements are not disadvantages. So if you are in a place where you can control your clients, your developers are excited about using new tooling, and you can easily change your infrastructure, something like any modern DevOps environment, Ansible, Kubernetes, any of the above, these disadvantages kind of melt away, and doing the full rip and replace is actually better.
Jack Zampolin 29:10 So what tools would you use to do that, that we haven’t talked about already? One would be our API client libraries. And I’m dropping a link to the docs for that. We’ve got Go, C#, Java, PHP, Ruby, Python, Node, Rails, and then a number of other community-contributed ones. These client libraries will allow you to emit metrics from your applications directly to the database. You could write your own Telegraf plugin. Let’s say you’ve got a custom application that you need to poll. It’s got some JSON that it emits from a status endpoint. Telegraf, being the Swiss Army knife that it is, might have an existing plugin that can help you, but if it doesn’t, writing your own Telegraf plugin is extremely easy. It’s by far the best intro-to-Go project I’ve ever found. When I was starting off, I wrote a number of Telegraf plugins to learn Go. It’s a nice introduction to concepts like interfaces and a number of other things. And once you write it, it’s something you don’t have to worry about anymore. And we’ve also got a number of community tools. One of our developers maintains a repo called Awesome InfluxDB, which is a list of cool, community-written tools to work with Influx in different environments. So if you’re looking to work with a random piece of infrastructure, and you don’t find something immediately, check on Awesome InfluxDB, see if someone’s written something for it. And Mark does update that rather frequently.
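As a sketch of what direct emission looks like (the database name and host are hypothetical; this talks to InfluxDB 1.x's HTTP /write endpoint with raw line protocol, which is the same API the client libraries wrap):

```python
import urllib.parse
import urllib.request

def build_write_request(points, db="mydb",
                        base_url="http://influxdb.example.com:8086"):
    """Build the URL and body for InfluxDB's 1.x /write endpoint:
    newline-separated line protocol, database passed as ?db=."""
    url = f"{base_url}/write?{urllib.parse.urlencode({'db': db})}"
    body = "\n".join(points).encode("utf-8")
    return url, body

def write_points(points, **kwargs):
    """POST the points; InfluxDB answers 204 No Content on success."""
    url, body = build_write_request(points, **kwargs)
    req = urllib.request.Request(url, data=body, method="POST")
    return urllib.request.urlopen(req)

url, body = build_write_request(["cpu,host=web01 user=42.5"])
print(url)   # http://influxdb.example.com:8086/write?db=mydb
print(body)  # b'cpu,host=web01 user=42.5'
```

In practice you would use the client library for your language rather than hand-rolling HTTP, but it is useful to know how thin that layer is.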
Jack Zampolin 30:57 So there’s a couple things that didn’t quite fit in that narrative there, and that’s other data sources and tooling. So let’s say you’ve got a bunch of sensors or some special requirements, how do we take care of those? So other tooling, if you have a queuing system that you’re hooking into in your existing environment, Telegraf acts as both a consumer and a producer for the following queuing systems. So what that means is you don’t have to write any consumer or producer programs for this metrics data. You can just pass it directly through your queue and have Telegraf sitting on both ends taking care of it. And the queuing systems we support are Kafka, RabbitMQ (AMQP), NSQ, MQTT, and NATS. And I believe we also have a Kinesis output plugin. So a producer for Kinesis but not a consumer for Kinesis. Especially in high-volume use cases, many customers find this extremely useful, and it saves them a ton of time.
Jack Zampolin 32:03 Now, switching gears, let’s talk about device polling. We have the Telegraf SNMP plugin. It’s a performant SNMP polling program. It’s used at scale in a number of large telecommunications companies. And by scale, I mean 10,000-plus devices. So that’s a highly effective tool that we offer. Also, another tool that I feel remiss in not mentioning earlier is the Whisper-migrator. So if you have an existing Whisper database and want to migrate your data, you can do that with the Whisper-migrator. So Elliot just asked a question. What’s your recommendation if we’re open to new tooling but have been using Logstash for log analysis? So if you’re looking to use InfluxDB for logs, we do have a couple of things. One would be the Telegraf logparser. It uses the same grok patterns as Logstash, but you’re parsing key-value pairs out of there and inserting those into Influx. So that’s one way to do it. We’re also working on more ways to support moving logging over to Influx. Does that answer your question, Elliot? Awesome. So that’s the end of my presentation, and now is the time for questions. Please drop them in the Q&A or in the chat, and I’ll stick around here till around 11:00, I think.
Chris Churilo 33:49 Thank you, Jack. And as you mentioned, please feel free to put in your questions in the Q&A, or the chat window is fine, too. And we’ll stick around for a few minutes to wait for your questions. And as I mentioned earlier, this is recorded. So I will post this at the end of the day, so you can take a look at this again.