How FuseMail Keeps Their Secure Email Service Performant with InfluxDB
Session date: Apr 03, 2018 08:00am (Pacific Time)
In this webinar, Dylan Ferreira, Lead Systems Administrator at FuseMail, will share how he is able to keep their cloud-based secure email service performant by collecting and acting on email latency statistics. They use InfluxDB to store their high cardinality metrics gathered from their custom mailers to the tune of over 1.5 million time series per database. Dylan will detail the aspects of their test and selection process and how they chose InfluxDB. He will then review their setup and dataflow going into the details of the schema design, retention policies, and continuous queries. He will end his talk with a description of how they are able to use Kapacitor in a recent redesign to satisfy the upcoming GDPR requirements.
Watch the Webinar
Watch the webinar “How FuseMail keeps their secure email service performant with InfluxDB” by filling out the form and clicking on the download button on the right. This will open the recording.
[et_pb_toggle _builder_version="3.17.6" title="Transcript" title_font_size="26" border_width_all="0px" border_width_bottom="1px" module_class="transcript-toggle" closed_toggle_background_color="rgba(255,255,255,0)"]
Here is an unedited transcript of the webinar “How FuseMail keeps their secure email service performant with InfluxDB.” This is provided for those who prefer to read rather than watch the webinar. Please note that the transcript is raw. We apologize for any transcription errors.
Speakers:
Chris Churilo: Director of Product Marketing, InfluxData
Dylan Ferreira: Lead Systems Administrator, FuseMail
Chris Churilo 00:00:01.890 All right. It’s three minutes after the hour. We’ll get started. I know there will be a bunch of people that join in after, but it’s okay. Good morning, good afternoon, everybody. My name is Chris Churilo and I work here at InfluxData. And today, I’d like to introduce to you my friend, Dylan, who works at FuseMail, and he’s going to share with you how they use InfluxDB to help with maintaining the latency of their email service. And as I mentioned earlier, if you do have any questions, feel free to put your questions in the chat or Q&A. And if we get a break during the presentation, then Dylan will be able to answer these questions. If not, we’ll definitely get to all the questions at the end of the webinar. So don’t be shy. Make sure you cue up your super hard questions for Dylan. So with that, I will let you introduce yourself and get started.
Dylan Ferreira 00:00:54.740 All right. Hi, everybody. My name is Dylan Ferreira and I’m a Sys Admin at FuseMail. I work for FuseMail, but it is owned by j2 Global®. j2 Global® is a big company that runs all sorts of sites like Mashable, IGN, PC Mag, Speedtest.net. They also have a whole bunch of Internet security offerings, and that’s where we come in with FuseMail and the VIPRE product lines, where we provide anti-spam and anti-virus. And I work on the FuseMail side, where we provide email security solutions like anti-spam, anti-spoofing, anti-phishing, custom filtering, and stuff like that. So presenting is not one of my big strengths, but with luck, I think we can all get through this. I have a lot of information I want to go through, so in some spots, you might start hearing me just droning on like a robot trying to read some stuff off. Shake me out of it with questions whenever you can. I’m going to try to remember to slow down and just talk here and there about some of the stuff that we’re going through.
Dylan Ferreira 00:02:00.442 So what are we doing here today? I’m going to take you through a project that we started in July of 2016. So we’re going to go through what we were trying to do, how we decided on InfluxDB, how we tested InfluxDB so that we could understand what we were going to do with it, understand how it responded, what kind of config we needed, and how much cardinality we could store in it, some of the lessons that we learned while we were doing that testing-there was a bug that I found along the way that was kind of fun to work on-and then the initial design of the email latency collection system. And then I’m going to go into the wonders of raw data and why it was so important to have the raw data that you can get with Influx as opposed to an aggregated system. And then, at the end, I’m going to go into how we used Kapacitor in a redesign to satisfy a GDPR requirement. So we’ve gotten a little bit into Kapacitor and I want to just go through it. So in mid-2016, we were given this task of collecting email latency statistics on our core mailer, i.e. how long does it take for a message to be processed by our mailer. And when I’m talking-oops-when I’m talking about a mailer, this is what I’m talking about, the red box. So the timing data’s confined to the time our mailer spends processing messages. And by processing, I mean scanning, filtering, and routing. The requirements are that we keep long-term latency percentile data in daily, weekly, and monthly aggregates so that we can report on our performance and track it against our SLA, review our software updates and deployments and check for changes in overall performance, and improve our capacity planning. And up until then, we really just relied on queue threshold monitoring to detect delivery lag. And queue monitoring alone shows you when you have really big problems, but it doesn’t show you any of the little problems. So there were lots of small issues that were bubbling under the radar.
The other issue with relying on a single metric like queue depth is that it’s impossible to tell the difference between an increase in message rate and a timeout caused by an unhealthy dependency.
Dylan Ferreira 00:04:29.635 So the initial idea that was put forth was just to send probe messages through our mailer. So the idea is that we would just periodically send messages through the platform, catch them on the other side, and track how long it took for those to go through. So the process can be broken down like this: inject, generating and injecting a probe message into a mail server; capture, delivering the message into a script either directly or through a host mailbox; parsing, you have to parse down the headers to calculate the time spent on route; and the clean-up, cleaning up whatever artifacts are left by the process. So, issues with probes. We’ve used similar setups like this over the years and I’ve always found them to be fragile. You have to maintain lists of hosts. We run nine production data centers to route email traffic and there’s a ton of these mailers in each of these data centers. And maintaining a list of the active mailers becomes a maintenance issue, making sure that you have probes where you need them and making sure that you’re not tracking probes for mailers that are no longer functioning or that we’ve taken out for maintenance. Maintaining special probe accounts. Every time you set up something like this, you have to set up special accounts with routing configs and policy configs so that you have an idea of what you’re testing. And there’s so many paths through the platform that it’s really difficult to generate probes to test all possible routes through our platform, all possible virus scanners, all possible spam hits, and all possible routing types. And it’s close to impossible to test all of these with all of the different types of message payloads that come through. And on top of all this, probes add extra load. Now you’re sending tons of probes through your system and you’re effectively altering how your system performs just by sending so many probe messages and trying to maintain them.
Dylan Ferreira 00:06:39.622 So the result of this setup is a lot of new stuff to manage. There’s a ton of moving parts in a setup like this and there’s tons of ways it could fail. You could get messages stuck in various queues and messages not being deleted properly or messages not being parsable. So you need a ton of monitoring to stay on top of just your probe messages now. Happy path testing. No matter what you’re doing here, all you’re doing is just testing your probes. You can never match with probes the diversity of the actual email traffic. And coarse and often incorrect metrics because all you’re really getting here are metrics on your probe data and your happy path. So you can only send so many of these, and the aggregations aren’t super useful.
Dylan Ferreira 00:07:31.532 So in a recent article on Medium by Cindy Sridharan-that name always gets me-called “Testing in Production, the safe way,” she talks about a similar problem with integration testing. She says, “Performing integration tests in a completely isolated and artificial environment is, by and large, pointless.” And I feel probe tests have a lot in common with this. So it came up, “Why not measure the actual production traffic?” So here’s where we got some lucky timing. Around the same time that we were assigned this project, our mailer team was working on building event logging directly into our mailer. So the mailer would publish information on deliveries into NSQ for consumption by various downstream services. NSQ is a simple distributed message bus. You can publish any kind of content: raw, text [inaudible] or what have you into a topic, and one or more consumers can create channels on that topic and get a feed of this data. It wasn’t a great leap to see that we could tap into this to gather stats on our production traffic, so the mailer team kindly agreed to add high-precision timing data to the payload. By using the actual production traffic, we didn’t just get richer, more accurate aggregations, but we also got message counts and, effectively, a new form of logging that allowed us to zero in on customer experiences.
Dylan Ferreira 00:08:52.528 So onto the next question, how do we store this data? FuseMail’s been around for nearly 20 years. And historically, we’re a [inaudible] with various Graphite [inaudible] setups to store perf data and application metrics data. In 2016, we were starting to move most of our metrics data over to Prometheus with Grafana as the front-end, so we had a few choices. And having worked a little with Prometheus and PromQL, with its labeled metrics and a simple Go binary install with no dependencies, it was really hard to get motivated to set up any Graphite project. However, Prometheus is aggregated by design, and the event log provided us with a wealth of raw data. Graphite could take this raw data in a limited way, but this was pre-v.1 Graphite, and so all desired metadata had to be encoded into the metric name, which made for difficult and unsatisfying tradeoffs when exploring and aggregating the data.
Dylan Ferreira 00:09:51.187 So enter InfluxDB. We had worked with InfluxDB in the past as a storage solution for a logging project, so we knew it was pretty capable. With InfluxDB, we had all the benefits we saw in Prometheus with labels, without the tradeoffs of immediately aggregating. And on top of this, we could store multiple fields per row, so lots of useful context could be stored along with the metrics. So in this slide, I kind of wanted to go over-we do use all three of these systems and I think they all have a place. We still make a lot of use of Graphite, we still have semi-[inaudible] set up, and perf data goes into our Graphite system. Also, I think StatsD still has a place. It’s great if you have an ephemeral task, just a small little batch job or something. It’s really easy to throw in StatsD metrics and store them somewhere. So Prometheus we use a lot of, and InfluxDB we use for any kind of event logging.
Dylan Ferreira 00:11:04.101 So we decided to use Influx for this. So we needed to run some tests and sort of get a good feel for how Influx works. So some of the goals were: how fast can I push data into this database with the hardware I had available? We had doubts that we could keep up with the traffic that we were pushing. How much storage would I need? So I wanted to know how much field data I could put in the database and how much cardinality, basically, I could store. Look into the schema design and cardinality, so just trying to understand how much cardinality I could put in the database before it started to slow down or become unresponsive. Get a feel for the InfluxDB config and the ingest config for the data. I learned later that it really matters how you push data into Influx to get it fast. And the impact of continuous queries on the database while under load, because I was worried that if we started to do aggregations on the data, it might cause the data ingest to slow down and be a problem. So I wrote a producer in Go to write rows of random values into Influx to get an idea of how much load the database could take. The producer could be configured with a configurable amount of cardinality, something that kind of mirrored what I hoped to put in. The rates and the concurrency were configurable, so we could dial up the pressure on Influx until we overwhelmed it. And we quickly learned that Influx is much happier ingesting in larger batches, so I started playing with data rates and batch sizes. Eventually we decided on a maximum of 8,000 entries per batch. I switched from randomly generated data to incrementing field data at one point because I needed to audit and make sure that all the data that I was pushing into Influx was getting there. But I find that when building dashboards with test data, it’s easier to work with random data. To some degree, you don’t want to make a dashboard where you have to use rate functions and stuff where normally you wouldn’t.
So building the dashboards with mock data, it’s easier with random. But if you need to audit your data, it’s much easier to use something incrementing or a known pattern. Whoops.
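The batching approach described above can be sketched as a small helper in Go (a sketch with hypothetical measurement names, not the actual test producer):

```go
package main

import "fmt"

// batch splits line-protocol entries into chunks of at most size entries,
// mirroring the ~8,000-entries-per-batch ceiling described in the talk.
func batch(lines []string, size int) [][]string {
	var out [][]string
	for len(lines) > size {
		out = append(out, lines[:size])
		lines = lines[size:]
	}
	if len(lines) > 0 {
		out = append(out, lines)
	}
	return out
}

func main() {
	// Generate some mock incrementing field data, easy to audit later.
	lines := make([]string, 20000)
	for i := range lines {
		lines[i] = fmt.Sprintf("mail latency=%d", i)
	}
	batches := batch(lines, 8000)
	fmt.Println(len(batches)) // 3 (8000 + 8000 + 4000)
}
```

Each chunk would then go out as one HTTP POST; batching like this, rather than one write per point, is what keeps Influx ingest happy.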
Dylan Ferreira 00:13:40.009 The lessons learned. While I worked through my load test, I ran into a few snags. And actually, I just put this picture up for Chris because I actually met Chris and we talked about this webinar at GrafanaCon in Amsterdam. And I was stopping in Iceland on the way home and she wanted to see pictures. So this is my picture and it kind of matches the tone here. So working with some mock data, I was just playing around with pushing data in. In my head, I was still thinking in terms of StatsD, so I was pushing data into InfluxDB via UDP. In the early days, I wanted my project to fail in such a way that it really didn’t impact production in any way, so I used ephemeral NSQ channels and UDP. An ephemeral NSQ channel is basically a channel off of the queue that can only get so deep, and then it just drops stuff-it just /dev/nulls stuff after it gets to a certain height, so you don’t end up with memory problems or a really big queue. And in our environment, we don’t alert on ephemeral channels. We assume that if you make an ephemeral channel on a message bus, it’s not something that needs to alert an on-call person. So in most metrics projects, I try to make metrics collection not be something that actually ever impacts production. There are very few metrics collection projects, in my thinking, that are worth actually impacting production or causing alerts to on-call people. So anytime I build something, I’m thinking in terms of, “How can I keep it so that it doesn’t have any impact?” Or if it does have impact, I have it alert a special alert channel in PagerDuty just to me.
Dylan Ferreira 00:15:41.865 So the graph above shows a per-second ingest rate in the database. The data is coming from a test script, so it should maintain a constant 900 samples per second. But occasionally, it’s dropping. So I checked my test script logs and I found that it was producing data as I expected it to, so what the heck is happening? Is something wrong with Influx? No. It was just UDP. Anytime the host was under pressure, I saw data loss. I’d seen a talk by Joe Damato of PackageCloud where he detailed the many ways that a kernel will give up on UDP, making the point that you can sometimes lose data in your UDP channel even with tiny, impossible-to-detect CPU spikes. So the screenshot above shows the impact of a maintenance job that was running in an adjacent container on the same host. Note we’re only seeing drops here with no corresponding spike, so we know we’re losing data. I realized this approach was doomed, so I switched over to the HTTP API.
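Switching to the HTTP API might look roughly like this in Go. The host and database names are hypothetical; the point is that, unlike a UDP send, a failed POST to the InfluxDB 1.x /write endpoint surfaces an error that can be logged or retried:

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"strings"
)

// writeURL builds the InfluxDB 1.x HTTP write endpoint for a database.
func writeURL(base, db, precision string) string {
	v := url.Values{}
	v.Set("db", db)
	v.Set("precision", precision)
	return base + "/write?" + v.Encode()
}

func main() {
	// A batch is newline-separated line protocol (hypothetical points).
	batch := strings.Join([]string{
		"delivery,dc=dc6 latency_ms=412.5",
		"delivery,dc=dc6 latency_ms=98.1",
	}, "\n")
	u := writeURL("http://influx.example.net:8086", "mail", "ms")
	resp, err := http.Post(u, "text/plain; charset=utf-8", strings.NewReader(batch))
	if err != nil {
		// With UDP, this loss would have been silent.
		fmt.Println("write failed:", err)
		return
	}
	resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```

TCP-backed HTTP writes also give back-pressure: if the database slows down, the client sees it, rather than the kernel quietly discarding datagrams.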
Dylan Ferreira 00:16:47.298 So, continuous queries. The documentation describes continuous queries as InfluxQL queries that run automatically and periodically on real-time data and store query results in a specified measurement. I got my feet wet with continuous queries by adding simple aggregations to get ingest counts, because selecting per-second counts on high-res data across long timespans was interfering with my tests. So I set up a continuous query just to get that. But there’s a catch, I found. So I added the continuous query shown above, where I calculate a per-second rate once every 30 seconds, and I noticed a strange, repeating pattern in the data. The graph above is showing the ingest rates on the data. The green line is showing the actual ingest rate, so that’s just a query that I ran across the entire timespan to calculate the ingest rate. And the blue line is showing the output of the continuous queries. After staring at this for a while, I realized I was watching this continuous query actually beating the ingested data: I was generating a timestamp and then sending it to InfluxDB, and by the time the timestamp actually made it into the database, the continuous query had already completed. So I was getting this kind of harmonic that was repeating over and over again, and that’s where I realized that I needed to add this little bit of overlap. So it’s really, really important, I’ve learned, if you’re not asking InfluxDB to generate your timestamp for you-if you’re generating your timestamp and then sending it in-to make sure that you add a little bit of overlap in your continuous queries to go back and re-sample the last little bit of the last time range, just to fill in whatever might not have made it in by the time you ran your continuous query.
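That overlap can be expressed in InfluxQL with the continuous query RESAMPLE clause, where EVERY controls how often the query runs and FOR re-covers a window larger than the run interval. This is a sketch; the database, retention policy, measurement, and field names here are hypothetical:

```sql
-- Run every 30s, but recompute the trailing 2 minutes each time,
-- so points with client-generated timestamps that arrive late
-- still get folded into the aggregate.
CREATE CONTINUOUS QUERY "ingest_rate" ON "mail"
RESAMPLE EVERY 30s FOR 2m
BEGIN
  SELECT count("latency_ms") AS "count_30s"
  INTO "mail"."long_term"."ingest_counts"
  FROM "mail"."autogen"."delivery"
  GROUP BY time(30s), *
END
```

With FOR 2m, each run recomputes the last four 30-second buckets instead of only the most recent one, which removes the harmonic described above.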
Dylan Ferreira 00:18:51.394 Mockup your dashboards early. So just like building UIs, building your dashboards first helps you understand exactly what kind of data you want to collect and what kind of aggregations you want on that data. And then you can go back to your actual data collection and figure out what you need and how you need to store it. So the same goes for mock data. It helps to test with mock data at known rates and known cardinality. The test data above was generated with random field values, but like I said, I switched to incrementing field values later to make it easier to audit.
Okay. This is a super boring slide. I found a bug. So while I was auditing some early live data, actual event log data from production mailers, I noticed some weird stuff in my graphs. I noticed negative latency numbers. And while I’d love to achieve negative latency, I suspected that we had a problem. So I put a ticket in with the mailer team to check their event log code, but they had audit logs containing the raw event log data and we weren’t seeing the negative numbers there. And it can’t be NSQ, because NSQ just transmits whatever data you put into it; it’s not going to go and modify portions of the JSON. And I added some more debug in my consumer and I couldn’t find anything wrong with my consumer either. I just logged out the line protocol that I was pushing in and I couldn’t find those negative numbers there. So what could it be? I started to think it might be InfluxDB, so I went on GitHub and went through the repo, and I found somebody reporting a similar issue, and it indicated that the problem wasn’t just confined to negative values; sometimes positive values could be wrong as well. By comparison, the positive numbers are pretty much impossible to spot without a detailed audit. The negative ones are easy to spot. There was a pattern with these anomalies I noticed: they all occurred when I was replaying data. Because I was just testing, I was sort of running data in and then I’d run it in again. Sometimes I was doing multiple replay runs at the same time. So I started working on ways I could trigger this problem because, at that point, nobody knew what was going on. So I put this test together. Okay. So you’ll find that this bug is actually pretty minor because I really had to abuse Influx really badly to actually make it happen. What I do is I insert rows in batches of 8,000 with a precision of one second. So I’m inserting them as quickly as possible, so there’s obviously going to be tons and tons of overwrites of the same time offset in Influx.
The rows contain three fields. One is a batch counter, which is just an incrementing number for every batch that I’m posting. And then there’s a pair of counter fields that contain the same incrementing number, set from the same incrementing variable. The idea being that counter one and counter two should always match on insert and they should always match in the database, because you’re just pushing in the same numbers. But when I ran this against sort of early Influx, pre-v.1, with a concurrency of 10 workers running against the same database as fast as possible, I found a small percentage of my rows actually showed up with different numbers in counter one and counter two, meaning it was kind of randomly selecting this stuff. And a simple SELECT * where counter one is not equal to counter two pulled out all my errors. So I filed this issue with Influx along with the test code, and the Influx team verified, actually, that this was a known race condition with the database at the time. If you really badly abused it with overwrites, at some point it would be writing fields into separate series, and it would not write the same fields from the same insert, basically, into the row as a final write. So this wasn’t a huge problem for me because I really didn’t have any intention of doing this in production. Basically, it boiled down to: if you were overwriting a single time series with the same time offset, with concurrent connections, a ton, then your resulting field data could be wrong. And you really shouldn’t do that anyway. But once the new TSI index was released, I wasn’t able to reproduce this issue. So I actually looked it up, and the other ticket that was talking about this said, “Yeah. It was confirmed to be fixed in v.1.1.”
Dylan Ferreira 00:23:47.174 So, prototyping tools. And keep in mind, this is all sort of mid-2016, so there are some things I’ll go into that can actually do a lot of this now. So what we have so far is our mailers are producing event log data into NSQ. And for prototyping purposes, we can rely on nsq_tail to consume data. So nsq_tail just works like regular tail but on an NSQ channel. So you can use nsq_tail to hook into a regular or ephemeral channel and it’ll just write to stdout all of the data that it’s getting in that channel. If we need to transform or pre-filter the data in any way, I can pipe it through jq. I often use jq to add context, like what data center I’m in, into the JSON, so I don’t have to ask everyone creating event logs to add this awareness into services which otherwise don’t require it. But to work with this or any other JSON-formatted data, I needed one simple piece of glue, which I call json_to_influx, and this completed my prototyping toolkit. So json_to_influx takes a stream of JSON payloads on STDIN-this isn’t latency log data. This is just some other event log data that I get from parsing syslog, looking for out-of-memory kills, segfaults, disk problems, and stuff like that. So it takes that in, it outputs Influx line protocol to the HTTP API, and it uses a config to map certain elements of the JSON into the time index, into your tags, and into your fields. The time index and tags are simple; it’s just a one-to-one assignment. Fields are a little more complicated in that I cast them to a type. In the target section, we configure a DSN with optional batching parameters. So basically, you can tell this thing to batch. What it does is you say maybe a batch of 8,000, so if you hit 8,000, it sends a batch in, and there’s also a time ticker that you can set that says, “When five seconds pass, whatever you’ve accumulated, just push that in as well.” If the data is critical, I don’t use this rig.
I prototype with this rig and then I write a dedicated consumer based on what I built in the prototype, because the prototyping here is pretty leaky. If things get chaotic, you’re going to lose data. There’s no requeue mechanism or any kind of back-pressure mechanism here. You’re just piping data into various things. The json_to_influx code works great as a template when you’re building your final product. So I’ll refer to this piece, nsq_tail, jq, and json_to_influx, as the “consumer” for the rest of the talk. So that’s what I mean when I say consumer. So what I wanted to emphasize with this slide is I’m not a developer. I’m a sysadmin. But I can code a little bit. Pretty much all of this was built with off-the-shelf stuff. And it’s very easy to twist it into whatever shape you need. And actually, since I built json_to_influx, Telegraf now has the capacity to do almost all of this, as it has an NSQ consumer now. And it can take JSON in and actually transform it into line protocol into Influx. Although the JSON transformer in Telegraf is a little less configurable in that it can only handle numeric field data and it ignores any kind of string data. And in this case, I really wanted to store a lot of string data-like message IDs, any kind of UIDs, or things like that-as fields, so I would have them later for whatever kind of aggregations or logging stuff that I wanted to do.
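The core of a json_to_influx-style mapper can be sketched in Go. The names and config shape here are hypothetical; the real tool also handles the time index, type casting, and batching, and proper line protocol needs tag-value escaping that this sketch skips:

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"strings"
)

// Mapping is a minimal config: which JSON keys become tags, which
// become fields, and the target measurement name.
type Mapping struct {
	Measurement string
	Tags        []string
	Fields      []string
}

// toLineProtocol converts one JSON payload into a line-protocol entry:
// measurement,tag=v field=v. Tags are sorted for stable output; string
// fields are quoted, numbers written as-is, other types ignored here.
func toLineProtocol(payload []byte, m Mapping) (string, error) {
	var doc map[string]interface{}
	if err := json.Unmarshal(payload, &doc); err != nil {
		return "", err
	}
	var b strings.Builder
	b.WriteString(m.Measurement)
	tags := append([]string(nil), m.Tags...)
	sort.Strings(tags)
	for _, t := range tags {
		if v, ok := doc[t]; ok {
			fmt.Fprintf(&b, ",%s=%v", t, v)
		}
	}
	sep := " "
	for _, f := range m.Fields {
		switch x := doc[f].(type) {
		case string:
			fmt.Fprintf(&b, "%s%s=%q", sep, f, x)
			sep = ","
		case float64: // encoding/json decodes all JSON numbers as float64
			fmt.Fprintf(&b, "%s%s=%g", sep, f, x)
			sep = ","
		}
	}
	return b.String(), nil
}

func main() {
	m := Mapping{Measurement: "delivery", Tags: []string{"dc", "mailer"}, Fields: []string{"latency_ms", "message_id"}}
	line, _ := toLineProtocol([]byte(`{"dc":"dc6","mailer":"mx04","latency_ms":412.5,"message_id":"abc123"}`), m)
	fmt.Println(line) // delivery,dc=dc6,mailer=mx04 latency_ms=412.5,message_id="abc123"
}
```

Note that string fields like message_id are kept, which is the capability the talk points out was missing from Telegraf's JSON parser at the time.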
Dylan Ferreira 00:28:03.952 So this was v.1 of the email latency setup. It looked like this, a set of consumers in each DC batch-posting into a central Influx database instance, where continuous queries aggregated the data down. So we run dashboards directly from this instance. And at the time we created v.1, we found the database couldn’t handle the combined cardinality of hosts times customers, the number of hosts that we had times the number of customer numbers we wanted to put in. So we actually ran two of these rigs, one tagged by data center, hostname, and several other low-cardinality, low-priority tags, and one tagged by data center and customer, which just let us look into customer-specific experience problems. But when v.1.1 came out and the TSI was released, I found that a single database could easily handle all of the cardinality. And in fact, the load was pretty low. This was a surprise because pre-TSI, I started to run into trouble with cardinality of 200,000 or so. But running with the TSI I have now a cardinality of 1.75 million and it doesn’t really seem to be working very hard at it. So I feel like I have a lot of headroom there.
Dylan Ferreira 00:29:20.183 Okay. So on to some fun stuff. Let’s look at some data. So looking at the raw data over the first few weeks, we started seeing some interesting patterns. Little micro outages and timeout retry patterns. Lots of them are in the second and sub-second realm, but they directed us to underlying config problems, query issues, and hardware problems that went unnoticed by our threshold monitoring. And after looking at the raw data for a while, we started to recognize distinct patterns in the data, like glyphs or signatures of specific types of problems. So raw data is kind of funny that way: you end up starting to read it. So these shapes put you onto a particular problem that you should look into. So we started with the raw data and then we used it to build a, hopefully, lighter-weight view, because this kind of raw data scatterplot is pretty heavy on your browser in Grafana. So here’s a great example of raw data making it easy to solve problems. We have a scatterplot here of some early live test data, and each dot represents the delivery of an email, with the Y-axis being the time it took to process in milliseconds. So most messages are being delivered sub-second. We have some issues, but what the heck is this? What are these batches of deliveries at 30 seconds? Turns out, this was just a dependency that was locking up until it hit a timeout, and then it just let the message go. We didn’t expect the raw data to show us stuff like this so clearly. So we ended up staring at scatterplots quite a bit after this because we realized we could actually tune everything in our stack just with this data. We could see all sorts of stuff that was really, really difficult to get out of the logs. Because, really, in a one-hour view like this, we’re looking at hundreds of thousands of deliveries, and it’s hard to pick that stuff out.
Dylan Ferreira 00:31:25.785 Okay. Warhol Vision. I built a dashboard specifically for looking at these issues called Warhol Vision, and I only called it Warhol Vision because it just reminds me of Andy Warhol paintings. Its main feature is four large panels running nearly the same query, but each panel shows the data tagged by one of four main tags: direction, inbound versus outbound traffic; data center, which data center that traffic was in; mailer, which mailer or host this traffic was on; and customer, which customer the messages belong to. So this kind of lets us view all the critical dimensions of the data at once and helps us understand the scope of an issue so we can get to the debugging faster. Because I find that there’s so much information now in the tags that on a two-dimensional graph, it’s really hard to express that. So if you kind of look at this dashboard, you can immediately sort of look and see where it is, who it belongs to, or if it belongs to any one particular person. In some cases, it may belong to everybody, so you know it’s more of a host problem than a customer problem. Anyway, I put a configurable high-pass filter on this so it won’t crush my browser, because when you’re looking at larger timespans and you’re downloading hundreds of thousands of data points four times over, you’ll just kill your browser. So we’re looking here at just messages that took one minute or longer to be processed. So this is only the stuff that’s breaking the SLA. So looking at this issue, we’re looking at the scatterplot of raw message latency data, where the Y-axis is processing time. And the top panel shows a spike in ingest rate, meaning that we know that we received more messages during this time. And the spike corresponds to the issue that we’re seeing in the other panels. So the direction panel tells us that this is outbound traffic. The DC panel says, yep, this is in DC6.
The mailer panel shows it was only on one host in that data center and the customer panel is showing us that the messages all belong to one particular customer. And you’ll notice that, although this issue was caused by one customer, it had an impact on other customers being processed. You can see that little rainbow geyser coming out of the side of the customer there. That’s other messages that got dragged down because of this. The Grafana ad-hoc filter is very cool for this. It works with Influx. I can’t say enough good things about it. It just makes it so easy to explore your data because you can basically just pick any of your tags and narrow your data down by it, and you can chain it together. So if I wanted to just look at data for just this customer, I could select the customer number in the ad-hoc template at the top and all of the panels would only then show messages for that customer. So I could see if messages for this customer were going through one or more data centers perhaps, or what hosts they were impacting, etc., without having to burden my browser with all sorts of other data that I don’t actually care about.
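The high-pass filter amounts to a WHERE clause on the latency field, something like this InfluxQL sketch (measurement, field, and tag names are hypothetical):

```sql
-- Only deliveries at or over the one-minute SLA reach the browser;
-- the sub-second bulk of the traffic is filtered out server-side.
SELECT "latency_ms" FROM "delivery"
WHERE "latency_ms" >= 60000 AND time > now() - 1h
GROUP BY "customer"
```

Grouping by a tag like customer splits the result into per-customer series, which is what lets each Warhol Vision panel color its dots by a different dimension of the same data.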
Dylan Ferreira 00:34:47.336 So going back, we started off looking into just doing some probe messages to keep track of how fast our mailer kit was functioning and how fast we were delivering stuff. But because we went into the “measure live traffic” thing, we ended up having all this extra data, and we started to use it just to debug our environment. It’s something we didn’t anticipate. Something that really struck me was that it became a much more useful tool for other things. And in fact, in the end, the actual SLA monitoring stuff was kind of a side thing because we ended up using this for debugging pretty much all of our system problems. But there’s a catch. Sometimes we look at scatterplots for too long. So this popped up at one point and I saw a rocket ship in it, and I found a picture that pretty much matched it. And I had to tweet it to one of the Monitorama lecturers who really liked to take graphs and put pictures on them. So I tweeted him and he got all excited. Anyway, don’t stare at your scatterplots for too long. It’s bad for you.
Dylan Ferreira 00:36:13.101 So my boss asked me to put numbers up that he could report on. And of course, this was the whole reason why we did the project in the first place. But after weeks of poring over the wonders of scatterplots, I wasn’t super excited to build a big wall of numbers. I did find it difficult to build these queries, though, and I ended up having to rely on Grafana to lock the aggregations to specific date boundaries like week or month. If I wanted to get a daily rate or a weekly rate, I had to ask Influx to aggregate hourly, then have Grafana do a sum over those hourly aggregates and lock it to a particular date range. Anyway, this is all great stuff. When I began, I thought I’d be spending most of my time figuring out how to get data into InfluxDB efficiently, but once that was taken care of, I found myself spending even more time just looking at the data and finding new ways to express it.
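The workaround described above, hourly aggregates from Influx summed to calendar boundaries on the Grafana side, can be sketched client-side in a few lines of Python. The data shape here is assumed for illustration, not taken from the talk.

```python
# Sketch: Influx returns hourly aggregates; the per-day totals are
# summed client-side and locked to calendar-day boundaries (Grafana
# did this step in the talk).
from collections import defaultdict
from datetime import datetime

def daily_totals(hourly_points):
    """hourly_points: iterable of (iso_timestamp, count) pairs.
    Returns {date: summed_count} keyed on the calendar day."""
    totals = defaultdict(int)
    for ts, count in hourly_points:
        day = datetime.fromisoformat(ts).date()
        totals[day] += count
    return dict(totals)

hourly = [
    ("2018-04-01T00:00:00", 120),
    ("2018-04-01T13:00:00", 340),
    ("2018-04-02T07:00:00", 95),
]
# The first two points land in the same April 1st bucket.
```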
Dylan Ferreira 00:37:14.424 So here’s a collection of other graphs that we ended up making that we didn’t really anticipate when we first started doing this. But because we had all of this extra data, we suddenly started realizing it was super simple to get other things. The bottom panel is a good example. It shows a filtered hourly view of the slowest delivery times by customer. So we ended up having this up on a dashboard in our office, and anybody in the office can see that data and start asking questions like, “Why is that one so slow?” or, “Oh, I know that customer.” So it’s super handy to have stuff like this up because everybody gets involved.
Dylan Ferreira 00:38:06.303 So Kapacitor. InfluxData describes Kapacitor as a real-time streaming data processing engine. And we’ve been using Kapacitor at FuseMail for about a year. In the early days, when I was first looking at Influx and Kapacitor, I had no idea what this thing was for, and I was really happy that Influx had continuous queries so I didn’t have to think about it too much. But I got into Kapacitor and I started pushing a whole bunch of event log data into it out of our mailers and writing TICKscripts to filter and aggregate the data down. This was largely considered R&D work, but it was starting to get used by our support people quite a bit to spot spam. What I was doing, actually, is I was going to the support people and to the other sysadmins, saying, “What types of things are you looking for when you’re trying to spot a compromised account or an inbound spam problem?” And I translated those into TICKscripts that would aggregate the data I was getting from our mailers into some kind of view. And some of those views actually paid off, and it was getting really easy to spot any kind of inbound or outbound spam problem or any kind of compromised account problem. The types of things I was doing were things like recipient address count by sender address, or HELO count by sender address, or sender domain count by IP. So how many IPs is one sender sending from at any given time, just a count per minute. That kind of thing makes it easy to spot, because usually, when people are sending a lot of email, enough that it comes up on your radar, you’re not going to see it coming from a whole pile of IPs unless it’s a bot or something like that. So this worked for high-cardinality data. I found it just ate a lot of memory, but so far, I’ve been really happy with it.
And I use this kind of setup where I’m basically just pumping data from NSQ straight into Kapacitor with my consumer rig and then I just leave it there basically. I just push data into it all the time. And whenever I have ideas about something that I want to see, I’ll write a TICKscript that will take the data and take a stream of that data and just put an aggregation in Influx database somewhere so I can put up in Grafana. And it’s really great for just exploring stuff.
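One of the aggregations mentioned above, a per-minute count of distinct source IPs per sender, can be sketched in plain Python. The talk implemented these as TICKscripts; the event field names here are hypothetical.

```python
# Sketch of one spam-spotting aggregation: distinct source IPs per
# sender per minute. A sender suddenly mailing from many IPs in one
# minute is a botnet / compromised-account signal. Field names are
# hypothetical stand-ins for the mailer's event log schema.
from collections import defaultdict

def ips_per_sender(events):
    """events: dicts with 'minute', 'sender', and 'ip' keys.
    Returns {(minute, sender): distinct_ip_count}."""
    seen = defaultdict(set)
    for e in events:
        seen[(e["minute"], e["sender"])].add(e["ip"])
    return {key: len(ips) for key, ips in seen.items()}
```

The same shape covers the other views mentioned (recipient count by sender, HELO count by sender): swap which fields form the key and which are counted.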
Dylan Ferreira 00:40:50.226 So that goes into my final review. So v.1 was the mailers pushing data into NSQ, with a consumer that then pushes data to a central Influx database. And I have nine data centers, so really, for the data center panel you’re seeing there, there’s nine of them all pushing into a central Influx database. But we have a problem. GDPR is coming up and you can no longer export personally identifiable information out of a user’s region. Personally identifiable information, or PII, includes information like email addresses, IP addresses, etc. And our model requires that we push all of our data back to a central database to be aggregated. So why not use Kapacitor and use a similar model to the R&D rig that I was working on earlier? In this model, we write the raw event log data in each DC to a local Influx database in that data center, and then we use Kapacitor to perform batch queries. You basically just tell Kapacitor to hook into a table and do a query against it. And you can then aggregate locally, but you can also aggregate centrally. The per-DC Influx databases can have richer data. They can hold IP addresses and things like that, but with tighter retention policies, while the central Influx database gets high-level aggregations and drops all personally identifiable information for long-term storage. And as a bonus, this model scales better, as the per-DC databases only have to cope with the raw data from a single DC, while the central database only has to cope with ingesting the long-term aggregations. Sensitive data doesn’t have to leave the data center, and we still get our long-term percentiles and counts in a central place that’s easy to query. I have to be honest, the idea for this v.2 rig came before I was worrying about GDPR. I originally wanted to do this just to make it easier to manage the SLA database and scale it out, because one giant database isn’t really great.
So the other really great thing about this rig is that any time you have any questions or you want to do anything, you can just add another TICKscript to your Kapacitor rig and choose where you want to store the data. And you can even store the same aggregations in multiple databases for redundancy, so now you’re not worried about having one single database that perhaps you can’t rely on. One of the problems I haven’t worked out yet is with my batch queries: if I’m writing to a remote database, I’m not sure yet how to handle that database being unreachable, and how I would then backfill the data. But there are ways around that. I’m not sure if directly writing to Influx is the path to take here, but I’m still learning. And I should have put those up earlier.
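The v.2 split described above, raw PII-bearing events staying in the per-DC database while only aggregates travel centrally, can be sketched like this. All field names are hypothetical, and the percentile math is a naive stand-in for whatever the real Kapacitor batch query would compute.

```python
# Sketch of the v.2 flow: per-DC databases keep raw events (with PII
# such as sender addresses and IPs, under tight retention), while the
# record shipped to the central database is an aggregate carrying no
# per-message PII. Names are hypothetical.

PII_FIELDS = {"sender", "recipient", "ip"}

def central_aggregate(dc, raw_events):
    """Reduce one DC's batch to a count and a latency percentile."""
    latencies = sorted(e["latency"] for e in raw_events)
    p95 = latencies[min(len(latencies) - 1, int(0.95 * len(latencies)))]
    record = {"dc": dc, "count": len(raw_events), "p95_latency": p95}
    # Nothing sensitive leaves the data center.
    assert not PII_FIELDS & record.keys()
    return record
```

The central database then only ingests these small records, which is also why the model scales: raw-data volume stays local to each DC.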
Dylan Ferreira 00:44:08.796 So we’ve used the same pattern, event log into NSQ into consumer, on several other projects. And one of the most useful things I’ve done with it is batch job metrics, to keep tabs on the thousands of automated periodic tasks that are running in our environment. I think this is just a five-minute view, and in this five minutes, we actually have 40,000 events. Some of them are cron jobs, some of them are looping ephemeral tasks that run under supervisord. But this is how we keep track of how healthy those things are: whether or not they were exiting healthily, whether or not they were hung, or whether they’ve started to take longer and longer to run. Yeah. Anyway, this pattern works really well, I find. Once you get into the habit of writing services that post event log data into a message bus, and then you put together a little prototyping toolset, it becomes really easy to stand up a temporary ad-hoc view of anything in your environment.
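The batch-job events described above could be encoded by the consumer as InfluxDB line protocol along these lines; the measurement and tag names are invented for illustration.

```python
# Sketch of the "event log into NSQ into consumer" pattern for batch
# jobs: each task posts one event when it exits, and the consumer
# formats it as an InfluxDB line-protocol point. The measurement name
# "batch_jobs" and the tags are hypothetical.

def job_event_line(job, host, exit_code, duration_s, ts_ns):
    """Format one job-completion event as line protocol:
    measurement,tags fields timestamp (nanoseconds)."""
    return (
        f"batch_jobs,job={job},host={host} "
        f"exit_code={exit_code}i,duration={duration_s} {ts_ns}"
    )

line = job_event_line("rotate-logs", "dc6-mailer1", 0, 12.4,
                      1522771200000000000)
```

Dashboards over `exit_code` and `duration` then cover the cases mentioned: unhealthy exits, hung jobs, and jobs that slowly take longer to run.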
Dylan Ferreira 00:45:12.790 So that was my presentation. Thanks for watching. And also, at FuseMail, we’re hiring. Had to get that in. Anyway, I guess we’ll go to questions.
Chris Churilo 00:45:26.495 Cool. That was really great. And it’s totally fine to promote that you’re hiring [laughter]. And thank you for the picture earlier on.
Dylan Ferreira 00:45:36.748 Oh, no problem.
Chris Churilo 00:45:36.832 As we wait for questions from the attendees that joined us, I was just thinking about the GDPR slide that you had a few minutes ago. I think so many of us just think about the actual data that we collect from our systems, maybe to invoke a subscription of whatever type, but it’s really easy to forget that we also use that same data when collecting metrics to diagnose latency kinds of issues. And all of a sudden, whatever model we might have created to securely store that information in the right location is now getting propagated. Not with bad intent. It’s just really easy to forget because it’s usually two different teams that are dealing with that. So I thought that was a pretty great use case that you shared with us. The other thing that I wanted to ask you about, and you mentioned this a couple of times, is how surprised you were, once you were able to look at all that raw data, at how many other viewpoints you could have from that data. Maybe you can just touch on that a little bit more.
Dylan Ferreira 00:46:54.995 Yeah. So once we started looking, I really didn’t expect to see anything but noise. I thought, “Oh, I’ll just probably make a scatterplot and it will be really cool to show how many data points there are in a timespan.” I just thought I’d see a big box of dots. But all sorts of weird patterns showed up, and it was kind of fun because everybody started picking stuff out of them and figuring out, “Oh, wow, what’s that diagonal line? What is that?” Well, that’s a serial delivery that’s happening where there’s a timeout on every single one, so you get a diagonal set of lines at specific intervals. So you start to get good at spotting, “Oh, I see a line, but that line’s not on an exact second, so it’s probably not a timeout.” Because with timeouts, you’re picking from a small set of regular numbers, like a second or 5 seconds or 30 seconds or 5 minutes. But it wasn’t just our mailer that we started debugging. We started finding problems with other things that were impacting the mailer because of a bunch of shared resources like database and disk. So it’s quite something. In the end, we have a lot more dashboards built off of that data that are just there looking for anomalies, or retrospective dashboards that we built to look at a particular problem, than we do for the original idea. And things like that table of worst customer experiences in the last 24 hours. If I was given the task of coming up with a way to store that as my original task, I would have been kind of overwhelmed. I wouldn’t really know how to start doing something like that.
But because I had all that data, it was actually just an afterthought: “Oh, wow, I could just make a continuous query to grab that data.” It’s a really lightweight table because it’s filtered for the worst experiences, so it’s only a small bit of data that I’m throwing in there. But it’s super useful, and it’s really not what the original intention of the system was. And having the customer numbers as a label, and taking the data from the actual customer deliveries, means we can actually find that certain customers are having problems and address that, rather than just probe messages, where all you get is, “Well, as a whole, the system is good or not,” right.
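A continuous query like the one described could look roughly like the following InfluxQL, held here as a Python string. The database, retention policy, and measurement names are all hypothetical placeholders, not FuseMail’s actual schema.

```python
# Sketch of the "worst customer experiences" rollup as an InfluxDB
# continuous query: the worst latency per customer per hour, written
# into a small rollup measurement. All names are hypothetical.

WORST_CUSTOMERS_CQ = """\
CREATE CONTINUOUS QUERY "worst_customers" ON "sla"
BEGIN
  SELECT max("latency") AS "worst_latency"
  INTO "sla"."rollups"."worst_by_customer"
  FROM "message_latency"
  GROUP BY time(1h), "customer"
END"""
```

Because only one point per customer per hour survives, the rollup table stays tiny even though the source measurement holds millions of series.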
Chris Churilo 00:49:46.482 Right. Right. But what you’ve been able to design is really a real view of how the customers are experiencing your service, which I think you said was kind of the goal upfront, right. And that’s the whole point, we want to make sure that our customers are happy with our service. And I also-
Dylan Ferreira 00:50:05.977 Yeah. To be honest, when we first started talking about storing the raw data, we kind of laughed it off as, “Well, that’s going to need a lot of resources and we don’t want to buy hardware for that.” The thought was, “Why don’t we just do probe messages? Because it’s going to be really hard to store all that data.” And it turned out to be, actually, not hard at all to store that data. What we have now is running at a load of 1.2 or something like that. And it’s got a massive cardinality, for me anyway. 1.75 million series is a ton to have in one measurement. So we really didn’t think it was possible; it was really thought of as kind of a joke. And then when we started working with Influx: “Oh, wow, this is actually going to work.”
Chris Churilo 00:51:00.874 I really love the one chart that you showed of the four different dimensions, the customer versus data center. I mean it just seems so obvious after I looked at it that, yeah, you have those four different tag values. Why not look at them just individually to see if you can spot anything different?
Dylan Ferreira 00:51:20.633 Yeah. That came from a lot of pain. I was trying to think of ways to show views and to pull out specific problems and to do it all at once. And there’s no way to do that in a 2D graph that I’ve found yet. You either end up with a wall of technicolor dots that don’t mean anything, or you have to hover over one to get all of the context out of it. So, yeah, that did work out. I’m happy with that.
Chris Churilo 00:51:47.839 Sometimes, the simplest solutions are just the best. So if anybody has any questions-I think everybody’s just in awe of what you’ve done. Great job, by the way, Dylan. I really appreciate that you walked us through your journey from the beginning to where you guys are now. So as we just leave the lines open for questions, what’s next for you guys?
Dylan Ferreira 00:52:18.823 Well, I definitely want to play a lot more with Kapacitor. I think it was a couple of webinars ago that I had joined, where I learned that pushing data directly into Kapacitor like that is kind of not recommended, and it’s better to push data into a database first and then hook Kapacitor in as a stream: take a stream of the table and then do your aggregations and stuff. So I want to go back and revisit the anomaly-detection stuff that I was working on. That stuff I just got super excited about, because identifying spam, a compromised account, or a compromised system can be difficult, and it can be very difficult to identify incoming spam waves. And the faster we can do that, the better it is for our customers, right. So I definitely want to play a lot more with Kapacitor. And actually, Telegraf. When I first started working with the TICK Stack, Telegraf felt to me like just a monitoring piece that you stuck on your machine to gather system metrics and stuff. But looking into it recently, wow, it’s got so much stuff in it now. I can pretty much replace my json_to_influx thing just with Telegraf if I don’t need to store string-based field data. So I’m kind of excited to delve into Telegraf a bit and make more use of it.
Chris Churilo 00:53:53.357 It’s interesting, you’re not the first user to say that. I think sometimes people are a little bit surprised by the power of Telegraf. People are like, “Oh, it’s just a little collector agent,” but, yeah, it’s made a lot of progress. You did get a comment from Roman. Roman says, “Awesome webinar. Thanks a lot.” And I agree. In fact, what I’m going to do, Dylan, is that we have a regular lunch-and-learn with our developers here at Influx, and I’m going to actually play this recording because I think they’ll really appreciate the journey that you took. And who knows, maybe Michael or Nathaniel will even help you with your Kapacitor questions.
Dylan Ferreira 00:54:34.148 Cool.
Chris Churilo 00:54:36.593 All right. Are there any questions? You guys must be just excited. But if you do have any questions after, as often times, that happens with these webinars, just shoot me an email and I will forward it off to Dylan. I’m sure he’ll be really happy to answer your question.
Dylan Ferreira 00:54:52.621 Absolutely.
Chris Churilo 00:54:54.341 And then if you ever go to any events, Dylan goes to a lot of these events, so look him up on Twitter. And he’s got his Twitter handle here on the page. And you never know, you might bump into him. Are you going to Monitorama this year, by the way?
Dylan Ferreira 00:55:11.975 Absolutely. I love
Chris Churilo 00:55:16.564 Yeah. So Nate, who you met in Amsterdam, will be going as well as a couple other people. And if any of you on the attendee side are going, make sure you connect with Dylan and you get a chance to talk about what he’s doing at FuseMail because it’s pretty interesting. All right. Well, with that, thank you so much, Dylan, and thanks everybody for sticking around. And I will post the recording later on today. And like I said, if you have any questions, feel free. I will definitely be forwarding them. And just so you guys know-in fact, there’s a couple of guys that are in the attendee list, you know that I will forward your questions to the speakers. I definitely want to make sure that I can help you guys get connected so that you can make your project a lot stronger. So thank you again, Dylan, and thanks everybody for joining us.
Dylan Ferreira 00:56:11.355 Thanks. See you.
Chris Churilo 00:56:12.832 Bye-bye.