In this webinar, Steve Moreton, Senior Technical Director of CJC, showcases how his team built a big data visualization platform used in the capital markets for ITOA (IT Operations Analytics). MosaicOA uses InfluxDB to store metrics from hundreds of specialist servers, networking and middleware systems, with some firms doing 1 million database writes per minute. The data (CPU, memory, network and application data) is queried and visualized for clients to view for a variety of beneficial use cases, including root cause analysis, capacity management and machine learning. Steve details the many aspects of building such a platform, as well as the operational and onboarding challenges of designing killer visualizations for the demanding capital markets IT crowd.
Watch the webinar “How CJC Built a Performant Big Data Visualization Platform to Be Used by Their Capital Market Customers for ITOA with InfluxData” by filling out the form and clicking on the download button on the right. This will open the recording.
Here is an unedited transcript of the webinar “How CJC Built a Performant Big Data Visualization Platform to Be Used by Their Capital Market Customers for ITOA with InfluxData”. This is provided for those who prefer to read than watch the webinar. Please note that the transcript is raw. We apologize for any transcribing errors.
• Chris Churilo: Director Product Marketing, InfluxData
• Steve Moreton: Senior Technical Director, CJC
• Peter Williams: CTO, CJC
Chris Churilo 00:00:00.480 So with that, as I promised, it’s three minutes after the hour, so we’ll go ahead and get started. Good morning, good afternoon, everybody. My name is Chris Churilo, and I work at InfluxData. And today, we are honored to have Steve Moreton from CJC, who’s going to be reviewing how he and his team built the Mosaic product using InfluxDB as the backend Time Series Database. So with that, I’m going to pass the ball over to Steve and have him get started.
Steve Moreton 00:00:30.689 Okay. Thank you very much, Chris. Firstly, I just want to say hello to everybody. I think it’s great attending the conference, from what I’ve been told. I just want to say thank you to Influx for giving us the opportunity to say what a great relationship we have. We’ve built a very successful product, and we’ve got a fantastic partnership. So thank you very much there. To the group who are on here today, I just want to explain who CJC are, to give you a little bit of background. So CJC is a managed services company. We’re a service provider. But what we are is very specialist and niche. We operate exclusively in the capital markets industry, financial markets. So we have offices in London, New York, Hong Kong, Singapore. We have people dotted around in places like Sydney, Carolina, and so on. And what we do is we support real-time data infrastructures inside the capital markets. So to give you an insight into what type of data that is, it’s data which comes from places like exchanges—the New York Stock Exchange, the London Stock Exchange and such. And they are normally sending their own data feeds straight to the clients, but you also get well-known market data vendors such as Thomson Reuters, Bloomberg, and there’s millions of instruments all updating, sometimes once every minute. Sometimes you’ve got thousands of updates every second.
Steve Moreton 00:02:07.794 But a huge amount of industries and people need that data all around the world, and they need it very, very low latency, all at the same time. And CJC’s been operating since 1999. And actually, one of our big clients is Thomson Reuters themselves. So what we do is we support their infrastructure in 20 data centers globally. So a few thousand servers, 500 clients [inaudible] have some things there. And we have a 24/7 support model around these mission-critical infrastructures of the clients globally. Now, when you’re a managed service provider, one thing that people have always asked us is: can we have visualizations of what our IT systems are doing? And this has always presented us with a bit of a headache. So what I’m going to show you on this first slide is how we used to do it before Influx. So just give me a second. So what we did was we had kind of basic information, just giving you things about CPU information, how many users are logging into the system, how much memory they’re using, very basic statistics from Linux, the operating system, network statistics, and so on. And as you can see, if we look at that, we’ve got October 2012 there. We’ve actually got one, two, three, so basically four, maybe five points of data there.
Steve Moreton 00:03:35.910 So what we did was we were just culling the information from the infrastructure, storing it in a database, and very basically taking quite a lot of the granularity out and then doing these quite basic sort of visualizations. I’ll just show you another one here. Here again, it’s not very granular. The most granular you’re kind of getting is one update per hour there. And again, there’s a lot of things like CPU information, memory. And it wasn’t really very, very detailed. And as we grew, you could see that this was more and more challenging. And we were actually having clients come to us saying, “This is a challenge.” So enter a company, Royal Bank of Canada, and they came to us and said, “Look, we need an ITOA solution, an IT Operations Analytics solution. We want to view all our infrastructure, and we want to have incredible granularity, and we want to be able to visualize every single metric on these servers.” So if you look at what they were doing, then basically, this is pretty much how we were doing it. So if they had X amount of servers, those servers had what we call metrics. It’s CPU information, Linux operating system, and so on. And then you had all the application-specific metrics.
Steve Moreton 00:05:04.016 You would have routers. You would have switches. You would have black box solutions. You’d have infrastructure from other companies, again, like I mentioned, things like Thomson Reuters, Bloomberg. And then they have other very, very critical infrastructure as well called Solace. Now, if you’ve got, let’s say, 50 servers, there’s potentially 256,000 metrics which are actually working on those servers, like the CPU information. Now, the client was actually only storing 5,000 out of 256,000 metrics. And what would happen is that it’s a real kind of finger-in-the-air exercise, so people would be guessing at the kind of statistics which would be interesting. So saying, “Right, let’s keep the CPU information. Let’s keep the hard disk. And then we’ll do a handful of metrics from the application level.” And the granularity of the data was quite high. Again, you’re not actually getting one update an hour or one update a month. You’re getting potentially 32 updates per second. You could have a CPU with 32 cores, all doing 32 updates per second. So that data would quickly get very, very big, which would cause a cost problem. So they could store the data, but what’s a problem for many capital markets participants, and many other industries as well, is internal charging for databases.
Steve Moreton 00:06:30.328 So for them, having a big database meant they were continually purging the database. They could only keep the data for a short period of time, three weeks, potentially. And they could only do a handful of the metrics. And then when they actually stored all the data, it was visualized in not really brilliant tools like Excel. So you would take a lot of the granularity out, you’d import it into Excel, and then basically do—we know it very well—that classic Excel-type visualization in a graph. Again, I’ve just mentioned how expensive the costs were. But not only that, the databases weren’t fit for purpose. So they were using, I think it was, Sybase. And for the kind of querying to get the visualization out, it wasn’t very suitable. So they gave us a task. So again, at first, we were looking at this old-fashioned solution. And obviously, almost within a day we realized that this wasn’t fit for purpose. We were going to have to build this from scratch. And we looked at this, and we got our developers and our engineering teams, and we decided on a raft of nice-to-haves we wanted, how we wanted it to look, the type of things that we wanted to support from the visualization standpoint. But then, with all of that drawn up: what type of database do we need?
Steve Moreton 00:07:56.390 And obviously, Influx came very quickly to the top of the list. We realized very quickly that we were looking for a Time Series Database. And one of the leading Time Series Databases in our industry is actually—I won’t mention its name, but it’s actually very expensive. And obviously, the client actually said that they were running away screaming from the costs of this particular Time Series Database provider. So we couldn’t use that. And obviously, we had to keep costs down, so we were looking for something potentially open source. And again, one of our developers felt very strongly that Influx had a lot of things which could be of value here, especially knowing how big the data was: being able to keep the shape of the data over a long period of time, and then define the detail via data retention policies. And we also had the challenge of getting all these different providers in. So we’d have to have collectors for potentially all these different sources. So you’ve got well-known companies like Thomson Reuters and Bloomberg. But also you have the fact that, obviously, we’re using a middleware messaging platform called Solace to basically consolidate all the messaging into basically one messaging system.
Steve Moreton 00:09:12.721 So it gave us quite a lot of technical challenges. And we started to build the system. I’ll just give you some background. So we did a POC. So they actually sent us quite a lot of their data in a flat file, which we then imported into Influx. We’d partnered with a company to give us a front-end. And then basically, we initiated the pilot with live data. Now, at this stage, a lot of the clients we have are very, very sensitive about their data. Even though this wasn’t technically sensitive information, that’s how these capital market participants are. So we couldn’t use a cloud platform provider like AWS or Google. We actually had to invent our own system, as such. So we actually have our own cloud platform in Equinix data centers in London. So because the client was in Equinix, and we were in Equinix, we did what’s known as a cross-connect from their data center into—well, their cabinet into our cabinets inside the same data center. And so we had quite a fast connection there. And then what they did was they streamed data from their [inaudible] infrastructure. So the POC was doing data coming from New York and Toronto, specifically.
Steve Moreton 00:10:33.616 And that went very, very successfully. We quickly realized we were getting about 100,000 metric updates per minute. And it became a very, very successful project, as I can show you there. It became [inaudible]. So I don’t want to do a death by PowerPoint. I think it’s best to actually show you the system. So I’ll just bring it up here. So the front-end is a browser-based GUI. And in essence, what I’ve got here is what we call the workbench view. This is the first tab. And then basically, as I mentioned, the data streams via a messaging system from, actually, a monitoring system. It’s a third-party monitoring system that the client already has on their side. We have a collector which streams the data directly into an InfluxDB database, which resides in our private cloud. And then we have a front-end, and that front-end queries the database. And that query engine is called a persistence engine. And what this means we can do is we can look at all the servers.
Steve Moreton 00:11:40.127 Now, I’ll just let you know that this particular client I’m showing you here is an absolutely brand spanking new client, and they went live with us on Valentine’s Day. And what I’m just going to do is I’m just going to grab one of their servers here, and then I’m just going to drag and drop it into here. And what I want to show you specifically is the metrics which we’re getting. So this particular server has, since Valentine’s Day, created 271 million metric updates. Now, I’ve got in this environment 45 servers. And I do capacity management with my capacity management server. So I checked this morning how many metric updates we got from all those servers, and we’ve got 9 billion updates which have been stored into an InfluxDB database since February the 14th, 2018. And as you can see, it’s quite nice to use. You just saw how quickly it pulled out all the information I’ve got available. And as I mentioned, we’ve got this ability to look at the data very quickly. So I can see what I mentioned were the baseline statistics there. I can see the CPU information. I can see the Linux, the network statistics. But then these particular servers have what’s called—this one’s called an ADS. And this is an application-specific tool.
Steve Moreton 00:13:11.118 So what I can do is I can just take just the CPU information alone from this server and drag and drop that in. And then basically, I can expand that out, and just that CPU has created 263 million metrics on its own. And then basically, I can say, just show me what the average CPU’s doing, and it can bring it up. I can also multiple-select other things from the database and drag and drop them in. So if I just get that average CPU—and this is one of the key features here. I’m just going to hold the live updates. So generally, when you see an ITOA system in any demo, you normally see this view here, where it’s live updating. And this goes back to a previous challenge—that you could only store a short-term period of data. So it’s really good. It’s very, very easy to show the data as it comes in and you’re going back three minutes. But really to do ITOA, you need to keep the data forever. And as you can see, it’s a huge amount of data potentially being kept forever. So what you can actually do here is—I’m just going to expand this out. And then what I can do is I can actually send it—I know the date it first came in. So it came in on the 15th, and now I can say go. So now I can see what that average CPU’s been doing since Valentine’s Day this year.
Steve Moreton 00:14:43.622 And this was just quickly the persistence engine interrogating the InfluxDB database and then showing it on the screen. One of the nice features is the ability to zoom in. So I mentioned a few minutes ago that when we designed this system, what we wanted was a Google Earth approach to the data. We wanted to have an accurate visibility of it, so when you go onto Google Earth and you look at, let’s say, the USA, you get an accurate depiction of the data for the altitude you’re looking at. But then you want to zoom into the data. So I can zoom in like so. And then as you may notice in Google Earth, it’s a bit blurry for a second. But we can hit the Refresh button. And then what it’s going to do is actually bring in the more granular information to show us what that data looks like accurately. This actually is a very, very key feature that clients like, because generally you don’t want to just see everything as granular as possible. So again, there’s that spike I mentioned. It took it from daily information down to the detail. And this particular client is sending us 1 million metrics from their whole estate per minute. So it’s very, very granular data. And again, I can just zoom in and show it here.
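The “zoom and refresh” behavior described here can be sketched as an interval-selection step in front of the query: for whatever time window is on screen, pick the coarsest downsample interval that still returns enough points to draw accurately. This is a hypothetical sketch, not CJC’s persistence engine; the measurement name, field name, and point budget are invented, and the query string just follows InfluxQL conventions.

```python
from datetime import timedelta

# Candidate downsample intervals, finest to coarsest, expressed as
# InfluxQL GROUP BY time() durations.
INTERVALS = [
    ("1s", timedelta(seconds=1)),
    ("1m", timedelta(minutes=1)),
    ("1h", timedelta(hours=1)),
    ("1d", timedelta(days=1)),
]

def choose_interval(window: timedelta, target_points: int = 1000) -> str:
    """Return the coarsest interval that still yields >= target_points."""
    for name, step in reversed(INTERVALS):
        if window / step >= target_points:
            return name
    return INTERVALS[0][0]  # window is tiny: use the finest interval

def build_query(measurement: str, window: timedelta) -> str:
    """Assemble an InfluxQL query at the resolution the 'altitude' needs."""
    interval = choose_interval(window)
    return (
        f"SELECT MEAN(value) FROM {measurement} "
        f"WHERE time > now() - {int(window.total_seconds())}s "
        f"GROUP BY time({interval})"
    )
```

Zooming in shrinks `window`, so the next refresh naturally selects a finer interval, which is the “blurry, then sharp” effect Steve demonstrates.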
Steve Moreton 00:16:06.160 So again, this has all been pulled in as I do it from the InfluxDB system. So it’s very nice, and you can keep on zooming in, keep on getting more granular until what you get is into micro mode. So what I’m just going to do is I’m just going to go into just a little bit of dashboard. So we’ve got another system here. I’ll just go into this. So I’ll just show you some of the nicer features we’ve got. Obviously, I’m doing a demo on my new PC here. Just refresh this. So what we could do is—we could interrogate all the data. So I’ll just show you. This is a production client now, which can show you this. It’s the RBC site. So what I can see here is we’ve got the gateway, so we’re pulling in data from all their infrastructure from London, New York, Toronto, Hong Kong, Sydney, and Tokyo. Let me just go to first available and last available there and refresh that. We should be seeing—oh, there we go. They’ve just renamed the data. I didn’t see. But there we go. There’s New York. And then if I just remove this, I can then basically see what we define as a server. So you can see you’ve got quite a lot of things in here, like these are all different servers, all different components. And I did notice Influx there. So last time I checked, they didn’t have Influx, so they’ve obviously generated a new sale at RBC there, which is very good.
Steve Moreton 00:17:48.152 And I can type in absolutely anything I want here. So I can type in Bloomberg, for instance. If I spell it right. And I can find all the Bloomberg appliances. I can type in something generalized like CPU. And I can find anything which is called CPU. And I can just change the way this looks. So I can go into what’s called the data view. So some of them might have the CPUs being logged by the monitoring system as top CPU. So then I can see all the servers. And it’s very, very easy for me to just drag and drop this. And again, the sort of secret power of this is the InfluxDB database. And what we can do is we can use the workbench to build up some nice visualizations. So I can show you one here. So we’ve got quite a strong dashboard. So this particular client, who I should’ve renamed, is—this is ethernet statistics, so how many megabits they’re sending. And again, we’ve got this granular view. But again, we’ve got the altitude for this month. Again, I can just say I want to go and have a look at this for the last—I could just say the last day. Yeah. And then it’ll go to the database and give me their last day’s worth of information.
Steve Moreton 00:19:10.994 And then what we’ve got is Influx. One of the key features we like is the detail via data retention policies. So if I just go to first available and last available, what it’s going to do—because they’ve only been on since the 14th—is jump between policies very quickly. So what we actually have here is just the daily information. So the peak daily information per day. And I’ll just bring up another screen here. Nice example of it. Just open preset. It’s here. So this client particularly has quite a lot of data here. And this data is currently going back to September the 29th. So we just pulled that information from Influx. And then what we’re saying is I’m looking at the last 178 days’ worth of data. So what I can actually do now is I’m just going to change that to 365 days. So if you notice here, it’s using the hourly policy here, so it’s showing the hourly data. So you can see that’s a day. That’s a day. That’s a day. And you can see where the weekends are. If I click on Apply here, you’ll notice the shape of that data has just changed. And that’s where we’ve jumped from one detail policy to another detail policy. And again, at any point in there, I can go and say, “Well, I want to zoom into this here,” and that’s accurate for the altitude I’m looking at, but now I’m zooming in, it gives me a different detail retention policy.
Steve Moreton 00:20:50.567 And the policies are key to how this works. So if I just show you this again. I go back to this screen. So this is showing me—ah, April the 14th is when the first data came in from these appliances, rather. And what happens is, for the first 60 days, we keep every piece of information that comes in to us. So this particular client has got 250 servers on with us right now. And we pick the data up from the monitoring system, which is installed on those 250 servers. We get 100,000 metric updates per minute from those servers. And then we keep every single one of those 100,000 metric updates for 60 days. And then what we do is, after 60 days, we change the policy. We change the policy from being what we define as all the data, the tick data, to minute data. So we keep the min, max, and mean for the minute, and we keep that for a further 30 days. After 30 days, we do the same for an hour policy. And that takes us to 180 days.
Steve Moreton 00:22:06.072 So out to 180 days, the least granularity you’ve got is an hour, then a minute before that, and then the initial 60 days you’ve got all the detail, absolutely all the detail. After 180 days, we only then keep one update per day. So this is where we can keep on showing that information for all time. So in five years’ time, I’ll just have, every day, one update per day from 250 servers. And what that means is we can keep the shape, but the data size doesn’t actually grow. So yes, it’s big data. It’s a lot of data coming in, but we know what it’s going to be. Other clients might have a requirement to keep the very, very granular data. But what we’ve actually found is that they only need to keep the most granular data for the 60 days. It’s more important that you can keep the shape. And this is great for when you’re creating thresholds, seeing the data. So as I mentioned, this is a capital market environment. And what we’re seeing on screen here is actually two of the most important appliances that this particular environment has. So they have a middleware messaging system called Solace. And they have a lot of things running on their backbone. And one of the key metrics is something called subscriptions. It correlates, as I understand it, with the amount of users and the amount of data on the backbone.
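The cascade Steve describes (all tick data for 60 days, minute min/max/mean for a further 30, hourly out to 180 days, then one point per day forever) maps naturally onto InfluxDB 1.x retention policies plus continuous queries. A hedged sketch follows: the database name `mosaic` and field name `value` are invented for illustration, and only the minute and hour rollups are shown.

```sql
-- Keep every raw update ("tick data") for 60 days.
CREATE RETENTION POLICY "tick"   ON "mosaic" DURATION 60d  REPLICATION 1 DEFAULT
-- Minute-level rollups for a further 30 days (90 days total).
CREATE RETENTION POLICY "minute" ON "mosaic" DURATION 90d  REPLICATION 1
-- Hourly rollups out to 180 days.
CREATE RETENTION POLICY "hour"   ON "mosaic" DURATION 180d REPLICATION 1
-- One point per day, kept forever, so the shape survives.
CREATE RETENTION POLICY "day"    ON "mosaic" DURATION INF  REPLICATION 1

-- Continuous queries downsample into the coarser policies, keeping
-- the min, max, and mean exactly as described in the talk.
CREATE CONTINUOUS QUERY "cq_minute" ON "mosaic" BEGIN
  SELECT min("value"), max("value"), mean("value")
  INTO "mosaic"."minute".:MEASUREMENT
  FROM "mosaic"."tick"./.*/ GROUP BY time(1m), *
END

CREATE CONTINUOUS QUERY "cq_hour" ON "mosaic" BEGIN
  SELECT min("value"), max("value"), mean("value")
  INTO "mosaic"."hour".:MEASUREMENT
  FROM "mosaic"."tick"./.*/ GROUP BY time(1h), *
END
```

Because each policy has a fixed duration and the daily policy keeps only one point per series per day, total storage converges to a predictable size even as ingest runs at a million updates a minute, which is exactly the “data size doesn’t actually grow” property above.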
Steve Moreton 00:23:32.806 Now, these are their top two appliances: GC6 and GC5 here. You can just open them up. So what I’m just going to do is go into some of the more enhanced features here. So I’m just going to add a trend line here and click on Apply. And you can see the trend line’s just been added. What I can do is I can actually do a bit of an end-date projection here. So if I go to December 31st and click on Okay, we’ve got a nice little projection there. And you can actually see how accurate that is because you can see how the system’s been growing. And what’s good is this here is actually—it’s our good old mate, Donald Trump. He causes quite a few little market swing events here and there. I can’t remember specifically what this did, but it did cause a bit of volatility and a market swing event in the market. So if we were only keeping that data for three weeks, that would’ve gone a long time ago. Now we can still see it. So I want to show you something interesting which happened. And this is quite a nice use case the client has. So what I can actually do is tell the database, yes, I want to see the last 365 days, but I only want to see it to this point. So I’m just going to put in Valentine’s Day again. It seems to be a key date in market data history.
Steve Moreton 00:25:02.982 And then what was actually happening is you can see that the trend line is going up. Now, we are partners with Solace, and we were working with them on the data, and they saw this statistic. And they said, “That is very scary.” If they get to 7 million—I’ll just have to put the threshold in there so you can see it on screen. Just put this in. So you can see this line here. They said, “That line there is where those appliances will both crash.” So Appliance 1 is actually going to crash on September the 29th, 2018. And Appliance 2 is probably going to be some time in 2019. So they had to start doing something. So this was escalated to the client, who could see this was happening. And what they could do is—I’ll just remove this. And you can actually see they did some [inaudible] and took some data off it. And then you can see the subscriptions went down. So now they’re going way under that threshold. And again, this is a great power. If I just put in this date here—again, I’ll just put in ’18, I can equally do this from here—just showing it from the second or third, and see how that trend line looks.
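The projection being demonstrated (fit a trend line to the subscription counts and read off the date the 7 million appliance limit gets crossed) can be sketched with ordinary least squares. This is a minimal illustration, not MosaicOA’s actual projection code, and the sample numbers in the usage note are made up.

```python
from datetime import date, timedelta

def fit_line(ys):
    """Ordinary least squares for y = a + b*x, with x = 0, 1, 2, ..."""
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    return a, b

def crossing_date(start: date, daily_counts, limit: float) -> date:
    """Estimate the date the fitted trend reaches `limit`."""
    a, b = fit_line(daily_counts)
    if b <= 0:
        raise ValueError("trend is flat or falling; limit never crossed")
    days = (limit - a) / b  # solve a + b*x = limit for x
    return start + timedelta(days=round(days))
```

For example, daily counts of `[100, 110, 120, 130, 140]` starting on 2018-01-01 fit a slope of 10 per day, so a limit of 200 is projected for 2018-01-11. As Steve notes, the answer depends heavily on how much history the fit sees, which is why keeping the long-term shape matters.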
Steve Moreton 00:26:44.907 And again, depending on how much information you keep, it gives you radically different data from a trend line perspective. So if you’re only keeping a short bit of information, you’re not really going to get the stats that you want. And different teams in the bank have different requirements. So some teams—let me just clear this out from both sides. Some teams, if I just bring in where we are today, this is all the data—some teams might say, “Look, I just want the last five days’ worth of information.” And then they can see it. But they might be looking—yeah, okay. This is what’s going to happen. But they might only be interested in looking at what’s going to happen for the rest of the day as well. A different applications team might be looking at stats because they want to see how it’s going to affect the Solace appliances in the next few minutes. So again, they can actually go—we can just take that out and go to today, and then mark as tomorrow the 28th as being the end date. I click on Okay. And then what we can see is where we’re going to be. And we could then go and say, “Look, let’s just bring in the last five minutes of data,” and then click on Okay. And again, it’s actually showing us where we’re going to go there. So you can really select and use the data to get the information you need right now based on where you are.
Steve Moreton 00:28:23.585 The other beautiful thing is what I always call the NASA formula for the end of the world or the end of the universe. There’s lots of gaps in that data. They know the formula. They just don’t have the data yet. And this is something we found: when you’re doing a lot of storage, suddenly you’ve got the data and you can start seeing things. So if you’re looking for pattern matches—so I’ll just get to the right tab here. I’ve just got the—let me just bring it up in a different way. Here. So what we can do, with Influx as well, is that we are looking for algorithms now. So we’ve got all this data. It’s very, very granular data. And what we can do is we can split that data up. So every five hours, we have something called CIMP running, which is Correlated Input Matrix Process. So what we do is we get all the data, and then we’re looking for data which goes over a certain threshold, which can make certain things stand out more. So this is where we can see all the data in one line, and then when there’s outliers coming up, we can spot them. And that means we can bring in a data scientist or somebody who can create algorithms to help us predict when events are going to occur.
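The CIMP internals aren’t described in detail in the talk, so this is only a loose sketch of the general idea mentioned above: flag the points in a series that stand out beyond a threshold, so a data scientist knows where to look. The standard-deviation rule and the cutoff `k` are assumptions, not CJC’s algorithm.

```python
import statistics

def outliers(series, k=3.0):
    """Return indices of points more than k population standard
    deviations from the mean; a flat series has no outliers."""
    mean = sum(series) / len(series)
    sd = statistics.pstdev(series)
    if sd == 0:
        return []
    return [i for i, v in enumerate(series) if abs(v - mean) > k * sd]
```

Run over every series in the estate, a pass like this reduces millions of points to a short list of moments worth investigating, which matches the “see all the data in one line and spot the outliers” description.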
Steve Moreton 00:29:49.125 So this is something that we’re doing right now. And we are actually rolling this out to our clients for testing at the moment. So this is, in effect, a sort of machine learning. So what this is actually doing is giving us some great stuff. So our clients have come to us and said, “Look, can you do this for us?” And we’ve got one client who said to us, “I don’t know if my CPUs are good or not. Is my CPU running at 60% good or bad? Or is it running at 80% good or bad?” So he wants to know how things shape up against where he is on an average day. So what we did—and it was very, very quick to do—we’ve got this thing called comparative analysis, and it works well on this one, although the CPU’s not doing much in this environment. This sort of faint green, hopefully you can see it, is what happens on most days on this particular server. And then this purple is, obviously, what is happening right now. And then this sort of dotted line is where we’re projecting it’s going to go. And this is something where a client comes to us now and says, “Hey, we need this.” And then it really only takes us a short amount of time to build it. And a lot of it is just leveraging the power that Influx gives us. So this is still a bit of a work in progress. And I’m just going to have to get to the right screen here.
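The comparative-analysis view just described (the faint green “average day” against the live purple line) can be approximated by bucketing historical samples by time of day and averaging across days. A minimal sketch, assuming hourly buckets; the field names and the ratio-based comparison are illustrative, not the product’s actual method.

```python
from collections import defaultdict

def baseline(samples):
    """Build a per-hour 'typical day' from (hour, value) pairs
    gathered across many past days."""
    buckets = defaultdict(list)
    for hour, value in samples:
        buckets[hour].append(value)
    return {h: sum(vs) / len(vs) for h, vs in buckets.items()}

def deviation(base, hour, value):
    """How far a live value sits above/below the typical day, as a ratio
    (0.5 means 50% above normal for this time of day)."""
    typical = base[hour]
    return (value - typical) / typical
```

This is what turns “is 60% CPU good or bad?” into an answerable question: 60% at a quiet hour may be a large positive deviation, while 60% at the market open may be perfectly normal.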
Steve Moreton 00:31:24.678 So what I’ve just done is—we’re just showing you some of the things that this’ll end up looking like. So what we’ve got is event analysis. And we can actually get any event which is happening during the day. I might be able to show you this on that other site quite well. So we only just put this in yesterday for a client. So what I can do is very quickly see where most of my error messages are, and I can see what day it is. So I could basically say, get rid of the day and see what the problem is—sorry, not what the problem is. But I can see very quickly that most of my server problems are in here. As I said, I’ve just got a new PC here. So most of my problems are on this ADS user database. And actually, I can see which server is creating the most problems. And again, we can pull this out, and then we can actually start learning how to keep that server from hitting this issue in future. And then what we’ve got is the ability to do hot and cold servers. So what we can do is say, “Look, these servers are running very hot now compared to where they normally are.” And what these sheets are is where we get all the servers. So instead of having all the CPUs, I can ask every CPU in the bank to appear on one screen. And what I’m just going to get is a lot of lines, a lot of noise on that page. It’s not actually going to tell me anything.
Steve Moreton 00:33:06.028 But what we can do with the CIMP process is actually show all the data happening on one line. And one of the things which happens a lot, obviously, in market data—if I just set it to show me the last month again here—is we have days, and weekends, and evenings. We have market hours. So what you’re seeing is basically servers running in market hours. So the market open, which starts, as you can see, around 8:30, and then it starts dying down here as the market closes in US time. So you’ve got London and New York running on these servers. So it tends to finish around 10:00 PM. So you’ve got Monday, Tuesday, Wednesday, Thursday, Friday, weekend, Monday, Tuesday, Wednesday. What this causes is quite big gaps in the data there. So what we’ve got the ability to do—I’ll just put this in there—is get rid of the Saturdays and Sundays, and get rid of anything which isn’t market hours as well. So we won’t just have this, which is nice to see. We’ll actually have more accurate reporting on how this is growing. It’s been a brilliant time-saving tool as well. So the way the old method used to work was you would export the data out. So you take the data, you take the granularity out, you would then put it into Excel, and then you do the classic F11. That would take an engineer, potentially, a whole day.
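The weekend and market-hours filter Steve applies can be sketched as a simple predicate over timestamps. The 08:30 open and 22:00 close come from the talk (London open through the US close); treating them as one fixed window, with no holiday calendar, is a simplifying assumption.

```python
from datetime import datetime, time

OPEN, CLOSE = time(8, 30), time(22, 0)

def market_hours(samples):
    """Keep only (timestamp, value) pairs that fall on Mon-Fri
    between the market open and close."""
    return [
        (ts, v) for ts, v in samples
        if ts.weekday() < 5 and OPEN <= ts.time() <= CLOSE
    ]
```

Dropping the nights and weekends before fitting removes the periodic gaps Steve points out, so growth trends and capacity reports reflect only the hours the servers are actually under load.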
Steve Moreton 00:34:42.812 What we found is that we can actually do this very quickly. So I might as well show you now. I’ll just go to this tab. But it might pop up a quick error message here because this is a new client we’ve got alerting on. So what I can do is say, “Show me the first available data,” which was the 14th of Feb, to now. And I’ll bring this in. So just doing this the old way would’ve taken an engineer a day. But we’ve been able to do that in a couple of seconds. Now, this is something that is called Snapshot Request. This is where the server’s asked to bring in all the market data information as a snapshot right now. This is one of the things which impacts CPUs the most, and it’s something that people haven’t really seen before. It’s something that, yeah, you can look at CPU information and Linux information, but this is information which is very specific to the application. And when we store this, nobody actually had any idea how much this is growing. So you can see, actually, from February the 14th to now, the client’s actually doing 435 million snapshot requests per day. As of last week, they’d done 540 million. So the increase was almost 100 million snapshots over quite a short period of time. What we’ve noticed is yesterday and today, they were actually only using [inaudible] now. But we can visualize it.
Steve Moreton 00:36:15.249 And what we’ve actually seen across the entire client estate, there’s been a growth from 750 million snapshot requests to well over a billion. All this data is fantastic. As I said, I’ve got 9 billion metrics and counting. Every server estate gives me 130 million new metrics every day. And again, it’s got some fantastic power we can add. We’ve been doing things like URL-driven templates so that we can associate this into third-party systems. So I’ve shown you the system a bit. What I’ll just do is show you a little bit of how it works. So just to review, we’ve got a private, secure, cloud-based infrastructure offering. What’s key is that the infrastructure is central to how it works. We’re potentially having 1 million metric updates coming into those databases every minute. And one of the challenges before was hardware, so cloud infrastructure gives us elastic scaling. So we can pull in resource as and when we need it. When things go hot, the system scales with that. And it also gives us very, very high-performance storage. Doing big data visualization can be very costly. As I said, with the clients we have in the capital markets, this is very, very sensitive information. So the things I’m showing you, you can’t see on the normal Internet. You can only see them from our offices or the client’s offices. And there’s lots of data coming in, so the clients connect via cross-connect, or we use something called VPLS, which uses a non-dirty Internet. And we’ve got clients coming in from Chicago, New York, all those locations I mentioned. So that’s sort of done on the non-dirty Internet straight to the clients.
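A quick back-of-envelope check on the ingest figures quoted here (1 million metric updates per minute, 130 million new metrics per day) shows the sustained write rate the databases have to absorb:

```python
# Back-of-envelope arithmetic on the ingest numbers quoted in the
# talk. Nothing here is CJC-specific code; it just restates the
# figures as sustained rates.
writes_per_minute = 1_000_000
writes_per_second = writes_per_minute / 60          # sustained write rate

metrics_per_day = 130_000_000
avg_per_minute = metrics_per_day / (24 * 60)        # average daily rate

print(round(writes_per_second))   # roughly 16,667 writes/second
print(int(avg_per_minute))        # roughly 90,000 metrics/minute average
```

The gap between the ~90k/minute average and the 1M/minute peak is why the elastic scaling and high-performance storage Steve mentions matter.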
Steve Moreton 00:38:09.214 We have to have collectors coming from different sources. So we have monitoring systems. Most of these infrastructures have what’s called an exhaust. So we just pull it. And we’ve got multiple different messaging systems which pull it in. So we use Kafka, but we can also use the Solace system as well. All-important is the InfluxDB time series database, which has been designed for the big data storage, the queries, and our system’s projections. I’ve just told you about the data scope vectoring we do. Again, that’s something that we could’ve only done pretty much with Influx [inaudible]. And I’ve talked about the things—we have to monitor this 24/7, just looking at the infrastructure as it sits. So this is pretty much an average client. What they have on their client side, if need be, is those messaging systems. So clients use Kafka or open access Solace. And they stream the data from their monitoring systems into our environment. Because the data’s so big and volatile, getting the data into the database at first is a little bit like wrestling a [inaudible]. So we bring it in, and we have to do it into a QA area. And we onboard it in QA. And one of the good things we can do is that there’s a lot of data. We subscribe to all the information going to the clients. But before we gather it, we actually say, “Look, we don’t need to see these servers, or these particular views aren’t important.” So we can do very, very quick onboarding there.
Steve Moreton 00:39:45.969 And then what happens is then we move the client to the production site. So we have the web server, which shows us the data. And then we have the database and persistence engine. So generally, we have about three persistence engines running on the service. We scale in terms of what we call data views and [inaudible]. So it’s basically how many servers, what’s being monitored, and what the granularity of that monitoring is. And then one thing I haven’t touched upon really is this area here, which is how we define this as eating our own dog food. So I’m just going to go and show you the little screen we’ve got here. So what I do is all the clients that I’ve got on, I do management of. So I can see and do comparison. So what I’ll just do now is just see if I do this live, which always ends in [inaudible]. I click on Okay. And then what I’m just going to do is I’m just going to look for what we call persistence. Actually, device. Bring it up as the data view. And then what I’m just going to do is drag and drop this into here. Just make sure that was the correct one. But what I was hoping to show you—I probably brought the wrong dashboard up—is that where I can see is compare things like the database writes. This will help me to scale my system. So I use a capacity management system to do capacity management of the capacity management system. We’re coming up with about 15 minutes left. So I think I’ve done enough talking for now. So Chris, I don’t know if you want me to open this up to the floor a bit.
Chris Churilo 00:41:59.574 Yes. In fact, you do have a question in the Q&A panel. So Panykou asks, “How big is the cluster serving the dashboard?” And by that, he means the data collectors, InfluxDB servers, etc. And he just wanted a little review of the architecture. And he asked this question a lot earlier on. So I know you covered some of this stuff.
Steve Moreton 00:42:23.519 Okay. So what we provide clients is that we give them 500 gigabytes of data. And we found that Influx is really good at storing the data. So we define what we call a MOD, which is a MosaicOA data set. As I said, the monitoring system we use—from 45 servers, that can give us a million metrics a minute. And again, some of the capital market participants use very, very specialist software. I mentioned things like—just to bring it up here—TREP and things like that. So what we’ve found is for 250 servers, we’re doing about six months of data. And in fact, it’s three servers which we place into the cloud. And each one of those servers, over six months, holds 500 gigabytes of data. Now, the beautiful thing about the data retention policies, as I mentioned before, is you get the data in very, very granular—as granular as you like. And we keep that for 60 days. And that is where the big part of that 500 gigabytes of data is used. And then after 60 days, we move to one-minute granularity for another 30 days. And then it goes to an hour. And then after 180 days, it goes to daily information. So obviously, after six months of building all that data, capacity then grows only slowly, because the only new data coming into it is in effect the daily update. Everything’s being sort of filtered through.
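The retention tiering Steve walks through (full granularity for 60 days, one-minute rollups to 90, hourly to 180, daily thereafter) can be written down as a simple lookup; this is a sketch of the policy he describes, not CJC’s code, and in InfluxDB itself it would be expressed as retention policies plus continuous queries.

```python
def retention_resolution(age_days: int) -> str:
    """Return the storage resolution for a sample of a given age,
    following the tiering described in the talk:
      - raw granularity for the first 60 days
      - one-minute rollups out to 90 days
      - hourly rollups out to 180 days
      - daily rollups after that."""
    if age_days <= 60:
        return "raw"
    if age_days <= 90:
        return "1m"
    if age_days <= 180:
        return "1h"
    return "1d"
```

This is why, as Steve notes, the bulk of the 500 GB sits in the first 60-day tier: the older tiers hold only rollups.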
Steve Moreton 00:44:08.703 So we do have clients who want to change the detail in the data retention policies. So they might want to keep all the granular information for all time. Other clients have come to us and said, “Hey, can we have it where we only have the granular data for two days, and we then move into the hourly and daily policies?” But this is great. This means we can shape and size it. Obviously, the use case here is what’s known as ITOA, IT Operations Analytics. We’ve got clients who very much say we’ve got to keep everything as granular as possible. So it then becomes a how-long-is-a-piece-of-string question: how big it needs to be. Right now, the main use case is ITOA. So we just do a lot of things there. Each cluster is, I think, 500 gigabytes. I’m happy—I’ve got the question jotted down. I can give you the accurate statistics as a follow-up. Hope that answers your question.
Chris Churilo 00:45:07.992 So Panykou, just let us know if you need more details behind that question. And as Steve mentioned, he’ll get you more details. So as you spec’d this out, and then you actually started implementation of this solution, what were some of the unexpected challenges that you had with all this data? I mean, watching your demo, it’s super impressive. Everything’s in real-time. But I imagine that you probably hit some bumps before you could get to the place that you guys are with your product. Maybe you could share some of those experiences, especially if there are some words of wisdom about pitfalls people should avoid.
Steve Moreton 00:45:47.153 Absolutely everything was a huge headache. We’ve got about 13 people who work on this project. And we have fallen out so many times. And if it wasn’t for the shared vision of what we’re trying to do, we’d have just been [inaudible]. There’s been screaming and everything. The one thing I will say—the one thing we’ve not had a massive problem with is Influx. And I’m not just saying this because I’m on the Influx webinar. We’ve had a great relationship there. You’ve always been supportive. So I have to say the visualization front-end’s been great. The first challenge was realizing how big the data was going to be. Again, we were used to storing 5,000 metrics from these servers. And then actually, a server can have quite a lot. Going back to that previous question, the specialist vendors have quite high update rates. But what we define as baseline, we think of one of our MODs as, in effect, one server. That can visualize 100 servers if it’s just things like CPU, Linux, operating system level stuff—basics. And then you can tune it so that you don’t keep as much granularity. You’ve just got to make those decisions accurately.
Steve Moreton 00:47:14.107 So going back, the challenge was, again, realizing how big the data was. We sat there, and we looked at the monitoring system, and we looked at everything which was being monitored. And when we saw there were going to be 50 servers, we were thinking it’d be about 50,000 things. We were just kind of guessing. When it came out at 256,000 metrics, the development team went, “Why?” And it genuinely scared us. And we had to walk away and really start designing stuff to work. The problem was getting the data into Influx in the first place—once it’s in, that’s fine. That was kind of the challenge. So this is where we had to lead on having the cross-connect, because we knew it was going to be quite a lot of data coming in. Then it was kind of the messaging system, because you are at the mercy of messaging systems. So again, where you sort of talk about Kafka and things, we’ve had a lot of challenges with the messaging systems coming from the monitoring systems and the collectors. They can just stop giving you the data. So you have to do a lot of tuning around that. But we’ve never specifically had any problems with Influx. There have been things like how we sized it. I think there was a memory thing that we worked on. Again, we just sent that over. We have a support agreement with Influx. They provided a solution there. And we’re in constant contact.
Steve Moreton 00:48:39.208 So one of the challenges we have is, let’s say you’ve got this server here, and then the client re-purposes their server, or they put a new server in at the same spec with a different name, and we need to re-allocate the data from the old server to the new server. It’s fairly straightforward to do. We can just do a backup and restore and tag it to the new server and things like that. And we’re working with Influx on a way that we can do that a little bit more easily, so we don’t need to use the development team—we can get our operations team to do it. Another problem was database writes. That wasn’t, again, a problem specific to Influx. That was a problem from the infrastructure. We had to do quite a lot of tuning. We work with a fantastic partner called [inaudible], who provides us our infrastructure in Equinix. And they really, really worked hard on this. And they’ve invested quite a lot of money. Originally, we were on spinning disk technology. They moved us onto SSD-based technology. And they’ve just moved us onto a brand-new storage array. And that’s really, I think, behind some of the things you may have been seeing that have been quite impressive—how quickly the data pops up. It was quick anyway, but now it’s just sensational. And the other challenges are just sales challenges, really. But technically, we’ve got through it. We’ve got a really good system going there. So hopefully, that answers that question.
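The re-tagging step Steve mentions—moving a server’s history to a new name during backup and restore—amounts to rewriting the host tag on each stored point. A minimal sketch over InfluxDB line protocol might look like this; the `host` tag name and the simple parsing are illustrative assumptions (real line protocol also needs escaped-comma and escaped-space handling):

```python
def retag_host(line: str, old_host: str, new_host: str) -> str:
    """Rewrite the host tag in an InfluxDB line-protocol point,
    the kind of transform applied during a backup/restore when a
    client re-purposes a server under a new name.
    Assumes a simple 'measurement,tag=v field=v timestamp' layout
    with no escaped commas or spaces."""
    head, _, rest = line.partition(" ")          # split tags from fields
    measurement, *tags = head.split(",")
    tags = [f"host={new_host}" if t == f"host={old_host}" else t
            for t in tags]
    return ",".join([measurement, *tags]) + " " + rest
```

For example, `retag_host("cpu,host=md-old usage=42 1518566400", "md-old", "md-new")` moves the point to the new server’s series while keeping the field values and timestamp intact.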
Chris Churilo 00:50:15.347 That’s interesting. I really kind of—especially the idea that things can get re-purposed. In a way, not only is this a monitoring solution, but you also have to kind of have it be an asset-management solution, so you know which thing is pulled out, where it’s going to be re-purposed, and then as you mentioned, applying that historical data back to this re-purposed asset. I can imagine that gets a bit tricky.
Steve Moreton 00:50:41.893 This has been very, very successful for CJC. We’re a service-based company. And this is a product we’ve spent a lot of time on. We’ve actually re-learned and learned some new skill sets and so on. But it has proved very popular. So the original client who’s had it, they’re now extending this—not just the 250 specialist servers, but those baseline servers as well. And we were just discussing with them about doing visualizations for potentially 4,000 servers. But some of our other well-known clients are moving to this. We’ve got asset managers who are coming onto it in Chicago, New Jersey, Singapore—very, very well-known capital market participants. But then it’s really got the attention of some of our other partners as well. So they’re looking at things like the sports betting industry and so on, the more generic side. And ultimately, when you’ve got 4,000 servers, you’ve got a lot of change, you’ve got a lot of things going on. Potentially, every weekend, there could be 100 servers being re-purposed. So these are nice challenges to have, of course. But this is where it’s great to work with Influx as a partner, because we can go to them and say, “Look, this is what we need. Can you factor this in, potentially, into the next release?” And it’s like, “Sure. That’s great. Let’s do it.” And it’s really good to have that. Collaborative partners have been the key for us—working with the best of breed. So there has been a learning curve for all of us, but we’ve got through it together as partners.
Chris Churilo 00:52:20.956 We have a couple of questions, but before we get to them, I just want to remind everybody on the call that if you have questions, now’s your chance to put them into the chat or the Q&A panel. So back to the Q&A: Panykou replied back that he would appreciate any of the offline details that you promised. So we’ll make sure that you get connected with him. And then we have a question. So how long did it take your team to develop this entire product, both the web interface and the back-end clustering?
Steve Moreton 00:52:51.724 It started in June 2014. As I said, our first type of visualization was being done while we were doing it internally on email. But then we moved to a US-based cloud system, which I showed you earlier—the more basic version. And in June 2014, I went to New York and met my client. And he said, “This is what we want.” And I said, “Hey, we could do that.” And at the time, I thought we could. I’d not really got involved in this before. And I quickly realized that what we had wasn’t fit for purpose. So the front-end is developed by a team I’ve worked with for quite some time. CJC’s been going since 1999. And the chap who runs the front-end company, called Corellasoft, is a good pal of ours. And he’d actually been developing this front-end—just the front-end—for a good couple of years. So I think he started around 2012 on the front-end side of it. So we partnered up there. We got the front-end sorted. But between June and September 2014, Influx and how we did the structure, that did not exist at all. We started to hire developers and partner up with development teams. And basically, they gave us the dump of data which we imported into Influx in September. And then basically, they loved it. They just loved it straightaway—the front-end, and Influx, and the databases. Just running on a laptop, in essence, I think it was. That was enough for them to move into a POC.
Steve Moreton 00:54:37.763 So that’s where we had to do the hard yards. And they were really accommodating, so they knew we had done this. It’s so important to have a client who’s willing. And we did it as a free POC, so we, obviously, had to invest money and time into it. Pre-production was in 2014, and we completed it in 2015. I think we went live early 2016. So it took us about 18 months. But obviously, parts of that were already pre-created. And we’re seeing other companies which are using—there’s Grafana, there’s [inaudible], and things like that. But we are a niche company, and we have to focus on the specifics of what our client base needs. And we’ve not found that they’ve really added—Grafana’s not [inaudible]. There’s a lot of work to re-jig it, where we’re kind of out of the box there. And the developers were quite [inaudible] with InfluxDB. So as soon as it went live, we’ve not really had a huge amount of issues. We’ve just been fine-tuning it. So I’d say it took us about two years to get it to where we were happy, where it needed to be.
Chris Churilo 00:56:04.308 Not bad. I mean, obviously, you guys would listen to what your clients needed, and that’s why you’re able to make so many quick changes. Scott asks, “What are your preferred Linux OS collectors? Are you using JVM collectors?”
Steve Moreton 00:56:17.436 I think that is correct. That is something I might need to—I’ve got my colleague, Pete, on the line. Have you got—?
Peter Williams 00:56:24.216 I can show you on mine. So it depends. We’ve got a sort of selection of collectors. Sorry, I’m getting an echo on the line. I’ve got to take my headphones out. Yeah, so we have a selection of collectors that we use. So the demonstrations you see today are actually using—they’re hanging off of a sort of bespoke monitoring platform, which is sort of the leader in our industry. But we can port from more generic collectors. But I can speak to the team, and we can get some more details about that.
Steve Moreton 00:56:58.927 Yeah. One of the things is it is quite complex. You’ve got the infrastructure. You’ve got the database. You’ve got the front-end as well. And so I’m the product manager, and so I have to know everything—a jack of all trades, master of none, as it were. So that might be something I can get a bit more information on. Happy to provide it.
Chris Churilo 00:57:18.716 All right, Scott. We’ll get you that information later on. I’ll get you connected with Steve and Pete. One more question. So are you using any of the other InfluxData projects, like Kapacitor?
Steve Moreton 00:57:32.169 No, not at the moment. Pete, again, you own the relationship with Influx, so perhaps you can give us some insight.
Peter Williams 00:57:38.995 Yeah. I can speak to that. So recently, for one of our other sort of efforts with our Dev team and our innovations group, we actually are speaking to Influx about trying to look at other elements of the TICK Stack in relation to actually monitoring and capturing metrics around Kubernetes clusters and looking at ways in which we can pull statistics and information using that. So we are actually about to embark on utilizing a wider array of Influx tools.
Steve Moreton 00:58:18.515 We are innovating. So obviously, you saw some of the CIMP stuff. One thing for our development team—and the industry as a whole in the capital markets—a lot of attention has been going to chatbots. So instead of having to type things in or just drag and drop things, we want to just be able to ask a question. And so our development team are working on that. We’ve got a lot of interest coming from places that we don’t currently reside in. And in the capital markets, sometimes you have regulations so data can’t leave the environment. So we’re using our system, and we’re working with our partners on this, but potentially, we might have to spin something up, potentially, in the US. So we are and have looked at and investigated things like InfluxCloud as well there. So the thing about having a product is you’re always 10% behind where specifically you want to be. You’re always growing and developing. And we’ve always got an eye on what the industry’s doing, what our partners are doing, and what they’re able to provide, which can enrich the product as well. It is a cloud-based service, so we’re on version 4.8.16 now. And every two weeks, we’ve got a new version of the front-end, because we’ve got so many clients coming to us saying, “Can you do this? Can you do that?” And then really, we’re always there trying to say, “Look, do we have to invent this? Or is there something out there that can do it already?” And this is one of the beautiful things about modern technology: you have the cloud, you can split things up, you can test things out quickly. There’s probably somebody out there who’s already thought of it. And you can just wrap it in very quickly.
Chris Churilo 01:00:04.044 We got another question that came in from Dimitro. So he asks, “Do you have KQIs or calculated values? If yes, when and how do you calculate them and store them?”
Steve Moreton 01:00:16.211 What we do is—I’ll just quickly show you something here. I did show you earlier. We have to make different databases. So we have the core database, and then we can kind of have offshoot databases where we put certain information from the core database and then wrap it up. So I showed you this one, where it’s just taking specific errors and so on. And then let me just go into the CIMP process here. So again, this is where we’ve got secondary databases which are running. So what we do is we look for specific things in the database which are above and beyond—we’re looking for certain numbers in there. And then we do a scan through the original database structure and create a new database. And we’re constantly being asked questions like—I’ll just go into another thing. So a client wants to look at a specific thing. So if you just refresh this. Again, they’re looking at new ways of interrogating the data. So generally, the client gives us a challenge. Like this client wants particularly this kind of view of the peaks. And in some ways, one of the challenges of the proposition is: is it a monitoring system, or is it a capacity management type of visualization system? Some of these things you can actually get at quite easily from a monitoring system. You don’t need to use this.
Steve Moreton 01:01:45.280 But the data’s there. So I’ll just wait for this to come in. So I’ll just get something—some interesting stats or something like this. I’ll just go into this one and get write rates. So I can say for all my servers, “Send me all the write rates coming from these specific servers.” And then what we do is we just interrogate it. And this is something we have to do at the database level. This isn’t something the front-end does. This is something which is synced in here. So you can see we’ve got a data source here. And then we’ve got all these different things that we can do. So yeah, we do do a lot of calculations in the database. And it brings client [inaudible]—as we come up with ideas, we just go to the development team and say, “Can you do this? Can you do that?” And it’s never a problem. It’s never a problem. Hopefully, that kind of answers your question.
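The derived-metric pattern Steve describes—scan the core series for values “above and beyond” and write the results into a secondary database—can be sketched as below. The threshold, series names, and plain-Python scan are illustrative; inside InfluxDB itself this is the sort of job a continuous query (or a Kapacitor task) would run.

```python
# Illustrative derived-metric scan: read points from a core series,
# keep the ones exceeding a threshold, and treat the result as a
# secondary "peaks" series of the kind described in the talk.
def derive_peaks(points, threshold):
    """Return the (timestamp, value) pairs exceeding the threshold,
    i.e. a derived series computed from the core data."""
    return [(ts, v) for ts, v in points if v > threshold]

# Hypothetical core write-rate data: (timestamp, writes_per_second)
core = [(1, 120), (2, 980), (3, 450), (4, 1430)]
peaks = derive_peaks(core, 900)   # the secondary "peaks" series
```

Storing `peaks` in its own database keeps the expensive scans off the core data while still letting clients interrogate the derived view.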
Chris Churilo 01:02:43.442 Yeah. So Dimitro just asked—let us know if that answered your question. It was a really great question. And I can see how with this kind of amount of data, you could definitely want to dig in, dig in and look at different views and aggregates. We have another question. Oh, Dimitro says, “Yes, thank you.”
Steve Moreton 01:03:00.313 Great. Happy to help.
Chris Churilo 01:03:02.240 Another question that we have is, “Did you modify the source code at InfluxDB or did you just use InfluxDB out of the box?”
Steve Moreton 01:03:09.580 I’m going to say no to that one. I really don’t think we did.
Peter Williams 01:03:14.486 No, we didn’t.
Steve Moreton 01:03:14.492 You know what development teams are like [laughter].
Peter Williams 01:03:21.272 No—from what I understand, we didn’t modify any of the code. But because of the relationship with Influx, we’ve been able to feed back to them about certain features and tweaks that we’ve wanted. And in many cases, they’ve agreed that these are enhancements, and they were actually important to the core product. So we utilized Influx as it came, but then we kind of fed back in [inaudible].
Steve Moreton 01:03:47.594 It just goes back to what we were saying earlier as well: generally, we like working with partners, so questions come in, sometimes we do a workaround, but then we sort of feed back into Influx’s [inaudible]. So Pete is my CTO, and he owns the relationship with Influx. So Pete has quarterly calls with Influx where he goes through a huge number of different topics. There are obviously other things that we can do and other things we work on as partners. But yeah, we have this really good relationship with Influx, so they sort of take feature requests and say, “Okay, this is going to be added to the roadmap,” and so on. We send them, obviously, the odd issue here and there. We send those over, and that’s factored into new releases and so on. So yeah, it’s great to be part of the community.
Chris Churilo 01:04:37.435 Yeah. And talking with Paul, I think a lot of your requests are reasonable, and they actually make sense for the larger community, not just for CJC. So makes it really easy to prioritize those things.
Steve Moreton 01:04:50.295 Great.
Chris Churilo 01:04:51.103 All right. One more minute. Any more last questions? The question about modifying the source code was appreciated. All right. Well, if you do have any other questions, no problem. Just shoot me an email, and I’ll forward it off to both Steve and Pete, and we’ll get those answered for you guys. And we’ll post this recording later on. And if you want to learn more about this Mosaic product, you can of course go to the CJC website. Alternatively, Steve will be speaking at InfluxDays, so you can actually meet him and Pete in person if you happen to join us in June in London at our event. And I really encourage everybody to do so. So with that, I think we are good with questions. Let me just do one more check. And I think we are. So thank you so much to our wonderful speaker and also our backup speaker. Thanks for joining us. And we hope you guys—we just got one more question. Scott says, “Thank you so much.” And we hope you guys have a great rest of your day.
Steve Moreton 01:06:07.906 I’ve just had my legal person just jump in as well, so I’ve just got to say you’re all under NDA now. Apparently, some of the screenshots I shouldn’t have shown [laughter]. But really appreciate the opportunity to talk about our relationship. Thank you so much again. And wish everybody a great evening wherever you are in the world.
Chris Churilo 01:06:31.137 Thank you. Bye-bye, everybody.
Peter Williams 01:06:32.665 Thanks.
Steve Moreton 01:06:33.458 Thank you.