How Sysbee Manages Infrastructures and Provides Advanced Monitoring by Using InfluxDB and Telegraf
Session date: Jun 30, 2020 08:00am (Pacific Time)
Discover how Sysbee helps organizations bring DevOps culture to small and medium enterprises. Their team helps customers improve stability, security, and scalability while providing cost-effective IT infrastructure. Learn how monitoring everything can improve your processes and simplify debugging!
Join this webinar as Saša Tešković and Branko Tošić dive into:
- Sysbee's introspection on monitoring tools over the years
- How TSDBs, and specifically InfluxDB, fit into improving observability
- Their approach to using the TICK Stack to improve the web hosting industry
[et_pb_toggle _builder_version=”3.17.6” title=”Transcript” title_font_size=”26” border_width_all=”0px” border_width_bottom=”1px” module_class=”transcript-toggle” closed_toggle_background_color=”rgba(255,255,255,0)”]
Here is an unedited transcript of the webinar “How Sysbee Manages Infrastructures and Provides Advanced Monitoring by Using InfluxDB and Telegraf”. This is provided for those who prefer to read than watch the webinar. Please note that the transcript is raw. We apologize for any transcribing errors.
- Caitlin Croft: Customer Marketing Manager, InfluxData
- Saša Tešković: Linux System Engineer, Sysbee
- Branko Tošić: Linux System Engineer, Sysbee
Caitlin Croft: 00:00:04.096 Hello everyone. My name is Caitlin Croft. Welcome to today’s webinar. I’m super excited to have Sysbee here, presenting on how they are using InfluxDB. Once again, if you have any questions, please feel free to post them in the Q&A box or in the chat. We will be monitoring both. And what I’m going to do, I’m going to hand it off to Saša and Branko of Sysbee.
Saša Tešković: 00:00:35.734 Hello everyone, and a big thanks to everyone who decided to tune in to our webinar. I’m Saša Tešković, and I’d like to begin by going through the day’s agenda. We’ll start with a brief intro about our company and what we do and then we’ll talk about our monitoring requirements. We’ll cover some interesting details about monitoring tools that we’ve used over the last 20 years and challenges we experienced along the way. Finally, we’ll talk about how today we monitor customers’ infrastructures using InfluxData products.
Saša Tešković: 00:01:15.355 So I work as a senior Linux systems engineer at Sysbee. I’ve been working in the web hosting industry as a systems engineer for the past 11 years and I have experience in designing and maintaining various types of hosting platforms, such as shared, VPS, dedicated server, and private clouds. I enjoy simplifying things for our customers, who are, in most cases, developers. For example, when designing a complex infrastructure which consists of many clustered services, I believe that it’s important to introduce an abstraction layer which allows developers to manage their application as if it was hosted on a single server. I’m a big fan of InfluxDB and Telegraf because they make my work so much easier, but you’ll be able to find out more on that in the following slides.
Branko Tošić: 00:02:10.388 Hi, all. My name is Branko. I’ve been doing system administration, mostly on Linux systems, since ‘98. Currently, I specialize in engineering and maintaining medium to large server systems primarily used for web application delivery. I’ve also been working in the web hosting industry for the past 15 years now and my passion is collecting and analyzing data. So naturally monitoring is my primary focus. I also like to evaluate and implement new monitoring tools and practices whenever and wherever possible. And besides that, I’m doing some internal tooling and system development in Python.
Saša Tešković: 00:03:01.219 So just a quick overview of our company. Our roots go way back to 2001 when our founding members entered the Croatian web hosting scene as Plus Hosting. In the beginning, the company offered only Windows shared hosting but later introduced a Linux shared hosting platform as well. Back then, the shared hosting infrastructure was based on rented physical servers in UK and US data centers. In 2006, we deployed a redesigned infrastructure using our own hardware in a data center in our capital city, Zagreb. With shared hosting still being our core business, in 2008, we introduced managed VPS and dedicated server hosting to meet increasing demand for more capable hosting solutions. In 2010, we added managed services to the portfolio as the need for tailored infrastructure solutions began to appear. Then, fast-forward a couple of years. In 2015, we joined DHH, which stands for Dominion Hosting Holding. DHH is a tech group that provides the virtual infrastructure to run websites, apps, e-commerce, and software-as-a-service solutions to more than 100,000 customers across Southeast Europe. DHH is listed on the Italian stock exchange and under its umbrella holds a number of hosting brands from Italy, Switzerland, Croatia, Slovenia, and Serbia. As mentioned earlier, Sysbee was established in 2018 to meet increasing demand for managed services. We recognize that some customers prefer to host their projects on-premises or with public cloud providers, such as AWS, Google Cloud, and DigitalOcean, so we are proud to say that we are platform agnostic, meaning that we can assess, design, and maintain infrastructure hosted with different providers and not only on our own hardware.
Saša Tešković: 00:05:01.036 To give you a better idea about our managed services, I’d like to mention a few of them. We offer infrastructure assessment, a service aimed at customers who want a complete assessment of their infrastructure in order to find out where and how they can improve performance, stability, and security, and optimize infrastructure costs. Our managed infrastructure service covers everything from infrastructure design to continuous monitoring and maintenance with 24/7 technical support. This means that we are collaborating with our customers from the very beginning of their project or idea in order to identify the customer’s requirements and goals. At the same time, we’re also collaborating with developers working on the project to meet their requirements and provide them with a developer-friendly environment in which they don’t have to deal with the complexity of the infrastructure that hosts their application. Our managed AWS is basically the same as the managed infrastructure service with special emphasis on services in the AWS ecosystem, managing security of AWS accounts, as well as optimizing AWS costs. Besides services, we also offer a couple of managed products. For example, Magento optimized hosting, which is an ideal hosting solution for medium to large Magento store owners. Hosting plans come with a preinstalled toolkit for Magento developers and are preconfigured for the best possible performance. Managed GitLab is a relatively new addition to our products portfolio. It’s ideal for companies that require a dedicated GitLab instance, but don’t want to deal with maintenance, backups, and managing security of their GitLab server, or simply have special requirements that the gitlab.com service can’t fulfill. For example, to choose the specific country where the GitLab server must be hosted or to ensure that the GitLab server is reachable only via a VPN connection.
Saša Tešković: 00:07:10.032 A few words about our typical clients. They’re usually small to medium businesses whose projects vary from e-commerce sites, news portals, and API servers to software-as-a-service projects. For example, ERP solutions, event ticketing services, etc. Our clients’ projects can generally be grouped into three categories. Small projects, which in most cases, have one to three standalone virtual or physical servers. For example, database, application, and caching servers. Then, we have medium projects, which have more than three servers, of which two or more are generally placed behind a load balancer. In some cases, the most critical components of the infrastructure are redundant. And finally, we have large projects, which consist of five or more servers, with high availability and fault tolerance, since all parts of the infrastructure are redundant and clustered. In most cases, such large projects also feature auto-scaling, where part of the infrastructure automatically scales, depending on the resource usage. All right. That was a short intro about our company. Now, I’d like to pass the mic to my colleague Branko, who will tell you an interesting story about our monitoring systems and how they evolved over time.
Branko Tošić: 00:08:36.312 Okay. Thank you, Saša, for a great introduction. I hope that this gave a clear picture of who we are and what we do, as well as what our typical clients are. So keeping in mind that we are mostly dealing with web services and managing web servers, our monitoring requirements are aligned with that. On the other hand, we aren’t doing web application development or maintenance. So we are in a bit of a pickle when it comes to monitoring application health. Nevertheless, we still like to add value in our support and maintenance plans. So this will usually mean that we will try to monitor our client’s application health indirectly. Nowadays, it goes without saying that most of the serious websites require a near 100% uptime. So we also expect our monitoring stack to have some form of alerting and trend tracking, which can help us to predict failures, or at least mitigate them, as soon as possible.
Branko Tošić: 00:09:51.590 As Saša previously mentioned, we also do remote system assessments. And there are some cases where our clients will opt in for one-week metrics collections. These help us to better understand and analyze their infrastructure workloads, so we can finally give them a better suggestion for optimizing their systems. So in those scenarios, we really need an easy-to-use metrics collection system that will not affect the system too much. I think that you all know where this is going, but allow me to tell you a short story about the history and the evolution of our monitoring stack, so let’s go [laughter].
Branko Tošić: 00:10:39.501 Our story begins back in 2001, and in those days, as Saša mentioned, we were starting as a small company, then known as Plus Hosting. Our primary focus was only a regional shared hosting market. And at the time, we were counting our servers on the fingers of one hand. What’s worth keeping in mind is that, for our region at least, during those years internet access was just starting to get a foothold in people’s homes and they used the internet mostly for reading their emails, very occasionally, once or twice a day, and smartphones were a very distant future. I would also dare to say that this was mostly an offline era for online services. We could argue that the uptime requirements back in the day were not as demanding as they are today. And where there is no demand, there is no supply. So in that regard, monitoring software was scarce and hard to find or configure properly. With the introduction of Linux servers in our portfolio, we slowly started to implement MRTG with some basic metrics collection via SNMP. And other than that, we had a handful of custom-made scripts that would be executed via a remote system and then that would trigger basic service alerts that were dispatched via email.
Branko Tošić: 00:12:21.732 In 2005 and 2006, several changes were happening. First of all, internet in our region suddenly became available to more people. There was a clear boom of websites and that meant more business for us. But somewhere around that period, GIGRIB was introduced as our first public website monitoring service. You may know it today as Pingdom. Their premise at the time was very interesting. You would install a P2P client on your computer that would monitor other sites and you would gain credits that you could exchange for monitoring your own website with HTTP or ICMP checks. So what that meant was that even though smartphones were still not available, and this was still, you could call it, an offline era, enthusiasts were starting to monitor their websites for uptime and then comparing web hosters against each other. For us as a web hoster, this was a good incentive to step up our game in server monitoring. Because trust me, it’s rather embarrassing to be notified by the customer that your server or service is offline.
Branko Tošić: 00:13:47.951 Luckily, Nagios started gaining popularity around 2005. So naturally, we had to investigate it and put it to good use. You may call me old-fashioned, but I still see great value in Nagios dashboards, which provide a quick glance over a large number of monitored services. And on top of that, to this day, I still haven’t found as good an alerting manager that will handle scheduled downtimes, alerting rules, routing, and escalation procedures. Even though it was very good, Nagios didn’t cover all of our monitoring requirements. So, for example, long-term metrics for trend analysis. Yes, there is some performance data collected with some of the Nagios plugins, however, there is no real history or graph representation with the default setup.
Branko Tošić: 00:14:48.822 So somewhere around 2007, we started using Munin. And this tool came bundled with cPanel, which we were using as a primary shared hosting platform at the time. We were instantly hooked and we began to extend it with plugins to collect even more metrics, and we started deploying it to non-cPanel servers as well. This certainly gave us some more insights into the history and trends on each of those hosts, but there was also this usability issue of decentralized monitoring information. Basically, each server collected its own information and stored it, so you would have to go from server to server to see how each of them is performing. You didn’t have any possibility to compare hosts and metrics easily, and dashboards were mostly predefined with only slight modifications possible. At some point in time, we also deployed a central Munin server, where you could aggregate all metrics. However, as our server count grew, this failed miserably. The main issue was the disk I/O and the CPU on the central Munin server, which was used to store and render the graphs for that large amount of metrics.
Branko Tošić: 00:16:18.161 So a few years passed, and somewhere around 2012 or 2013, we replaced Munin with Ganglia. At the time, Ganglia was very well-established. It had support for rrdcached, and it had an easy-to-use web interface that centralized all the metrics in one place. What you have to bear in mind is that during this period, Grafana still wasn’t released publicly and the Ganglia web interface was full of features. You could search metrics. You could create your custom dashboards. You could create metrics comparison graphs and time shifts, where you can compare a metric to itself in the past, or you could organize hosts and services, and so on. Collecting metrics with Ganglia was also very easy. You could extend its gmond collector with custom-made plugins, or you could use its gmetric to push custom metrics even from the simplest [inaudible]. Suddenly, we were collecting well over 600 to 1,200 metrics per server and we were storing them in RRD.
Branko Tošić: 00:17:39.917 Disk I/O was still an issue, but with rrdcached it was not as pronounced as with bare Munin. The disk I/O writes would balance out over time and the server could manage to store even more metrics. On the other hand, I must say, Ganglia wasn’t the easiest system to configure or maintain for that matter. There were some issues with the plugins and their bugs, where we would lose all or some of the metrics for some server or service. This was naturally bad and we had to develop some kind of a meta-monitoring system so that we could catch those issues early on. Besides that, we were forced to better standardize some other unrelated things and to organize our servers into monitoring groups or create custom Puppet modules. And all this gave us much more insight into the operations. We were more agile in resolving issues and even preventing them a bit before they happened. And for that reason, Ganglia still brings warmth to my heart.
Branko Tošić: 00:19:05.102 2014 was, again, a year of big change. Grafana was released that year. But unfortunately for us, Ganglia, or RRD, where the data was stored, was not one of the data sources supported by Grafana. Luckily, there were some workarounds, like using Graphite web components just to read the metrics from RRD files and then serve as a bridge. And up until then, we had invested a lot of time and effort into organizing our monitoring system. So we were very, very dependent on it. And switching databases at that point in time was almost mission impossible. So naturally, at first, we invested some time to make that bridge happen between Ganglia and Grafana since this was the fastest and easiest way for us to get metric dashboards. However, Grafana introduced us to new types of storage engines, mainly time series databases, and naturally, this led us to explore InfluxDB back in 2015.
Branko Tošić: 00:20:14.496 So as you can imagine, that enormous monitoring stack that I was describing before required a lot of moving parts to be configured. First of all, there was Ganglia that needed to be installed and configured on each host. Then, there was Nagios that was monitoring those hosts. We had a mindset of not collecting the same metrics twice, so we developed one middleware called gnag, and this middleware served as a bridge between metrics that were already collected by Ganglia and Nagios service checks. So by using gnag, we could alert through the Nagios alert manager on metrics already collected within Ganglia. And we could also use this for meta-monitoring of the collection system. So in case we hit some of those plugin bugs, we would know early on.
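The bridging idea described here can be sketched in Python. This is only an illustration of the general pattern, not Sysbee's gnag: the function name, its parameters, and the 120-second staleness window are all assumptions. The only fixed convention it relies on is the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
import time

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_metric(value, collected_at, warn, crit, max_age=120, now=None):
    """Map an already-collected metric sample to a Nagios status and message.

    Hypothetical helper: the names and the default staleness window are
    illustrative assumptions, not part of any real tool.
    """
    now = time.time() if now is None else now
    # Meta-monitoring: a missing or stale sample suggests collection broke.
    if value is None or now - collected_at > max_age:
        return UNKNOWN, "UNKNOWN - metric missing or stale"
    if value >= crit:
        return CRITICAL, f"CRITICAL - value={value}"
    if value >= warn:
        return WARNING, f"WARNING - value={value}"
    return OK, f"OK - value={value}"


if __name__ == "__main__":
    code, message = check_metric(7.5, time.time(), warn=5, crit=10)
    print(message)
```

A real check script would print the message and exit with the returned code so that Nagios can pick up the state. Treating a stale sample as UNKNOWN is one way to surface collector bugs without collecting the metric a second time.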
Branko Tošić: 00:21:10.148 To manage this part of the configuration, we actually developed a few custom Puppet modules. You have to realize that, at the time, we had a vastly diverse infrastructure, where each client had its own set of requirements, different software stacks, and different kinds of configuration. So naturally, we had to adopt Puppet rather slowly and carefully. To better automate the configuration of our monitoring system, we created a specialized Puppet inventory module that could detect installed services on each host. It would then produce some custom facts, based on the running services, which would then be used to configure monitoring of those services. So having Puppet configured in such a fashion meant that we could even manually install software on servers and that software would be detected and monitored the next time Puppet applied a new configuration set. In the end, we were left with the manual configuration of Graphite as a middleware between Ganglia and Grafana.
Branko Tošić: 00:22:27.182 All the hard work of configuring every piece of the monitoring stack and the automation stack paid off. So we had everything packaged up, easy to use and maintain, and we had a clear view on how to expand our metrics collection and alerting configuration. So during this period, we also had a big boom of sales in our VPS and dedicated servers, with a lot of custom solutions and clients requiring better uptimes and support. I must say that our entire team had peace of mind knowing that the software they deployed for the customer, even on an ad hoc basis, would be automatically monitored. So it improved the agility in deploying new servers and acquiring new clients, while still providing invaluable insights into system operations for our end customers.
Branko Tošić: 00:23:28.581 So as I mentioned earlier, Influx came onto our radar with the release of Grafana. And there was this cool new data source available, Influx, yeah, and this tickled my curiosity. So our first InfluxDB installation was 0.9, back in the day when Influx stored all the data in a LevelDB storage engine, and I’ll get back to that later. So what was very nice about InfluxDB at the time is that it supported the Graphite write protocol out of the box, and this meant that our Ganglia installation could mirror all the data that we were already collecting to a remote Graphite cluster. In this case, Influx. So it was time to put Influx through its paces. To be perfectly honest, I was expecting disaster, because we were pushing well over 400,000 metrics every 15 seconds, and we had already done this test with native Graphite on hardware similar to what we used for Ganglia at the time, and it failed this test miserably.
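For readers unfamiliar with the Graphite write protocol mentioned here, its plaintext form is one sample per line: a dotted metric path, a value, and a Unix timestamp in seconds. The Python sketch below formats and sends such lines. The metric path is made up, and the host and port 2003 (the conventional Graphite plaintext port) are assumptions about the listener, not details from the talk.

```python
import socket
import time


def graphite_line(path, value, timestamp=None):
    """Format one sample as 'path value timestamp', newline-terminated."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"{path} {value} {ts}\n"


def send_lines(lines, host="127.0.0.1", port=2003):
    """Send pre-formatted lines over TCP to a Graphite-compatible listener.

    The host and port are assumptions; adjust them to the real listener.
    """
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall("".join(lines).encode("ascii"))


if __name__ == "__main__":
    # Hypothetical metric path, printed rather than sent.
    print(graphite_line("servers.web01.load.one", 0.42), end="")
```

Because the format is this simple, any collector that can already emit Graphite lines can be pointed at a compatible listener without changing what it collects.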
Branko Tošić: 00:24:40.395 To my surprise, Influx handled this exceptionally well, both on the disk I/O and CPU front. I would dare to say it was even better than Ganglia with rrdcached. We left this setup to collect metrics for a couple of days, but it soon became evident that we would need much more storage if we wanted to store those metrics for the same period of time as we did in Ganglia. LevelDB, the storage engine that was used at the time, wasn’t really up to the task of compressing the time series data as efficiently as RRD was. So it was a win some, lose some. You would get less disk I/O but you would have to pay with more storage. This fact, alongside some other projects taking most of our focus, left us with little to no time to invest in a full monitoring system reconfiguration. And I must say that, in retrospect, this was a good thing. Because Influx evolved through, if memory serves me well, two or three storage engines before finally settling on the current storage engine, which would eliminate that huge storage consumption that we observed at the start with LevelDB.
Branko Tošić: 00:26:07.215 From this testing period, I would also like to give a special shout-out to Telegraf and its ease of setup and configuration. It was an absolute breeze to set it up and configure it compared to any previously used software. So even though at the time we didn’t commit fully to our monitoring stack reconfiguration, we had a test InfluxDB running and accepting some metrics and we were upgrading it over time to follow up on the development process. Our minds were already set on changing the existing monitoring setup, but we were just waiting for the right time to do it and overhaul everything on a larger scale. We started using InfluxDB in production as our primary driver somewhere around early 2017. We did also consider some other time series databases. However, InfluxDB provided some key values that were a better fit for our use case. It’s open core, meaning it’s basically open-source, with some paid features and support. So this gave us a sense of security that there is paid professional support if we would ever need it. It also provides high availability support in the paid version. So if we would ever require something like that, we can use it. For our business use case, what was really, really nice was the support of multiple databases per server, and this was a great benefit in building a multi-tenant system. So we have quite a few different clients with different use cases and isolating them into separate databases helped us keep tabs on the overall database usage per client.
Branko Tošić: 00:28:10.616 From a technical standpoint, we really clicked with the push model of gathering metrics. And also, as mentioned before, Telegraf was very easy to deploy. It was just a single binary collector, as opposed to many other solutions out there requiring you to configure and monitor multiple processes that will collect metrics. Influx also plays very nicely with others and it tries to be compatible with other monitoring tools. This opened up the opportunity of quickly testing InfluxDB alongside our previous monitoring setup, and I appreciate this very much. But it also opens up possibilities, for example, to take some data from Prometheus exporters and feed it to InfluxDB or vice versa. Out of all the other time series databases we evaluated at the time, Influx was the only one that offered numeric and string data values, as well as data rollups. That’s something that we weren’t used to from using RRD files in previous monitoring configurations. Last but not least, Kapacitor is also a very powerful tool in the stack. I only regret that we don’t use it much and hopefully, this will change soon.
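The mix of numeric and string values mentioned here shows up directly in InfluxDB's line protocol, the text format used when pushing points. The following Python sketch builds one line by hand to show its shape. It is a simplified illustration (it only handles the basic escaping rules), and the measurement, tag, and field names in the usage line are made up.

```python
def _escape(text, chars):
    """Backslash-escape each character in `chars` that appears in `text`."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def _field_value(value):
    # bool is checked before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"  # integers carry an 'i' suffix
    if isinstance(value, float):
        return repr(value)
    # Strings are double-quoted; escape backslashes first, then quotes.
    return '"' + _escape(str(value), '\\"') + '"'


def to_line(measurement, tags, fields, timestamp_ns=None):
    """Build one line: measurement,tag=value field=value [timestamp]."""
    line = _escape(measurement, ", ")
    for key in sorted(tags):
        line += f",{_escape(key, ',= ')}={_escape(tags[key], ',= ')}"
    line += " " + ",".join(
        f"{_escape(key, ',= ')}={_field_value(value)}"
        for key, value in fields.items()
    )
    if timestamp_ns is not None:
        line += f" {timestamp_ns}"
    return line


if __name__ == "__main__":
    # Hypothetical point mixing a float, a string, and an integer field.
    print(to_line("cpu", {"host": "web01"}, {"usage": 12.5, "state": "ok", "cores": 4}))
```

In practice Telegraf or a client library produces these lines, so hand-building them is mainly useful for understanding what ends up on the wire.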
Branko Tošić: 00:29:44.986 So in our environment, Telegraf and InfluxDB are the powerhouses of our new monitoring system. Just as a quick overview for those that aren’t familiar with the full TICK Stack, here is a quick diagram that I borrowed from the main Influx site. On the left, we have the collector, or Telegraf, that collects the majority of metrics and pushes them to InfluxDB for long-term storage. On top of that is Chronograf for data visualization and exploration, as well as interaction with Kapacitor. Kapacitor, on the other hand, is, as I like to call it, a Swiss army knife. It can be configured for various tasks, either for data downsampling or transforming the Influx push model to a pull model so you could pull metrics from other systems. It can also do anomaly detection and alerting. For us, the visualization platform of choice was Grafana. And for alerting, we are still actually using one of the Nagios forks with some bridging middleware similar to what we used with Ganglia. But we are considering replacing this with Kapacitor in the future. We just haven’t yet managed to configure it the way we have our current alerting system configured, so that we can retain our current alerting rules and flexibility.
Branko Tošić: 00:31:26.930 So this leads me to Telegraf. And what’s very nice with Telegraf is that it has a large number of built-in plugins for collecting metrics. But what I love about Telegraf is the ease of extending metrics collection. So, for example, on one specific project, we were tasked to configure and monitor a PowerDNS service. At the time when we were doing this, PowerDNS was not one of the built-in plugins available in Telegraf. Today, you have a native Telegraf plugin for collecting metrics for PowerDNS, so I really wouldn’t advise you to use anything that I’m about to show you. However, for this demonstration of extending metrics, it will serve its purpose. So if you do happen to find yourself in a situation where you have to collect something but there is no native implementation, you can use this exec plugin. As the name suggests, this plugin will execute your defined command and it will accept the data in the specified input format and insert it in InfluxDB. Now, all you have to do is create this executable. So in our example, this was a simple case of [inaudible] page from PowerDNS and transforming that data into valid JSON that Influx would understand. So by using just those few lines of code and configuration, you end up with a bunch of metrics that you can organize in dashboards like this. And dashboards like this give you insights into internal operations of your service, in this case, PowerDNS, which could one day be crucial for debugging any potential issues in production. So it’s very important to set up your monitoring early on just to get the baseline performance data so that later on when you’re hit with a problem you can go back and compare the data.
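The executable described here is not reproduced in the transcript, so the following Python sketch only illustrates the general pattern: read some statistics text, keep the numeric values, and print a flat JSON object that Telegraf's exec input can parse when its data format is set to JSON. The comma-separated key=value input and the sample values are assumptions, not the speaker's actual script or PowerDNS's actual output.

```python
import json


def stats_to_json(raw):
    """Turn 'key=value,key=value' text into a flat JSON object of numbers."""
    metrics = {}
    for pair in raw.strip().strip(",").split(","):
        key, sep, value = pair.partition("=")
        if not sep:
            continue  # skip fragments that have no '='
        try:
            metrics[key.strip()] = float(value)
        except ValueError:
            continue  # non-numeric values are dropped in this sketch
    return json.dumps(metrics, sort_keys=True)


if __name__ == "__main__":
    # Made-up sample input; a real script would fetch it from the service.
    sample = "udp-queries=1500,udp-answers=1498,latency=12,"
    print(stats_to_json(sample))
```

The exec plugin runs the command on each collection interval and reads its standard output, so the script only needs to print one JSON document and exit.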
Branko Tošić: 00:33:47.445 Just as a quick showcase of some standard plugins and the data that you can gather with them, we have just a really simple plugin configuration, which will collect data from this local Redis service. And it’s basically two lines of code if we ignore the comments. And when you collect all the data, you can then organize it into dashboards like this. So bear in mind that there are many other metrics available with this plugin, so the final design and how you wish to see the data is totally up to you. Here is a similar configuration just for network-related metrics on standard Linux hosts and the graphs representing that collected data.
Branko Tošić: 00:34:35.800 As I was mentioning before, one of the strengths of Influx is interoperability with other monitoring systems and their protocols, and I was also mentioning that we do passive web application health monitoring, so let me just quickly fuse those two examples. Telegraf supports the StatsD protocol, and StatsD is widely used in many frameworks and programming languages for collecting in-app performance metrics. So what you would usually do is you would instrument your code functions with StatsD. So, for example, when you’re entering the function, you would start a timer, and then when exiting the function you would take another timestamp, and you would push the total time spent within the function as a timer to StatsD. Or, for example, you could use a counter and increment it just to see how many times the function or part of the codebase was invoked. It’s totally up to you as a developer. And we, on the other hand, in ops use it a little bit differently. So in the case where we have PHP applications hosted, we can easily install the APM PHP extension and then configure it to ship metrics to the local StatsD interface, or in this case, Telegraf. A quick word on the APM extension. This will gather various PHP performance data about running PHP processes. And let me quickly show you what the end result is.
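The timer and counter instrumentation described here can be sketched in Python using the plain StatsD wire format (`name:value|ms` for timers, `name:value|c` for counters) sent over UDP. The helper names, the metric names, and the assumption that a StatsD listener such as Telegraf's is on localhost port 8125 (the conventional StatsD port) are illustrative and do not come from the talk.

```python
import socket
import time


def statsd_timer(name, milliseconds):
    """Format a StatsD timer sample."""
    return f"{name}:{milliseconds}|ms"


def statsd_counter(name, amount=1):
    """Format a StatsD counter increment."""
    return f"{name}:{amount}|c"


def send(payload, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send; the host and port are assumptions."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload.encode("ascii"), (host, port))


def timed(name, func, *args, send_fn=send, **kwargs):
    """Call func, then report its duration and one invocation."""
    start = time.perf_counter()
    try:
        return func(*args, **kwargs)
    finally:
        elapsed_ms = int((time.perf_counter() - start) * 1000)
        send_fn(statsd_timer(name + ".time", elapsed_ms))
        send_fn(statsd_counter(name + ".calls"))


if __name__ == "__main__":
    # Print the payloads instead of sending them over the network.
    timed("app.render", sum, [1, 2, 3], send_fn=print)
```

Passing `send_fn=print` keeps the example runnable without a listener; in real use the default UDP sender would be left in place.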
Branko Tošić: 00:36:16.401 So with previous configuration sets and [inaudible] time, we are now presented with the indirect PHP health information on how the application is performing. We can see what the execution times are or how much CPU time is used by PHP, what the distribution of response codes is, or what the distribution of errors is. So in some cases, this can prove to be very valuable information. For example, we can start tracking return status codes and timings. Then, you can create alarms to warn you when those metrics reach certain thresholds. Also, for us in operations, having this kind of information early on, along with its history, means that when we are doing any kind of system modification or system upgrade, we can quickly turn to those graphs to see if there were any bad performance implications or if the application started to return any kind of errors, without even touching the application. On the other hand, if there are no system causes for those alerts or misbehaving metrics, we can then forward this information to our end clients, so the developers can investigate this further, be it an error in some part of the code, a deploy, or anything like that.
Branko Tošić: 00:37:54.741 Similar to PHP, we also gather HAProxy metrics. We use HAProxy a lot, even on single servers. And here we can see and correlate similar metrics but in a different part of the stack. So we can see how the requests are propagating through the system and then correlate the data with some errors or some slow timings. Also, in this example slide, we can see this vertical annotation point that developers or system administrators can use to set markers for any major changes, be it a codebase or infrastructure change. So this way, we have a clear visual clue on the exact times when something changed and what effect it had on the infrastructure or the application itself.
Branko Tošić: 00:39:00.099 So let’s go a little bit modern. A quick disclaimer. Even though we don’t really do Kubernetes deployments in the form of managed Kubernetes for clients, we do use it in some capacity for our own needs. And when we are talking Kubernetes, people usually associate monitoring of those clusters with some alternate time series databases. However, it is not that hard to set up cluster monitoring with Influx and Telegraf as well. So here you can see snippets of the general configuration for a Telegraf DaemonSet and the ConfigMap for its deployment. More extended examples can be found in the link in the slide. I was told that these slides and the presentation will be shared later on, so you can refer to it later. For our own environment, we did some modifications to those deployments, but the results and the premise are the same. You’re basically deploying the DaemonSet into the Kubernetes cluster, meaning that Telegraf will be running on each host in the cluster, collecting localhost node metrics, just as with any other Linux host. And on top of that, you can also collect Kubernetes-specific metrics via its own API or Docker-based metrics. As a result, you will end up with aggregated metrics about the containers and namespaces, as well as per-node metrics where you can then filter utilization per namespace and/or [inaudible].
Branko Toić: 00:40:49.371 So how do we organize our data? In total, our Influx servers serve over 8K writes per second, collecting well over 130,000 series. Each client has its own Influx database; some of them run on their own dedicated InfluxDB servers and some share one or more servers. It all depends on the resource usage and the level of isolation that the client requires. By organizing clients into these separate databases, we can then expose to each end client a database containing only its own data. This is doable either by configuring different Grafana organizations or separate Grafana installations altogether. For our own needs, it’s very hard to traverse many different Grafana installations. So the Grafana dashboards in our centralized place have the variable structure shown in the picture. We can then quickly switch between the data sources, or in this case, client databases, and their retention policies, and then filter by hosts or other host properties.
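The per-client layout can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the client names, server addresses, and database names below are hypothetical. The only InfluxDB-specific part is the 1.x HTTP `/write` endpoint with its `db`, `rp`, and `precision` query parameters.

```python
from urllib.parse import urlencode

# Hypothetical mapping of clients to InfluxDB servers and databases.
# Some clients share a server; others get a dedicated one.
CLIENTS = {
    "client-a": {"server": "http://influx-shared-1:8086", "database": "client_a"},
    "client-b": {"server": "http://influx-dedicated-b:8086", "database": "client_b"},
}


def write_url(client, retention_policy=None, precision="s"):
    """Return the InfluxDB 1.x write URL for one client's own database."""
    target = CLIENTS[client]
    params = {"db": target["database"], "precision": precision}
    if retention_policy:
        # Without "rp", InfluxDB uses the database's default retention policy.
        params["rp"] = retention_policy
    return f'{target["server"]}/write?{urlencode(params)}'


if __name__ == "__main__":
    print(write_url("client-a", "autoscale_7d"))
```

Keeping the client-to-database mapping in one place mirrors the dashboard variables described above: choosing a client selects a database, and the retention policy and host filters narrow things down from there.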
Branko Toić: 00:42:14.222 So this would be the resulting landing dashboard, which contains some general host information. On the right, there are drill-down links to dashboards displaying more specific data, be it system-wise or service-wise. So by selecting the desired values in the variables above, we can follow all the links to other dashboards, preserving those values, and quickly move between different points of interest. For example, Redis or the network dashboard [inaudible], as I was explaining previously.
Branko Toić: 00:42:51.052 To keep the disk usage at bay, we are using several retention policies for the incoming data. For example, all the autoscaling host data is kept for seven days, as the scaling events will create and destroy hosts with unique hostnames, which would then easily clutter the dashboards and the total series count. Also, some less-used, high-cardinality metrics that don’t have any long-term value, at least for us (system interrupts, for example), are split into shorter retention policies. At the moment, we don’t do data downsampling, as we can accommodate a year’s worth of data on disk without modifying its precision. But it would be great if we could lower the disk usage by downsampling data for long-term storage. We are waiting for the intelligent roll-ups feature to be released, as this would ease the downsampling process and the creation of the data, but this is not the only blocker for us. There are also some other Grafana-related feature requests for the InfluxDB data source that will help with this as well.
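As a rough illustration of the retention setup just described, the Python sketch below generates InfluxQL `CREATE RETENTION POLICY` statements and picks a policy for each incoming point. The seven-day and one-year durations come from the talk. The policy names, the 30-day duration for short-lived metrics, the `autoscaling` tag, and the measurement list are assumptions made for this example.

```python
# Measurements assumed to have no long-term value (hypothetical list).
SHORT_LIVED_MEASUREMENTS = {"interrupts", "soft_interrupts"}


def create_retention_policy(database, name, duration, default=False, replication=1):
    """Return an InfluxQL statement that creates one retention policy."""
    stmt = (f'CREATE RETENTION POLICY "{name}" ON "{database}" '
            f'DURATION {duration} REPLICATION {replication}')
    if default:
        stmt += " DEFAULT"
    return stmt


def pick_retention_policy(measurement, tags):
    """Choose the retention policy a point should be written to."""
    if tags.get("autoscaling") == "true":
        return "autoscale_7d"   # short-lived hosts with unique names
    if measurement in SHORT_LIVED_MEASUREMENTS:
        return "short_30d"      # high cardinality, little long-term value
    return "one_year"           # everything else, kept at full precision


if __name__ == "__main__":
    for name, duration, default in [("one_year", "365d", True),
                                    ("autoscale_7d", "7d", False),
                                    ("short_30d", "30d", False)]:
        print(create_retention_policy("client_a", name, duration, default))
```

In a Telegraf-based setup the same routing is more likely to be expressed in the output configuration than in application code. The function is only meant to show the decision being made.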
Branko Toić: 00:44:09.143 So the last question is, what does the future bring? Well, we are quite excited about Influx 2.0 and some of the new features that it brings. We are still not using it. Sure, this will mean evaluating our current setup and redesigning it a bit, but at this point, we are used to that, and we have the infrastructure in place so that we can easily switch databases. What we are also very excited about is the new Flux query language, which will hopefully bring us easier-to-configure Kapacitor scripts and evaluations for alerting. One other thing that we would like to explore in the future is better anomaly detection and prediction, be it natively via Kapacitor or via some kind of AI, like Loud ML. And as always, we will be expanding our collected metrics set even more. We are true believers that there is no such thing as too many metrics. And thankfully, we have Influx here to store all those metrics, with peace of mind about how and where our data is stored. So thank you very much for your attention. This is the end, so we are open for questions.
Caitlin Croft: 00:45:41.813 Thank you so much. Loved your presentation. I particularly liked the timeline showing all the different monitoring tools that you have used over the years. And I love all the bees that you guys have in your presentation. I think that’s very fun. So while people think of their questions, I just want to give everyone another friendly reminder. Last week, we had the European edition of InfluxDays for 2020, and we’re super excited to be offering the North America edition of InfluxDays in November. There will be Flux training prior to the conference and then the conference itself. And of course, everything will be held virtually again. Call for Papers is open. And since this is a virtual event, it’s really fun to see people from around the world coming. So if you’re concerned about the timing, we can definitely figure that out. Please feel free to submit to the Call for Papers. We’re super excited to get started reviewing the submissions. It’s a really great opportunity to get to meet different InfluxDB users from around the world. So we have a question here. Is there any way, hard or easy, to migrate data from Nagios RRD into InfluxDB, even including some intermediary steps?
Branko Toić: 00:47:29.092 So if I understand this question correctly, it is about moving existing data from RRD to InfluxDB, or even data collected from Nagios performance data?
Caitlin Croft: 00:47:41.690 Yes. Yep.
Branko Toić: 00:47:42.457 We did explore this at the time when we were evaluating InfluxDB alongside our then-current monitoring setup. Basically, RRD has its own API, available in different programming languages. You can extract the data from RRD and feed it to InfluxDB. But you will have to write some kind of middleware that reorganizes the data from the RRD format into the InfluxDB format, because obviously, they are not the same. We did explore this possibility for a short period of time because we were not very keen on losing all the data that we already had. However, luckily for us, we had Ganglia in place at the time, so we could quickly abandon that thought because it seemed too complex for us. So we were just mirroring metrics from the existing monitoring configuration to InfluxDB using Ganglia. So we had some kind of a big [inaudible] also in InfluxDB. [crosstalk].
Caitlin Croft: 00:49:08.281 George, if -
Branko Toić: 00:49:09.361 It is possible, but it’s not that easy.
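The middleware step mentioned in this answer could look roughly like the Python sketch below. It is not something Sysbee describes building. It assumes the RRD data has already been fetched in the shape that the Python `rrdtool` bindings return from `rrdtool.fetch`: a `(start, end, step)` tuple, a tuple of data-source names, and a list of rows. It also assumes that the first row corresponds to `start + step`, which should be verified against your own files. Tag keys and values are assumed to need no escaping.

```python
def rrd_fetch_to_line_protocol(fetch_result, measurement, tags):
    """Convert an RRD fetch result into InfluxDB line protocol strings.

    Timestamps are emitted in seconds, so the lines should be written
    with precision=s.
    """
    (start, end, step), ds_names, rows = fetch_result
    # Sorted tags give a stable, reproducible series key.
    tag_str = "".join(f",{key}={value}" for key, value in sorted(tags.items()))
    lines = []
    for index, row in enumerate(rows):
        timestamp = start + (index + 1) * step  # assumed row alignment
        # RRD stores unknown samples as None; skip those fields.
        fields = ",".join(
            f"{name}={float(value)}"
            for name, value in zip(ds_names, row)
            if value is not None
        )
        if fields:
            lines.append(f"{measurement}{tag_str} {fields} {timestamp}")
    return lines


if __name__ == "__main__":
    # Hand-written sample standing in for a real rrdtool.fetch() result.
    sample = ((1593500000, 1593500900, 300),
              ("load1", "load5"),
              [(0.5, 0.4), (None, None), (1.0, 0.75)])
    for line in rrd_fetch_to_line_protocol(sample, "load", {"host": "web1"}):
        print(line)
```

The harder part in practice is deciding how each RRD file and data source maps to measurement, tag, and field names. RRD also consolidates older data into coarser archives, so the migrated history will not have uniform resolution.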
George [attendee]: 00:49:14.813 Thank you very much.
Caitlin Croft: 00:49:15.546 Okay. George, yeah, I unmuted you. So if you have any further questions for the guys feel free to expand upon your question.
George [attendee]: 00:49:24.384 No, no, that’s all. Thank you.
Caitlin Croft: 00:49:28.246 Okay. Great. Yeah. I really loved your presentation. I thought you guys did such a phenomenal job going through all the different monitoring tools that you guys have used over the years and figuring out what was the best solution for you guys, given all the data that you guys were collecting.
Saša Tešković: 00:49:47.818 Thank you.
Caitlin Croft: 00:49:50.321 Let’s see if there’s any other questions. So have you guys looked more closely at Flux? When do you think you guys might consider looking at 2.0 and looking more at Flux and Kapacitor?
Branko Toić: 00:50:14.617 So what we would like to achieve is actually to try and [inaudible] Nagios or at least it’s [inaudible] right now for alerting, because we would like to lower the footprint of our monitoring stack. We already achieved a large reduction of moving parts in our monitoring stack by replacing Ganglia and all the other bridges and intermediaries with InfluxDB. We would also like to use Kapacitor for alerting. We do have some very specific alerting rules. For example, our clients have different kinds of requirements, and there is a very large number of hosts that we are monitoring. So, for example, on one host the alert will go off at certain thresholds, while on another host it will go off at different thresholds, and this is very hard to generalize. So we are not just maintaining one single cluster. We are maintaining multiple different kinds of clusters.
Saša Tešković: 00:51:35.418 Yeah. I will just add that, with the current stable version of Kapacitor, we found it very hard to templatize its configuration and distribute it with the Puppet configuration management system. So a while back, we actually started to look into the Flux query language because we found that it might be much easier. We are waiting for a stable release of the Flux query language and a new version of InfluxDB. Then we will, again, revisit this idea of fully utilizing Kapacitor in our production environment.
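The per-host threshold problem raised in this answer can be pictured with a small Python sketch. It is not Kapacitor or Nagios configuration, and the metric names, hostnames, and threshold values are invented for illustration. It only shows the "global default plus per-host override" lookup that makes such rules hard to express as a single generic template.

```python
# Hypothetical fleet-wide defaults.
DEFAULT_THRESHOLDS = {"load1": 8.0, "disk_used_percent": 90.0}

# Hypothetical per-host exceptions, e.g. a database host that normally
# runs at a higher load than the rest of the fleet.
HOST_OVERRIDES = {
    "db1.client-a": {"load1": 16.0},
}


def threshold_for(host, metric):
    """Return the host-specific threshold, falling back to the default."""
    return HOST_OVERRIDES.get(host, {}).get(metric, DEFAULT_THRESHOLDS[metric])


def breached(host, metric, value):
    """True when the observed value reaches the applicable threshold."""
    return value >= threshold_for(host, metric)


if __name__ == "__main__":
    print(breached("web1.client-a", "load1", 9.5))  # default threshold applies
    print(breached("db1.client-a", "load1", 9.5))   # override applies
```

Keeping overrides as data rather than as separate alert scripts is one way to make such rules easier to distribute with a configuration management system.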
Caitlin Croft: 00:52:25.063 So do you think you will - oh, okay. Never mind. It looks like you answered the person’s question. Well, this was a great presentation. Oh, it looks like someone’s raising their hand. Let me see. Oh, looks like we’re all set. So this session has been recorded. So for any of you who would like to re-watch it, it will be available for replay by tonight. And if any of you have any questions that you think of after the fact - I know that happens to me where I join a webinar, and then right afterwards, I think of a question that I wanted to ask the speakers - all of you should have my email address, so if you want to email me with any further questions, I’m happy to connect you with Saša and Branko. Thank you very much, everyone, for joining, and thank you, Saša and Branko, for presenting on how Sysbee is using InfluxDB. It was a great presentation.
Branko Toić: 00:53:31.413 Thank you for having us.
Saša Tešković: 00:53:32.569 Yeah, it was a joy.
Caitlin Croft: 00:53:35.223 Glad to hear it. Well, thank you very much, and we’ll talk to you all soon. Bye.
Branko Toić: 00:53:42.103 Bye-bye.
Linux System Engineer, Sysbee
Saša has been part of the Sysbee (and previously Plus Hosting's) core team for over 12 years now. Saša is well versed in planning, implementation and maintenance of private clouds, VPS and dedicated server hosting as well as shared hosting infrastructures.
Linux System Engineer, Sysbee
Branko is well versed in monitoring systems, as he's been a part of Sysbee's (and Plus Hosting's) core team for over 15 years now. As our CTO, Branko helped build the foundation for our excellent infrastructure, methods and IT-related processes which we're using in our day-to-day work to ensure a fantastic client experience for our customers.