Webinar Date: 2018-08-01 08:00:00 (Pacific Time)
Monitoring distributed systems is not a trivial task. There are many non-obvious obstacles in your way, and many solutions for the various monitoring tasks. In this webinar, Alex Tavgen, Technical Architect at Playtech, will share what their analysis of successful and not-so-successful monitoring projects showed them, and last but not least, why they decided to build Yet Another Alert System using InfluxDB, Alerta, and Grafana.
Watch the webinar “How Playtech is Using InfluxDB to Monitor Their Distributed Systems” by filling out the form and clicking on the download button on the right. This will open the recording.
Here is a transcript of the webinar “How Playtech is Using InfluxDB to Monitor Their Distributed Systems”, provided for those who prefer to read rather than watch. Please note that the transcript has been only lightly edited; we apologize for any remaining transcription errors.
• Chris Churilo: Director Product Marketing, InfluxData
• Alex Tavgen: Technical Architect, Playtech
Chris Churilo 00:00:00.989 All right. Three minutes after the hour. If you guys have ever been on any of my webinars, you know that’s my starting point. So good morning, good afternoon, good evening. My name is Chris Churilo, and I work at InfluxData. And today, we have a really wonderful use case with one of our fabulous users, Alex from Playtech, and he will be going over how they use InfluxData. And if you do have any questions at any point, please put your questions in the Q&A or the chat panel of the Zoom application. And if you want to speak your questions aloud, just raise your hand and at the end of the webinar, I’ll be able to open the lines. This session is being recorded. After a quick edit, I will post it to the website. And the URL will be the same URL that I use for the registration of the webinar today. But an automatic email will be sent tomorrow morning, so you’ll also have that in your email box in the morning. I typically post the slides to SlideShare as well, so you will be able to have both of those things at your fingertips and take another listen. So without any further ado, I’m going to hand it over to Aleksandr and let him share the screen and get started.
Aleksandr Tavgen 00:01:15.995 Hi. Hello?
Chris Churilo 00:01:19.677 Hello.
Aleksandr Tavgen 00:01:20.898 Hello. My name is Aleksandr Tavgen. I’m from Estonia, and I work as a software architect at Playtech. I started working in the IT sector more than 20 years ago. My first touch with data science was in 2005 at Tele2 Sweden, though it wasn’t called that back then. And last year, we worked on a very interesting use case. So let’s start from the beginning.
Aleksandr Tavgen 00:02:06.299 At Playtech, we have been facing a lot of issues with the early detection of outages and other problems. So when we talk about understanding system behavior, we should have observability. If we start from the basics, logging is the first line of defense for understanding what’s going on with your system, with your program, in real time in production. And there are a lot of solutions built on log analysis, such as Splunk or famous open-source solutions like the Elastic stack.
Aleksandr Tavgen 00:02:58.527 But there are two quite different approaches to monitoring. If you take low-level monitoring, which is monitoring of infrastructure: CPU, disk, memory usage, networking; in most cases, applying some simple thresholds is enough. For example, in the case of Java Virtual Machines, we can count garbage collection runs, allocation rates, and so on. However, low-level monitoring cannot reflect business logic processes in an adequate way, because if you have a lot of interdependencies between various component services, or dependencies on third-party services, then low-level monitoring will not give you information about business processes.
Aleksandr Tavgen 00:03:55.391 So if we talk about high-level monitoring, this means that we should monitor business indicators or key performance indicators. For example, user sessions, payments, transactions. You name it. And this makes it possible to indirectly monitor complex system behavior, especially in the case of distributed systems, when you have different sites and different data centers. And the main reason this is difficult is that various problems may lead to non-obvious system behavior. Various metrics may have different correlations in time and space. Even monitoring a complex application is a significant engineering endeavor in and of itself.
Aleksandr Tavgen 00:04:53.045 So if we talk about measurements, even at this stage you can face a lot of problems. For example, you can collect your business metrics from the database and run analytics on that, but it means your production database needs more and more resources, because it is not optimized for working with time-series data. So there are other possibilities for building that process, for building such systems. If we take some earlier examples such as Zabbix, where everything was put in one place, then modern systems are moving towards distributed solutions, where you have a time-series database, specific monitoring components, some visualization like Grafana, and so on. Because what the customer sees is not what is going on in the backend.
Aleksandr Tavgen 00:06:11.529 At Playtech, we had quite a bad solution, which was originally from Hewlett-Packard. According to the marketing prospectus, it should solve almost everything: a solution based on machine learning that can predict your outages. But actually, it didn’t work. It had a huge number of false positives. It was something of a black box, with no possibility to tune or improve the settings. And most important, we got information about outages only after our customers had already observed them in real time, so it was far too late and didn’t fit for us at all.
Aleksandr Tavgen 00:07:16.133 So if we talk about analyzing successful and unsuccessful monitoring systems, I want to present two cases. One of them is Etsy. Etsy is an online marketplace for handmade goods with headquarters in New York. Their engineering team collected more than 250 different metrics from their servers and tried to find anomalies using complex math approaches. So meet Kale. One of the problems was that the system was built using different stacks and frameworks. In the picture, you can see four different stacks plus two frameworks. As a result, highly qualified engineers with experience in different languages and frameworks were required for maintaining and developing such a system, as well as for fixing any bugs that were found.
Aleksandr Tavgen 00:08:26.785 Their approach to monitoring was also problematic. Kale searches for anomalies or outliers. However, if you have thousands or hundreds of thousands of metrics, every heartbeat of the system will inevitably produce a lot of outliers, simply because of statistical deviations. Etsy engineers made a futile attempt to combat the false positives, and finally the project was closed. Andrew Clark, a data scientist from Etsy, discussed that in a very good video, and here I want to present and emphasize one slide from it. So here it is.
Aleksandr Tavgen 00:09:14.852 Firstly, anomaly detection is more than just outlier detection. There will always be outliers in any real production data, even in the case of completely normal system behavior. Therefore, not all outliers should be flagged as anomalous incidents at any level. Second, a one-size-fits-all approach will probably not fit anything at all; there are no free lunches. And the reason the Hewlett-Packard solution does not work as expected is that it tries to be a universal instrument. And finally, the most interesting part of the slide addresses possible approaches to solving these problems: alerts should only be sent out when anomalies are detected in business and user metrics, and anomalies in other metrics should be used for root cause analysis.
Aleksandr Tavgen 00:10:18.569 The next case, which is more successful than Etsy’s Kale, is the experience of Google’s SRE teams with Borgmon. According to the authors of the system, Google has trended towards a simpler and faster monitoring system with better tools for post hoc analysis. Based on these principles, Google’s engineers built a system where rules are kept as simple as possible, which makes it possible to detect very simple, specific, and severe anomalies very quickly. They tried to use some sophisticated machine learning algorithms, but they found that those couldn’t work on critical parts of the system. You can use fancy machine learning algorithms for predicting marketing campaigns and so on, but it won’t work for KPIs and business metrics.
Aleksandr Tavgen 00:11:32.596 So let’s discuss a short case from Playtech. We had more than 50 sites with a lot of brands located in different parts of the world. We had a lot of products, a lot of services, a lot of configuration. And sometimes even a stupid error or mistake in a configuration file, or some certificate problem, could ruin a lot of services. The tool we had from Hewlett-Packard had very low efficiency, a lot of false positives, and horrible operability. And that was the motivation for building a predictive alert system, which should be able to detect degradation before end-users do.
Aleksandr Tavgen 00:12:38.139 So let’s start from the beginning. What is time-series data? A time series is a series of data points indexed, listed, or graphed chronologically. Economic processes have a regular structure: for example, the number of sales in a store, the performance of a campaign, online transactions. Usually they exhibit seasonal dynamics and a trend line, and using this information simplifies analysis.
Aleksandr Tavgen 00:13:13.099 Let’s take a very simple example of time-series data: a stationary time series, which means that however you shift this data in time, its statistical characteristics do not change. You can describe this type of data with two parameters: mean and standard deviation. All other time series are non-stationary: they have a trend line, as you can see, or a change in dispersion, or some change in covariance. And if we look at economic processes, we can see that a lot of them have trend lines and some irregularity.
Aleksandr Tavgen 00:14:23.596 So let’s take our sample data. It could be whatever; say, the number of transactions in an online store. You can see on the left an increase in activity, which means that this is Friday. Then on Saturday we have more and more activity, and on Sunday, less. And then we have our working days, with normal, regular values. We can see strong regularity and trend lines in this data. So let’s make a simple analysis. If we cut our data in half, we can clearly see a trend line here, which means this data is non-stationary. So we can use a very simple approach, linear regression, to interpolate this data with a line. And we can see that if we subtract this trend line from our initial data, the result looks like a stationary time series. It means that we have captured the signal, and we can prove that this residual time series is stationary.
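The detrending step he describes can be sketched in a few lines of dependency-free Python. This is an illustrative reconstruction, not Playtech’s code: the synthetic series and the function names (`fit_line`, `detrend`) are assumptions for the sake of the example.

```python
# Sketch of the detrending step described above: fit a least-squares
# line to the series, subtract it, and check that the residual has a
# near-zero mean. Pure stdlib; the data is synthetic, not Playtech's.

def fit_line(ys):
    """Ordinary least squares for y = a*x + b with x = 0..n-1."""
    n = len(ys)
    xs = range(n)
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b

def detrend(ys):
    """Subtract the fitted trend line from the series."""
    a, b = fit_line(ys)
    return [y - (a * x + b) for x, y in enumerate(ys)]

# Synthetic "transactions per minute": a rising trend plus a repeating
# wobble standing in for the weekly seasonality described in the talk.
series = [100 + 0.5 * t + (5 if t % 10 < 5 else -5) for t in range(200)]

residual = detrend(series)
print(round(abs(sum(residual)) / len(series), 6))  # residual mean magnitude: 0.0
```

With an intercept in the fit, the residual mean is zero by construction; the interesting check, which the next part of the talk covers, is whether the residual is actually stationary.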
Aleksandr Tavgen 00:15:47.729 There is a method called the Dickey-Fuller test, which tests time-series data for stationarity. If we run this test on our initial data, we can see that the autocorrelation is very high and the test result is near one, which means the test rejects our stationarity hypothesis. If we run it against our residuals, the result is near zero, which means the hypothesis of stationarity is acceptable. Why do we use that? Assume that we have a model that describes our signal with some precision. We can subtract the model values from our measurements. And the more our model resembles the real signal, the more our residue will approximate the error component: stationary white noise. And we can easily check whether our time series is stationary or not.
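In practice the Dickey-Fuller test usually comes from a statistics library (for example, statsmodels’ `adfuller`). As a dependency-free stand-in, a lag-1 autocorrelation check shows the same contrast he describes: close to 1 for a trending series, close to 0 for white noise. This is a simplified proxy, not the actual test:

```python
# Lag-1 autocorrelation as a lightweight stand-in for the stationarity
# contrast described above: near 1 for trending data, near 0 for noise.
import random

def lag1_autocorr(ys):
    """Correlation of the series with itself shifted by one step."""
    n = len(ys)
    mean = sum(ys) / n
    var = sum((y - mean) ** 2 for y in ys)
    cov = sum((ys[t] - mean) * (ys[t + 1] - mean) for t in range(n - 1))
    return cov / var

random.seed(42)
trending = [0.5 * t + random.gauss(0, 1) for t in range(500)]  # non-stationary
noise = [random.gauss(0, 1) for _ in range(500)]               # stationary

print(lag1_autocorr(trending))  # close to 1
print(lag1_autocorr(noise))     # close to 0
```

The real Dickey-Fuller test does more (it regresses differences on lagged levels and compares against tabulated critical values), but the intuition is the same: strong persistence in the series argues against stationarity.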
Aleksandr Tavgen 00:16:53.489 If we take a few more examples: the data is chunked into half-hour pieces, each approximated by a regression line, and put together, so we approximate our noisy data with these lines. If we test it after subtraction, we can see that the residual is a stationary time series. This is one of the clearer approaches for modeling data and proving that the model captures the signal. So it looks okay on our test. In practice, I am using moving statistics. A moving average is essentially a low-pass filter that passes signals with a frequency lower than a certain cutoff; used on time-series data, it makes the results smoother, removes noise, and leaves the main trend lines intact. In the same way, we can collect moving variance statistics and build our model from that. If we fit data from the next week against this model, we can very clearly find outliers.
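A minimal sketch of this moving-statistics idea: keep a rolling mean and standard deviation per metric, and flag points that fall too many deviations away from the window. The class name, window size, and threshold are all illustrative assumptions, not Playtech’s actual model:

```python
# Rolling mean/std model: flag a value as an outlier when it deviates
# from the current window by more than `threshold` standard deviations.
from collections import deque
import math

class MovingModel:
    def __init__(self, window=30, threshold=3.0):
        self.window = deque(maxlen=window)
        self.threshold = threshold

    def update(self, value):
        """Return True if `value` is an outlier vs the current window."""
        outlier = False
        if len(self.window) == self.window.maxlen:  # wait until window fills
            n = len(self.window)
            mean = sum(self.window) / n
            var = sum((v - mean) ** 2 for v in self.window) / n
            std = math.sqrt(var)
            if std > 0 and abs(value - mean) > self.threshold * std:
                outlier = True
        self.window.append(value)
        return outlier

model = MovingModel(window=30, threshold=3.0)
stream = [100 + (t % 7) for t in range(100)]  # regular periodic wobble
stream[80] = 40                               # planted outage-style drop

flags = [t for t, v in enumerate(stream) if model.update(v)]
print(flags)  # only the planted drop at index 80 is flagged
```

The regular wobble never trips the threshold, while the planted drop does, which is exactly the property he wants: seasonal structure is absorbed into the model, and only genuine deviations surface.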
Aleksandr Tavgen 00:18:24.348 I also tried to test some other algorithms, one of which was Twitter’s S-ESD. New and fancy, but no, it didn’t fit for us, because it is more oriented towards low-level monitoring. It works like this: it tries to detect peaks and anomalies on almost any time series.
Aleksandr Tavgen 00:18:55.403 So simple is better than complex; drop everything you don’t need. When we made the initial data analysis on our actual metrics, we started with a simple architecture based on the Influx stack. And why did we choose Influx? Well, what started as this small piece has now grown into something like this. For Playtech, for our company, it was very important to have observability, to understand system behavior and predict possible outages and problems at a very early stage.
Aleksandr Tavgen 00:19:50.827 If we talk a bit about architecture, I chose Python because it is a natural ecosystem for building a data analysis project; the Python ecosystem has countless libraries aimed at statistical analysis and machine learning. And we chose InfluxDB. Why? Because InfluxDB is widely adopted by the community, it is fast, it is reliable, and it comes with great support. We have had almost no problems with InfluxDB, even with large parallel tasks running and fetching huge chunks of data.
Aleksandr Tavgen 00:21:02.752 Our system was built as a set of loosely coupled components, or microservices, each executed in its own Python virtual machine. It’s a natural way of defining the boundaries and responsibilities of each component, which makes it easier to extend the system and add features independently and on the fly, without fear of breaking something. It also makes it possible to perform distributed deployment or implement a scalable solution with just some configuration changes. The system also has an event-driven design: all communication goes through the message queue, and the system works in an asynchronous way. I chose ActiveMQ for implementing the message queue because it is a quite stable and well-known solution. But it could easily be replaced with RabbitMQ because, again, we use standard protocols and are agnostic to the message queue.
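The decoupling he describes can be illustrated without a broker at all. The real system routes messages through ActiveMQ over a standard protocol; this stdlib sketch uses `queue.Queue` as a stand-in so the idea is runnable: the producer and consumer only share the queue and never call each other directly. The component names and message shape are illustrative assumptions:

```python
# Event-driven decoupling sketch: an "Event Streamer" publishes to a
# queue, a "Rule Engine" consumes from it. Neither knows the other.
import queue
import threading

bus = queue.Queue()  # stand-in for an ActiveMQ destination

def event_streamer():
    # In the real system this would be a model-violation message.
    bus.put({"metric": "logins", "violation": "below_model", "value": 12})
    bus.put(None)  # sentinel: tell the consumer to shut down

received = []

def rule_engine():
    while True:
        msg = bus.get()
        if msg is None:
            break
        received.append(msg)  # in reality: match against alert rules

consumer = threading.Thread(target=rule_engine)
consumer.start()
event_streamer()
consumer.join()
print(received[0]["metric"])  # logins
```

Because components interact only through messages, swapping the transport (ActiveMQ for RabbitMQ, or a queue for either) touches configuration, not component logic, which is exactly the point he makes.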
Aleksandr Tavgen 00:22:11.574 If we talk about monitoring, the main stage is collecting the right data, because of the famous phrase, “Garbage in, garbage out.” If you have missing data or holes, it is very hard on the analysis side to guess what is going on in the systems. The first part, which holds the statistics, is called the Event Streamer. We keep an ensemble of models for every metric. On every heartbeat of the system, this component fetches data from Influx and tests it against the statistical models. If a violation is found, it sends a message with all the information about this violation. This is, in a way, the first information emission layer. We as human beings actually solve problems in the same way: if we notice that something strange is happening in one part of the system, we check the other parts, propose some hypotheses, and try to disprove them. The second layer is the information consumption layer, which is called the Rule Engine. Every message with violation information is collected, matched, and analyzed against different rules. So we can set up correlations between different metrics, different thresholds, different speeds of degradation.
Aleksandr Tavgen 00:24:13.989 Going further: as I said, the system is mostly vanilla Python, and the benefit is a very low footprint. Everything is running right now on one virtual machine with eight gigabytes of RAM and two virtual cores. It has a very low footprint on the system and on the network. We can replay historical data and tune the system however we want, and all this took less than nine months of research and development. Here you can see some of the user interfaces we have. This is the part where we can tune and adjust our models. We started with just one model. But if you had an outage last week, then you will have a hole in your data; and some systems have higher activity at the beginning of the month, and so on. So it was necessary to have an ensemble of models. We combine four different models, and finally we make [inaudible]. We can easily turn models on or off, or increase the weight of some models in our system.
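One plausible reading of the weighted ensemble he describes is a weighted vote: each model predicts an expected value for the metric, and an alert fires when models representing enough weight agree that the observation violates expectations. The four models, their weights, the tolerance, and the quorum here are all illustrative assumptions, not Playtech’s actual configuration:

```python
# Weighted model-ensemble sketch: several simple models each predict a
# metric's expected value; a weighted vote decides whether to alert.

def violates(expected, observed, tolerance=0.3):
    """A model 'votes' violation if observed is >30% below expected."""
    return observed < expected * (1 - tolerance)

def ensemble_violation(model_predictions, observed, weights, quorum=0.5):
    """Weighted vote across models; alert if the quorum is reached."""
    total = sum(weights)
    voted = sum(w for pred, w in zip(model_predictions, weights)
                if violates(pred, observed))
    return voted / total >= quorum

# Hypothetical predictions from e.g. last-week, last-month, trend,
# and seasonal models, with tunable weights (models can be turned
# up, down, or effectively off by setting weight to zero).
predictions = [1000, 950, 1100, 980]
weights = [0.4, 0.3, 0.2, 0.1]

print(ensemble_violation(predictions, observed=500, weights=weights))  # True
print(ensemble_violation(predictions, observed=960, weights=weights))  # False
```

The appeal of the ensemble is robustness: a hole in last week’s data (an outage) breaks one model’s prediction, but the weighted vote across the others still behaves sensibly.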
Aleksandr Tavgen 00:26:00.107 We also have a reporting engine, which means that after an alert is emitted, after a rule has matched, a URL with the alert description is created, and the report engine generates a report on demand, so you can see which metrics went down and what it looks like. One thing about the Rule Engine: in the beginning, in the proof of concept, the rules were hardcoded. But that is not a feasible long-term solution when you have to change and experiment with your rules. So we created a language based on YAML in which we can define the rules. We can define conditions on a metric with different parameters, and we have logical operators, which means we can be very precise in every type of alert rule. We have a speed parameter: speed means how fast your metric is degrading. For serious outages, when you have an almost immediate drop in some critical metric, the speed will be higher than for slowly degrading metrics. In the beginning, because we wanted to be agnostic of the data storage, we built a component which calculates the derivatives, meaning the speed. But Influx has a very good engine for applying math functions, taking derivatives, taking means, and so on, on the database side. So it is easier to make shorter requests than to maintain an application-side engine.
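A sketch of what evaluating such a rule might look like. The real rules live in their YAML-based DSL; the dict below stands in for one parsed rule (a YAML loader would yield the same structure), and the field names (`min_value`, `max_drop_speed`) are invented for illustration. The "speed" is a discrete derivative, which, as he notes, InfluxQL can also compute server-side with its `DERIVATIVE()` function:

```python
# Rule-evaluation sketch: a threshold condition plus a "speed"
# (degradation-rate) condition, as described for the Rule Engine.

rule = {
    "metric": "payments_per_min",
    "min_value": 50,        # absolute floor for the metric
    "max_drop_speed": -20,  # alert if falling faster than 20 units/tick
}

def speed(values):
    """Discrete derivative over the last step."""
    return values[-1] - values[-2]

def evaluate(rule, values):
    """Return the list of rule conditions violated by recent values."""
    alerts = []
    if values[-1] < rule["min_value"]:
        alerts.append("below_floor")
    if speed(values) < rule["max_drop_speed"]:
        alerts.append("fast_drop")
    return alerts

print(evaluate(rule, [120, 115, 110]))  # []  (slow, normal decline)
print(evaluate(rule, [120, 110, 60]))   # ['fast_drop']
print(evaluate(rule, [120, 80, 30]))    # ['below_floor', 'fast_drop']
```

Separating the floor from the speed condition captures his point: a serious outage shows up as a fast drop long before the metric actually reaches zero.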
Aleksandr Tavgen 00:28:23.720 Right now we actually have quite interesting research in this field, done by an intern; she is preparing her master’s thesis, and we decided to test a deep learning approach. It is quite easy to see that if neural networks can detect text and pictures, you can use the same approach to detect outages in your metrics. Last summer, an interesting academic publication was made in this field, though it was tested against financial data. We made some initial measurements. You can imagine the time-series data from your metrics stacked one on top of another, and see it as one huge picture; and with a window that slides over these stacked metrics, you can train your neural network to detect outages, if you have labeled data about when all those outages occurred, and you can test it. And as I said, we had quite good results. But again, we need some more months of measurements and analysis of the results.
Aleksandr Tavgen 00:30:16.858 This is the last slide, but I want to talk more about approaches for the future of time-series analysis because, as you know, software is “eating” the world. We have huge progress in creating new technologies: self-driving cars are almost here, the Internet of Things is growing, and our software architecture is moving more and more to public clouds. All this means that we as mankind will have a huge stream of data every minute, every hour, and a new type of approach is needed for that. The world, software development, manufacturing, agriculture, should have more observability of their business and their activities, because even small improvements in the energy sector or in agriculture can be beneficial for everybody. We can see that nowadays more and more companies and products are moving towards autonomous solutions, based on some sort of machine learning, which can remove human labor and human routine from monitoring and from the process of assuring that the system works. And most important, if executives and decision-makers have better information and better observability in every field, then the effectiveness of all processes can keep increasing. And this is good for everybody, because we can do better with fewer resources.
Aleksandr Tavgen 00:33:11.661 It was quite an interesting adventure developing this kind of system, because you see a lot of mistakes that were made before you, and some mistakes that you have made yourself. We are now constantly expanding the system and adding new functionality. We want to build business intelligence in near real time. Again, if you have some metrics, some key performance indicators, in a time-series database, then you can visualize them, observe them in an automated way, and bring more meaningful information to all the parties who need it.
Aleksandr Tavgen 00:34:13.640 So this was a short story about this system. And if you have some questions, I’m ready to answer them [crosstalk]. Yeah.
Chris Churilo 00:34:34.888 Cool. Well, if anybody does have any questions, please look into your Zoom application. You should see a chat panel or a Q&A panel button. Just click on that and you can type in your questions and ask Alex any of the really tough questions. And as we’re waiting for questions to come through, I just wanted to commend you on what you were saying about observability, because that’s something we’ve been discussing quite a bit internally. It’s no longer about monitoring, right? It’s really about understanding, as you stated. Because you’re right, everything is software; containers are basically software. And I think the only way we’re going to survive as businesses is by collecting those small improvements that you can identify within your systems by observing, as you mentioned, the business activities as well as the system activities. Those things can really add up and have a huge impact.
Aleksandr Tavgen 00:35:39.776 And again, so-called classical companies in the energy sector, in agriculture, have huge inertia. For them it is very hard, sometimes not so easy, to adopt those kinds of technologies. But I also read an article from your CTO presenting a use case where a wind turbine was turned into the wind in a 5% more optimal way. That means a very big efficiency improvement for the energy sector. And I think we have a huge amount of data. We could use health services data to understand, for example, whether some critical problems could arise; I mean that even adopting this kind of system in health insurance could be effective as well. So it’s not only about software.
Chris Churilo 00:37:10.631 Yeah. No, you’re absolutely right. And yeah, in that industrial IoT sector, I think what we’re seeing is that there are certain sectors in there that have been moving a lot faster. So we’ve been seeing a lot of movement in the renewable energy sectors adopting the same kind of approach that you guys have taken at Playtech. And my hypothesis is that it’s probably because that is a newer sector to begin with. They’re probably not as heavy in—they don’t have as heavy ties with local government agencies, right? Some of the more traditional energy sectors have—
Aleksandr Tavgen 00:37:51.383 Yep.
Chris Churilo 00:37:51.894 It’s not easy for them to be quite as nimble. I also think that there—I mean, just renewable energy in itself is kind of a new and forward-thinking concept. So I think it’s just natural for people in that area to adopt different and more forward-thinking ways. But you’re right. There is a lot of data. We’re definitely not taking advantage of it. I feel like the other thing—and I want to talk a little more about this with you. I think there’s still a lot of people, even in software, where they’re still monitoring. And monitoring to me is just not enough, right? It’s not really getting to what’s actually happening with the business.
Aleksandr Tavgen 00:38:38.036 Right. Yeah.
Chris Churilo 00:38:40.726 So maybe you can talk a little bit about that. And what does monitoring mean to you in Playtech versus observability?
Aleksandr Tavgen 00:38:49.584 I would say that observability is wider than monitoring. Monitoring is about getting assurance that your system behavior, or your business, or whatever, is going the right way. But observability means a deeper understanding of how the system behaves. I mean that after implementing this project and getting results, I found that domain experts in the company, who had assumptions about parts of the system’s behavior, were totally wrong. A lot of things that everybody thought were obvious turned out not to be like that at all. People have psychological biases, especially when you have worked as a domain expert in some field for years. And observability means that you cannot prove your hypothesis, but you can disprove it. You get more information, which gives you more insight into what can be proved, what can be solved, and so on. So monitoring is just a piece of that.
Aleksandr Tavgen 00:40:47.342 And when we want to go further, we need to build an analytical system which finds correlations between different parts of different systems. A very simple example: you have an online shop, and you collect transaction activity as well as logins and sign-ups. Assume you have a problem with your login service. You still have users shopping on your site, but you get no new customers, and your activity begins to decrease. So there is a correlation, shifted in time, between these two metrics. If we take more and more different indicators and metrics, we can find those connections between them. And when you understand those patterns, you can create a system which predicts: if logins begin to decrease, the system can tell you immediately, without waiting for the drop in activity, that you will face a problem in, say, 15 minutes or half an hour.
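The time-shifted correlation he describes can be found mechanically: correlate the leading metric against the following metric at each candidate lag and keep the best one. A stdlib sketch with synthetic data (the 15-tick lag and the series are invented for illustration):

```python
# Lagged-correlation sketch: logins lead overall activity, so the lag
# with the highest correlation tells us how much warning time we get.
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def best_lag(leader, follower, max_lag=30):
    """Find the shift (in ticks) where leader best predicts follower."""
    scored = [(lag, pearson(leader[:-lag], follower[lag:]))
              for lag in range(1, max_lag + 1)]
    return max(scored, key=lambda p: p[1])

# Synthetic example: activity copies the login signal 15 ticks later
# (a login bump between ticks 40 and 60 reappears in activity at 55-75).
logins = [100 + (10 if 40 <= t < 60 else 0) for t in range(200)]
activity = [logins[t - 15] * 5 for t in range(200)]

lag, corr = best_lag(logins, activity)
print(lag)  # 15
```

Once the lag is known, a drop in the leading metric becomes an early warning: the system can alert on logins degrading before activity has visibly fallen.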
Aleksandr Tavgen 00:42:24.883 This is a very simple example. But if we talk about the Internet of Things, where you have different sensors, weather sensors, humidity sensors on the fields, then it is possible again to predict some outcomes based on the collected data. We as human beings are not very effective at analyzing huge chunks of data, and we see right now a very large number of systems trying to solve this problem. Splunk is moving towards machine learning; Kibana and the Elastic stack have this; AppDynamics has algorithms which analyze application behavior. But all those systems are oriented towards specific software problems. If we talk about things like self-driving cars, which I hope are our future, then we need platforms that go beyond software development. And I think that is a direction in which some new markets will arise and grow pretty fast soon.
Chris Churilo 00:44:21.314 Cool. So we actually have a question from Vladimir, and it’s actually a question I had in my mind too. So could you give an example of an incident or a malicious application behavior that was detected by the system that you built at Playtech?
Aleksandr Tavgen 00:44:36.130 As a company which operates financial transactions, you will always see attack attempts: for example, user enumeration, when scripts try to guess passwords or usernames. All those kinds of things actually have a very clear pattern in the data. If somebody runs an enumeration attack, you will see a huge number of failed login attempts. If normally you have, let’s say, 1% failed logins, and that amount increases to 5 or 10 percent, it’s a brute-force login attempt. We also had problems with configuration: you make a deployment, you have some mistakes in your configuration, and some services don’t start well. We detect that earlier, because we see that some critical activities, for which we have special rules, begin degrading. So we can catch all those moments pretty fast. We catch a number of incidents every week, and usually this gives us a very quick reaction. If an outage lasts 15 minutes longer, it means lost customers and lost customer satisfaction as well.
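The failed-login check he describes is simple enough to sketch directly: compare the current failure ratio against a multiple of the normal baseline. The function name, the 1% baseline, and the 5x factor are illustrative assumptions matching the numbers mentioned in the talk:

```python
# Failed-login ratio check: ~1% failures is normal background noise;
# a jump to 5-10% suggests an enumeration / brute-force attempt.

def login_alert(failed, total, baseline=0.01, factor=5):
    """Alert when the failure ratio exceeds `factor` x the baseline."""
    if total == 0:
        return False  # no traffic, nothing to judge
    return failed / total > baseline * factor

print(login_alert(failed=12, total=1000))  # False (1.2%: normal-ish)
print(login_alert(failed=80, total=1000))  # True  (8%: looks like an attack)
```

In the system described, a rule like this would run per heartbeat against ratios pulled from InfluxDB, with the threshold tuned per site rather than hardcoded.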
Chris Churilo 00:46:48.985 Cool. Vladimir, just let us know if that answered your question. I mean, it certainly did for me. That makes total sense, yeah, because those things can definitely be either an indicator of, yeah, a malicious attack or users actually having problems getting in. And trying to pinpoint that as quick as possible is pretty critical to your business. So you mentioned some other tools that are out there, so AppDynamics, etc. So why did you guys choose to—why did you choose not to use those different solutions? Or do you use—?
Aleksandr Tavgen 00:47:26.515 Yeah, it’s a very good question, actually, because, as I said, you can analyze your system and application behavior in different ways. For example, if you store all your critical data in a database, you can query that data in a regular way, and for high-level monitoring you don’t need application-based instrumentation. But it means you will have an increased load on your database, whereas a time-series database is optimized for this specific usage: there are no updates, data is only added, and the latest data is queried more often than older data. If you want better observability on the application side, if you want to see why some function calls take a long time, you probably need an instrumentation-based solution on the application side, which [inaudible], for example. But that means you need agents on your application side, and those agents will consume computational resources as well. And again, there are different levels of monitoring: the low level, where you are just watching resources, disk usage, networking, and the JVM; and application monitoring on the application side, which sits somewhere between high and low level, but closer to the low-level side, especially if you have a lot of containers and so on. And we have a lot of applications which behave in a nondeterministic way.
Aleksandr Tavgen 00:49:37.935 Another way to understand how your application works is to collect logs. Splunk and Elastic have solutions which also create metrics based on different types of logs and apply some machine learning. But again, if you have a large system, you have a huge, huge amount of logs. 90% of those logs are not interesting, and usually it’s very important to find just those 5 or 10 percent of logs. So you store a huge amount of information, and it’s not trivial to create a good engine for analyzing those logs.
Aleksandr Tavgen 00:50:32.211 Another possibility is just to collect, at regular intervals from your production databases, little pieces of information—for example, transactions from the last minute, or logins from the last minute or last hour, whatever—and store them in a solution which is specially designed for that. So if we’re talking about Influx, we can store those types of events there, and it means we can work with this data without increasing the load on our production database. We can replay our outages. Our system also has the ability to replay a whole incident and tune models. You have the possibility to create [inaudible] utilization because, again, you should always check what your system tells you. If we have outages, we need to verify with our own eyes.
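The pattern Aleksandr describes—pulling small per-minute aggregates out of a production database and pushing them into InfluxDB—can be sketched roughly like this. The measurement name, tag, and values below are illustrative assumptions, not Playtech’s actual schema; only the InfluxDB line-protocol format itself is taken from the Influx documentation:

```python
import time

def to_line_protocol(measurement, tags, fields, ts_ns=None):
    """Format one InfluxDB line-protocol point, e.g.
    'logins,site=eu count=42i 1533110400000000000'.
    Integer fields get the 'i' suffix InfluxDB expects."""
    tag_part = "".join(f",{k}={v}" for k, v in sorted(tags.items()))
    field_part = ",".join(
        f"{k}={v}i" if isinstance(v, int) else f"{k}={v}"
        for k, v in fields.items()
    )
    ts = ts_ns if ts_ns is not None else time.time_ns()
    return f"{measurement}{tag_part} {field_part} {ts}"

# One point per minute: e.g. how many logins the production DB reported.
point = to_line_protocol("logins", {"site": "eu"}, {"count": 42},
                         ts_ns=1533110400000000000)
```

Lines like this can then be POSTed in batches to InfluxDB’s `/write` HTTP endpoint, so every later query, replay, or model-tuning run hits Influx instead of the production database.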
Aleksandr Tavgen 00:51:43.647 So I think that if we’re talking about high-level monitoring or high-level observability, especially in our specific corner of the software industry, then this is the most effective way of doing it, because the others serve quite different purposes. So again, there is low-level, high-level, and so-called mid-level monitoring. They are different things, and if you mix them up, you will fail, because there is no one solution which fits all.
Chris Churilo 00:52:24.535 Right. Right. Yeah, I completely agree with that, and with what you’re saying about low-level versus high-level. Very cool. I also appreciated that you pointed out the difference between what you call stationary versus non-stationary time series. We actually have the same idea; we just call them metrics and events. So metrics are the stationary time series, and events are the non-stationary time series. So it’s really nice to meet somebody where we have kind of the same concepts and ideas about observability, and also directionally about how we need to figure out how to fine-tune these systems—first, to be able to understand the relationships between all the components within a system and understand what’s causing the behavior, and then to be able to do something about it. So thank you very much for this.
Chris Churilo 00:53:32.134 And it looks like you answered Vladimir’s question, so that’s cool. So if there’s any other questions, I’ll leave the line open for the next couple of minutes. And more often than not, you might walk away from this webinar and then have a question. And if you do, just shoot me a note and I’ll be happy to forward it to Aleksandr and he’ll be able to answer your questions after this webinar. Any other final thoughts about your system? Just looking at this diagram, it feels like you guys really put a lot of thought behind this.
Aleksandr Tavgen 00:54:08.516 One thing which is pretty common right now—and I know some development teams working on this in Spain, Italy, Switzerland, and in Asia as well—is analyzing financial data. And financial data is time-series data. We have many more sources of information right now, given crypto exchanges and all the information about stocks and so on. Right now, those development teams are creating different types of bots for arbitrage, for analyzing or managing portfolios, whatever. Actually, they are trying to build a [inaudible] pretty often. And I have a recommendation: if you want to start with something like that, there is a very good combination. With Telegraf and InfluxDB, you can fetch data at regular intervals and have it stored in the database for you. Then you only need to work on the functionality you actually require, because, as I’ve seen, different teams try to reinvent the wheel. They put a lot of effort into that—effort which could instead go into solving their problems rather than into how they fetch and store information. And now it’s very easy: with Docker, you can just take the images and run Grafana, InfluxDB, and Telegraf within minutes. So it’s becoming easier and easier to start with this kind of thing.
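The quick-start Aleksandr describes—pulling the three images and running them together—can look roughly like this minimal docker-compose file. The version tags, ports, and config path are illustrative assumptions (the official `influxdb`, `telegraf`, and `grafana/grafana` images are all on Docker Hub, and these were current releases around the time of this webinar):

```yaml
version: "3"
services:
  influxdb:
    image: influxdb:1.6
    ports:
      - "8086:8086"            # HTTP API used by Telegraf and Grafana
  telegraf:
    image: telegraf:1.7
    volumes:
      - ./telegraf.conf:/etc/telegraf/telegraf.conf:ro
    depends_on:
      - influxdb
  grafana:
    image: grafana/grafana:5.2
    ports:
      - "3000:3000"            # Grafana UI
    depends_on:
      - influxdb
```

After `docker-compose up -d`, Grafana is reachable on localhost:3000 and can be pointed at `http://influxdb:8086` as an InfluxDB data source, leaving only the domain-specific work—the Telegraf inputs and the dashboards—to build yourself.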
Chris Churilo 00:56:09.767 Cool. I am so glad that we had you on our webinar today. It’s been very informative and actually inspiring for me, and hopefully for everybody on our webinar today. We are almost at the top of the hour, and it looks like we don’t have any more questions, at least not right now. But like I mentioned, if you do have a question afterwards, just send it to me and I’ll forward it over to Aleksandr. Aleksandr, thank you so much for your great presentation today.
Aleksandr Tavgen 00:56:38.450 Thank you. Thank you very much.
Chris Churilo 00:56:40.589 I’m so glad to get the chance to work with you.
Aleksandr Tavgen 00:56:43.838 Yeah. Thank you.
Chris Churilo 00:56:45.163 Thank you.
Aleksandr Tavgen 00:56:46.165 Take care, everybody.
Chris Churilo 00:56:46.330 All right. Bye-bye.