Why Architecting for Disaster Recovery is Important for Your Time Series Data

Session Date: Oct 23, 2018
Time: 8:00am (PT) | 3:00pm (GMT) | 4:00pm (BST)

Time Series data at Capital One consists of Infrastructure, Application, and Business Process Metrics. The combination of these metrics is what the internal stakeholders rely on for observability, which allows them to deliver better service and uptime for their customers, so protecting this critical data with a proven and tested recovery plan is not a “nice to have” but a “must have.”

In this webinar, members of the IT staff, Saravanan Krishnaraju, Rajeev Tomer, and Karl Daman, will share how they built a fault-tolerant solution based on InfluxDB Enterprise and AWS that collects and stores metrics and events. They also use InfluxDB for Machine Learning, which uses the collected time series to model predictions that are then brought back into InfluxDB for real-time access. The team shares the journey they took to architect and build this solution as well as plan and execute their disaster recovery plan.

Watch the Webinar

Watch the webinar “Why Architecting for Disaster Recovery is Important for Your Time Series Data” by filling out the form and clicking on the download button on the right. This will open the recording.

[et_pb_toggle _builder_version="3.17.6" title="Transcript" title_font_size="26" border_width_all="0px" border_width_bottom="1px" module_class="transcript-toggle" closed_toggle_background_color="rgba(255,255,255,0)"]

Here is an unedited transcript of the webinar “Why Architecting for Disaster Recovery is Important for Your Time Series Data”. This is provided for those who prefer to read rather than watch the webinar. Please note that the transcript is raw. We apologize for any transcribing errors.

Speakers:

Chris Churilo: Director Product Marketing, InfluxData
Rajeev Tomer: Sr. Manager of Data Engineering, Capital One
Saravanan “Krisha” Krishnaraju: Master Software Engineer, Capital One
Karl Daman: Software Engineer, Capital One

Chris Churilo 00:00:00.678 So without further ado, I’m going to pass it on to the team at Capital One and have them do some self-introductions. Go for it, guys.

Karl Daman 00:00:12.433 Hey guys. This is Karl Daman. I’m a software engineer at Capital One. And I’ve been working with InfluxData products for about two years, for monitoring purposes, and other needs. And I’m going to be one of the speakers here. So I’m going to pass it to Krisha.

Saravanan Krisha Krishnaraju 00:00:32.027 All right. Hi guys. This is Saravanan Krisha Krishnaraju, I go by Krisha. I’m head of Layered Software Architect and one of the Influx admins here. I’ve been working on the product for a little over a year now, but that all kind of works under Rajeev.

Rajeev Tomer 00:00:47.822 Hi everyone. My name is Rajeev Tomer. And I lead a team of data engineers here in Capital One. And have the responsibility to manage InfluxDB for Enterprise. And, yeah, for the webinar today, and a couple listed-as the webinar title says, right, it’s around the data management tasks in time series, right? What our agenda is we’ll talk about Influx at Capital One. What’s the goal in positioning of Influx at Capital One. And then we’ll go into architecture, right and as Krisha mentioned, that we have evolved this architecture, right, means from one find to another, right? And the series are kind of might show this. And we’ll talk about that right, how means we arrived where we are. Correct? And we will speak to we’ve come from the main area of this discussion is our resiliency, right? And that’s our focus of here for the staff. And then we’ll share some performance metrics. And then at the end, yes, I will open it for Q&A.

Rajeev Tomer 00:02:01.160 So let me start with the Influx at Capital One. So one thing unique about us is, right, that unlike how Influx, at many cases, like is in start with a single tenant database, right? Application has the database for them, but Influx database. What we have in Capital One is a multi-tenant database. Where we have a centralized cluster of Influx and everyone in the organization used that cluster to manage their time series data. And especially for that monitoring domain. So we provide service to enterprise. Where different divisions in the enterprise can put their data in-the time series data, and then they can do, means, whatever they want to do with it regarding whether they want to put money that’s on top of it like others, right? Or they want to report on it. What we see today is that it’s finally being used for a multiple kind of metrics. One is the business transaction metric we see that how people are having the business process. They want to monitor those business process and they have some metrics, right, to understand how their business process is doing. So that kind of-one of the metrics-kind of metrics that we put in our database. And then, of course, they have the reporting on top of it, and alerting if the volume goes down or not.

Rajeev Tomer 00:03:41.621 Another one is our typical infrastructure health metrics like CPU, memory, and all of those kind of things. And we see a lot of databases in our trust [inaudible] and the time series focus is on infrastructure health metrics, and following up on alerts and all those things. Another one is application performance metrics. Well, people use Influx to assess their application performance family, right, whether it’s a web tier, a database tier, or an application tier, right. All the metrics they put them in InfluxDB, and then from there they drive their line chart, and alerting for testing, and all those things. The new one which we are seeing right now is the service adoption metrics. So in our organization, there are centralized services. People provide services and they are interested in seeing how well their services are being adopted, right, how much value they are giving to our organization. And they’re a new kind of metrics coming there, right, and so given this, I have provided this service, okay. Last month it was kind of this, like, okay, how many new clients have started using our service or the adoption has gone down or not. Right? So this way we have a means they understand the performance or effectiveness I would say, right. So when we think about metrics actually help performance, this modified kind of understanding the effectiveness of a service.

Rajeev Tomer 00:05:26.885 So those are the four major types of metrics that our clients use this database for, but there are other things as well in the monitoring domain, how people use our InfluxDB. So let’s go now into architecture a little bit, right, and I will start from the Gen1, our generation one architecture that we had. Initially, we had-as I said, we are enterprise service, right-that we have this centralized cluster and the DR cluster as well, right, and the way we were maintaining the DR cluster was using the backup and restore capability that Influx provides. So as you know that, yeah, backup and restore is kind of a point-in-time recovery capability, that you take incremental backups, or a full backup, and you can restore. In case of kind of a data loss or something you can go back and restore the data, right. And that usually applies to the primary cluster. What we had done initially is that we would use the same utility, we’d take the backup and we’d ship it to the other site. And then whenever we need it, we’ll restore it.

Rajeev Tomer 00:06:54.678 So that was the kind of primary topology one year ago, where we started. Apart from that, from a user’s perspective, the primary input was coming from Splunk-we have a Splunk setup, and there are certain dashboards, they’re Grafana and Influx combination [inaudible]. So for that there, yeah, a lot of volume was coming from Splunk so that we can report. And report on kind of using the line chart ability of the former. That included the DB Influx to support the line dashboard. So that was the primary use case. Apart from that we had direct API-based data loads coming to Influx. And so is Telegraf, as you know, that is a client-side agent for Influx. Telegraf, people use that client-side agent. So that was, yeah, the kind of setup. Primarily used for visualization was Grafana, and people were using it for full testing [inaudible], right, and anomaly detection and forecasting. And at that time we had a retention for our [inaudible], and primarily because, I mean, strictly, you not only use it for operational purpose, right, it’s used for reporting, as well. And that became a little bit challenging for us as we grew, right, as I said, we are enterprise service, as more and more clients came to us.

Rajeev Tomer 00:08:40.604 Now, the 80/20 rule applies here, right. So what we have started seeing is that 20% of recent re-inserted data is used 80% of time, and the other big 80% is hardly used, right. So that’s where, I mean, either by containing the cost or management of [inaudible], right? What we see of this is that it is not sustainable. The high data retention is not sustainable, and this was becoming the challenge. The other challenges we were seeing is we’re seeing some issue with the-our DR solution. The backup and restore is not a means of-it worked fine if we had to use it for the primary site only, but when it comes to the primary site and then recovering the DR site, it was not proving kind of sustainable to us, right. As the data size was growing, we started seeing some problem with it, and we’ll talk about it later, right, more in detail. But now I would like to go to the Generation 2, and it sounds better here, meaning where we started thinking not only about the data retention but the ecosystem around them, around InfluxDB, right? So it means that send them any database, or application, lived in isolation, right. It needs to be-it needs to be integrated with the other things. So yes. We continued to go our visualization through Grafana. That was already set, right. Now, the other thing comes from background is that the equipment would like to explore the data, send the data where we are putting it in InfluxDB in a variety of use cases.

Rajeev Tomer 00:10:28.222 These are time series uses, but at the same time, we had the problem of the high data retention as it was, right. So the climb from clients-I had clients wanted to use the data in the [inaudible] there. And from our [inaudible] side we wanted to solve for this high data retention problem. And from the ecosystem perspective what we started doing is that, hey, we used to take a kind of a daily extract from InfluxDB. We started putting it in a number of data lakes. All right. Just a quick one on the data lake, because at Capital One we have data lakes, it’s S3-based and it gives the capability to all of our users to perform analysis on the structured as well as unstructured data. And this kind of means that the role of the lake, from an Influx perspective. Once we copied the data into data lake, right, will we see its shape? This data lake is becoming now an offline storage for Influx. So it means that, hey, the online storage in InfluxDB which has certain retention, is not for our database. Now, we are waiting just six months to start that, right? That we can use the retention, right. Let’s put the 20% of data in Influx, and the rest of the data is available in the data lake. And which can be used for a variety of other things.

Rajeev Tomer 00:12:00.686 One of the most used use cases on the lake and using InfluxDB is about machine learning. So how the model works is that since the data is copied to the lake in the same format as InfluxDB uses, our analysts and data scientists can develop a machine-learning model. For example, to forecast something. And they will develop that machine-learning model using the data lake data, right. Not directly going to InfluxDB. Now, when the model is ready, right, you can train the model by using the infinite history available in the data lake. So you can clean it, right? Once it is done, it’s regular, right? It goes to the model governance, right? And then the algorithm is ready to execute. And that algorithm will apply in real time on InfluxDB. So that machine-learning model is deployed, I mean, now on top of real-time Influx data. And you know that InfluxDB is a high-speed read and write database, right. So think of it. The data is being written in real-time, you can read in real-time, and when you’re reading it, you can apply your machine-learning model. So, in real-time, you can forecast. You can detect anomalies. You can do those kinds of things. So that’s how the whole ecosystem is working, but besides that, I mean, what we are getting out of it is that definitely we are solving for high data retention problems. Now, our system is much more manageable. With that, I think the next challenge is still the instability of our DR solution which is-yeah, Krisha will talk about it.
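
The train-on-the-lake, apply-in-real-time pattern Rajeev describes can be illustrated with a deliberately small Python sketch. The talk does not say which models the team uses, so a simple z-score anomaly detector stands in here as an assumption; the names `train` and `is_anomaly` and the sample values are hypothetical, and the historical and live values are plain lists rather than actual data lake or InfluxDB reads.

```python
from statistics import mean, pstdev
from typing import Iterable, Tuple

# A "model" here is just (mean, standard deviation) of the history.
Model = Tuple[float, float]


def train(history: Iterable[float]) -> Model:
    """Fit the stand-in model on long-term history (the data lake side)."""
    values = list(history)
    return mean(values), pstdev(values)


def is_anomaly(value: float, model: Model, threshold: float = 3.0) -> bool:
    """Score one freshly written point (the real-time InfluxDB side)."""
    mu, sigma = model
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold


# Hypothetical values: history exported to the lake, then two live points.
lake_history = [10, 11, 9, 10, 10, 11, 9, 10]
model = train(lake_history)
flags = [is_anomaly(v, model) for v in (10.5, 20.0)]
```

The point of the split is that the expensive step (`train`) only ever touches the long-retention copy, while the cheap step (`is_anomaly`) runs against recent points, which matches the short retention kept in InfluxDB.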

Saravanan Krisha Krishnaraju 00:13:55.844 Hi everyone, thanks, Rajeev, for setting up the stage with an opening, addressing all the data models talk. So I’m going to talk about the backup restore issue that we have: why it couldn’t be a good backup and DR solution and what did we do to overcome that? And then we’ll go over some architecture diagrams, how we solved it, and then Karl will come back and walk us through some code, pseudo-code, and how we did the actual implementation. So with that, I want to address, what are the challenges that we have with the backup and restore. Like you mentioned, if the data grows, the time it takes to backup and restore grows as well, right? I mean, on the surface it seems to be working fine, yes. But a couple of the challenges we had is not just the time it takes to backup and restore, but the increment-there is nothing called an incremental cluster, so you have to build an empty cluster and do a full restore, right? So that’s another challenge depending upon the data you have. It’s going to take anywhere from hours-in a few hours to-depending on the data, sort of, right? And also another challenge we had was the specific version. If you were on 1.5.2 and below, we had some issues. There were times where the backup and, actually, the restore, would fail due to anti-entropy, that’s our [inaudible] starting 1.6.2. So we had to turn off the anti-entropy to make a good backup and restore. So those are some operational challenges. So that-we did overcome initially and we had a setup and we did a dry run and everything seemed to be working fine.

Saravanan Krisha Krishnaraju 00:15:51.085 But it is still not a stable DR solution. We were betting on, “But hey, the backup should work fine.” A full restoration would work fine and the only time you know-you’re going to be asked-we have extracted now, been evaluating our DR solution every three to six months, and then you’ve got to wait until your next exercise to make sure you restore both sites. So it works if you’re a small shop with a small amount of data and all those things, right. So what we did is-we did a process solution to address this, right? So that’s what I’m going to talk about in the next couple of slides. Okay. So what we added is-we replaced the backup/restore with an Influx export/import, leveraging the AWS S3 solution to move the exported data files from the primary site to the DR site. And if you were to notice here, the DR site is not standby anymore, or not an empty cluster anymore. It is a working cluster now, but it just sits there and is just a-it’s available to import the data from the primary site, okay?

Saravanan Krisha Krishnaraju 00:17:18.986 We’re controlling who can read and write via a load balancer. So all our read/write directly goes to the primary site, that’s why you see the dotted line to the DR site. So this gives us the ability to make sure our DR side of the cluster-the DR side is available, and we can have monitoring, on a real-time basis, so we’ll be comfortable saying that our DR site is functional, right? At any time we can switch, which we didn’t have before because of the standby or empty cluster: until you restore, right, you wouldn’t know if it was working or if it would be viable or not. But adding that feature, exporting all the databases, transporting via AWS S3 to the DR site and importing back, without that very unstable DR solution, and then we all can-it showed to be very robust right now while running on there for [inaudible]. We haven’t had any issues or anything. It’s really very solid right now. With that, I’m going to go to the next slide where we’re going to talk more about the architecture in detail, how we set up, right.

Saravanan Krisha Krishnaraju 00:18:36.870 So in this architecture diagram, there are some AWS components, so I’m just going to take a minute or so to explain some of the AWS resources that we use, for the benefit of people who are not familiar with AWS, right? Before getting into the actual import architecture. So, you see on top, we have a-what’s called a Route53, which acts like a DNS switch. That’s all. There’s a lot of devices out there, they have five connectors, they are a DNS switch. It’s something like that in aid of this [OI?]. And then we have a load balancer. It’s called Elastic Load Balancer, some of us always-for a regular Load Balancer. And then the other one I want to talk about is AWS S3. It’s Simple Storage Service. It’s an object-based storage service which you can access from your compute or any of the AWS services to store your data and retrieve it. And if you are new to AWS, this is one of the services that AWS started off with ten years ago. This was the service, and then we have all kinds of reasons not only for us but for any customers to use AWS to totally rely on this. I mean, if you look at-it’s been mentioned over the last 10 years. It has the highest availability, reliability and durability across the [inaudible]. It’s one of the services that we trust a lot, not just us-many of our [inaudible] database customers.

Saravanan Krisha Krishnaraju 00:20:20.862 So we leverage all of those components to build our architecture. So what you see in this picture is on the right side, you see a region one and then a region two, with AWS-they have data centers that are across the continental United States and across the globe. So the region refers to a geographical location where we have multiple zones. Zones kind of refer to physical data centers. And if you look at physical data centers as physical buildings, each zone or availability zone is separated-totally separated by building, and power, and other infrastructure components, right? And they are connected with low-latency lines, so you have very fast response times across them. So, the way we architected Influx is-so being Capital One, being a bank and a credit card company, we need to have high availability for all our customers. Our customers are primarily NOC-type customers-support personnel, they sit in a big room and then watch all kinds of Grafana dashboards for the industry metrics and whatnot.

Saravanan Krisha Krishnaraju 00:21:59.618 So what we did is we have-the replication factor is three, and that’s why you see three data nodes accompanied with three meta nodes, and the three data nodes are put in different availability zones leveraging the AWS infrastructure, so that even if you lose two AWS nodes we have still got one, and our customers will still be able to read/write, okay. And also we have an admin, home-grown admin tool-I’m going to leave it to Karl to come back and talk about that, which we use for all DB admin stuff. We have the same setup on the DR side region, the exact same setup. Three data nodes, and three meta nodes, which form the cluster. Okay? Now I’m going to talk about the data flow, the flow of the data and how the import/export and everything happens. So, number one is all the traffic is routed to Region 1. So the DNS switch or Route53-we have built in some logic so at any given time all the traffic will be automatically routed to the load balancer that’s under Region 1. And then number two is-we have written code, an export code that leverages the influx_inspect command to export the data from all the databases and puts it in an S3 bucket. Okay?

Saravanan Krisha Krishnaraju 00:23:48.834 And the code flow, so we leverage one of AWS’ current features, what’s called cross-region replication. So this is now a very strong feature within AWS. What it does is it takes the objects, whatever you put, in real-time-moves them to the other region, whatever you want to do. So that’s why-so what we do is we leverage AWS features to transport the data, right? So that it is totally off our side and is taken care of by AWS. The last one is we have another code which runs on the DR site which imports-which reads all the files and imports that data into this cluster. So with this, you know, it goes on a cycle, like we have scheduled it and it runs every 15 minutes. So, at any given time, our DR region-the cluster in the DR region is like 15 minutes behind the primary. This is well within our established SLA. So, like any other company, we have the risk [inaudible] team that defines what is the SLA and what is the recommended time and all those. And it’s very well within all regulatory times. And also we have plans to kind of optimize this core group.
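
Because the DR cluster is live, the "about 15 minutes behind" claim can be checked continuously rather than only during a DR exercise. The following Python sketch shows one way such a check might look. It is an assumption, not the team's monitoring: `replication_lag`, `lag_status`, and the tolerance of two cycles are hypothetical, and the two timestamps would in practice come from querying the newest point in each cluster.

```python
from datetime import datetime, timedelta, timezone

CYCLE = timedelta(minutes=15)  # export/import cycle mentioned in the talk


def replication_lag(primary_latest: datetime, dr_latest: datetime) -> timedelta:
    """How far the newest point in DR trails the newest point in primary."""
    return max(primary_latest - dr_latest, timedelta(0))


def lag_status(lag: timedelta, cycle: timedelta = CYCLE, tolerance: int = 2) -> str:
    """'ok' while DR is within `tolerance` cycles of primary, else 'behind'.

    The tolerance of two cycles is an illustrative threshold, not an SLA
    taken from the talk.
    """
    return "ok" if lag <= cycle * tolerance else "behind"


primary = datetime(2018, 10, 23, 15, 30, tzinfo=timezone.utc)
dr = datetime(2018, 10, 23, 15, 15, tzinfo=timezone.utc)
status = lag_status(replication_lag(primary, dr))
```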

Saravanan Krisha Krishnaraju 00:25:26.352 Okay. With that, I’m going to go to the next slide where we see here-what happens when we lose the primary region? What happens? Okay. So it’s exactly the same, all we’re going to do is reverse what we did when Region 1 was active. Right? Number one, the traffic will be routed to Region 2 this time. Right? And everything-this is assuming that, right after that, Region 1 is back up, all right. It could be any kind of disaster. It can be the net goes down or that we’re not able to get to Region 1 depending upon what alert. As soon as Region 1, which now has become the DR, comes up, we’ll start the same flow that we were doing over here when Region 1 was active. I believe you get the idea, so when we come to the Q&A, you can ask questions that relate to that. With that, I’m going to hand over to Karl-
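
The role reversal Krisha describes can be written down as a small piece of decision logic. This Python sketch is only an illustration under assumptions: the region labels, `choose_active`, and `data_flow` are hypothetical names, and the real routing decision lives in Route53 and the team's own logic, which the talk does not show.

```python
from typing import Dict

REGIONS = ("region-1", "region-2")


def standby_of(active: str) -> str:
    """Return the other region of the pair."""
    return REGIONS[1] if active == REGIONS[0] else REGIONS[0]


def choose_active(current: str, healthy: Dict[str, bool]) -> str:
    """Stay on the current region while it is healthy, otherwise fail over.

    There is no automatic fail-back: a recovered region simply becomes
    the new DR site, as described in the talk.
    """
    if healthy.get(current, False):
        return current
    other = standby_of(current)
    if healthy.get(other, False):
        return other
    raise RuntimeError("no healthy region available")


def data_flow(active: str) -> Dict[str, str]:
    """Traffic and export follow the active region; import runs on the other."""
    return {
        "traffic_to": active,
        "export_from": active,
        "import_into": standby_of(active),
    }


# Region 1 is lost, so Region 2 becomes active and the flow reverses.
active = choose_active("region-1", {"region-1": False, "region-2": True})
flow = data_flow(active)
```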

Karl Daman 00:26:34.128 Hey, this is Karl again. So I’m going to talk some more about details of the DR script that we developed as well as some other things. So I wanted to mention here that this admin node that Krisha was talking about-so it is very important to monitor your Influx cluster. If you’re going to use Telegraf to monitor it, make sure that the metrics it collects go to an open source Influx. Okay? Not the one that you’re monitoring. Okay? Very, very important. So that’s what we have done here. We have an open source Influx on the admin node. And the metrics collected by the Telegraf agent go to that node. So that we can monitor it with other tools including-we have Chronograf on there too and also we have a custom admin tool written in code that has a lot more features than Chronograf. And we use that heavily to create databases and view data and a number of other administrative activities. And also that is DR-enabled too so that both of those are in sync so that we can switch to either one and get the metrics from the cluster. And it’s used to monitor Influx operating.

Karl Daman 00:27:57.382 So going to the next slide here. So I want to go into detail about the export/import script that we wrote. So what you see here is pseudo-code, okay, not the real code. We have a lot of monitoring and a lot more complexity in the actual code. But just here we simplify it so it makes it easy to show. So, first of all, we have an export script that is running on the Region 1 data node. Okay. It’s just one of the data nodes, okay, because we have replication three. So all the data is replicated and running the [inaudible] command you can see all the data is replicated. As long as it’s replicated then you can use one of the nodes. So we use one and we set this duration. So you see the time there, it says, “Duration equals 1800” so in this example it’s one hour. But we can do 30 minutes to-we can do anything really. So you put the duration in there and that’s how often it’s going to run. So every time it runs, it’s going to set the STARTTIME equal to the last ENDTIME. So that it always has a block that it operated on. And then the ENDTIME there is just going to be STARTTIME plus duration so it’s how much more of a window did it grab. Okay. And the first operation it does is just SHOW DATABASES. So we want to get all the databases. We want to collect the files from each database individually instead of one big file. So it just does that and then the next thing is a for loop, it goes through each of the databases in the list and it runs Influx Inspect. Okay. So that is a tool that comes with Influx and it allows you to get the data out in a text file in this raw format. So it looks like line protocol with a few comment characters and different blocks in there. But it is easy to use and it works so we used it here.
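
The window bookkeeping Karl walks through (STARTTIME is the previous ENDTIME, ENDTIME is STARTTIME plus the duration) can be sketched in Python as follows. This is a minimal illustration, not the team's script: `next_window` and `rfc3339` are hypothetical names, and the duration is assumed to be in seconds, in which case 1800 corresponds to 30 minutes.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple

# Assumption: the slide's "Duration equals 1800" is in seconds (30 minutes).
DURATION = timedelta(seconds=1800)


def next_window(last_end: datetime,
                duration: timedelta = DURATION) -> Tuple[datetime, datetime]:
    """Return (start, end) for the next export run.

    Starting exactly where the previous run ended keeps the windows
    contiguous, so no points fall between two exports.
    """
    start = last_end
    end = start + duration
    return start, end


def rfc3339(ts: datetime) -> str:
    """Format a timestamp as an RFC3339 UTC string for use on a command line."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


# Two consecutive runs: the second starts where the first stopped.
first_start, first_end = next_window(datetime(2018, 10, 23, 15, 0, tzinfo=timezone.utc))
second_start, second_end = next_window(first_end)
```

Persisting the last end time only after a run has fully succeeded means a failed run is retried over the same window on the next pass.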

Karl Daman 00:30:07.329 We extract each database and we use the -compress option so that it’s smaller. These files are pretty big if you don’t use -compress, and then you just put the STARTTIME, the ENDTIME and then the-where are you going to put the file? So it runs and then it sits on that for a while and then it completes. The next thing it does is put the output file into S3. So it’s going to take some time to do that too, and then once it’s done, it does it for all the databases. And then it just updates the LASTENDTIME with the ENDTIME. So we have-this file goes into an S3 bucket that is using the AWS bucket replication thing, so that whatever files you put in it show up in the same named bucket or a different bucket in another region. So over in Region 2, we have the other cluster running, the DR cluster running. It doesn’t take any data. It doesn’t take customer data, or client data, or anything. So it’s just receiving the data from this script. So on one of the data nodes in Region 2, we have the import script running, and it runs repeatedly and it’s going to put the data in the cluster. So the first thing it does is it’s going to copy all the files from that bucket in Region 2 and put them in a local directory, okay. All the files, okay, because we’re kind of using S3 as a state transaction thing to hold whatever files are there that need to be imported. Okay?
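
For the export side, the per-database command and the upload to S3 might be assembled roughly like this in Python. It is a sketch under assumptions rather than the team's code: the data and WAL directories are the usual Linux package defaults for InfluxDB 1.x and may differ on a given node, and the bucket name, object key, and function names are hypothetical. The functions only build argument lists; running them (for example with `subprocess.run`) is left to the caller.

```python
from typing import List


def export_command(database: str, start: str, end: str, out_path: str,
                   data_dir: str = "/var/lib/influxdb/data",
                   wal_dir: str = "/var/lib/influxdb/wal") -> List[str]:
    """Build an `influx_inspect export` call for one database and one window.

    `start` and `end` are RFC3339 timestamps; -compress gzips the output.
    """
    return [
        "influx_inspect", "export",
        "-database", database,
        "-datadir", data_dir,
        "-waldir", wal_dir,
        "-start", start,
        "-end", end,
        "-compress",
        "-out", out_path,
    ]


def upload_command(local_path: str, bucket: str, key: str) -> List[str]:
    """Build an `aws s3 cp` call that copies the export file into the bucket."""
    return ["aws", "s3", "cp", local_path, f"s3://{bucket}/{key}"]


# Hypothetical database, window, and bucket names for illustration.
cmd = export_command("telegraf", "2018-10-23T15:00:00Z",
                     "2018-10-23T15:30:00Z", "/tmp/telegraf.gz")
up = upload_command("/tmp/telegraf.gz", "example-dr-bucket",
                    "telegraf/2018-10-23T15-30-00Z.gz")
```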

Karl Daman 00:31:55.343 So, and then, we just get the list of files and then go through a for loop, and it just runs the import command. So influx -import will allow you to insert data directly into the database and it will take that output of the [inaudible]. So it’s simply line protocol so it’s easy to read the file if you had to. So then it just deletes the file in S3 and that is the command that terminates the state of the backup there. So that we don’t do it again or anything, and then it exits a loop, and it just does this over and over again. These run continuously and allow us to have a resiliency of our region of about 30 minutes. So if there is a disaster then we can switch our clients to the other region, Region 2, and then it would be 30 minutes behind or whatever the timeline would be set, but 30 minutes is a good baseline. So, and this does work. We have it working well. And, so in our disaster recovery exercise, we can recover in the second region.
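
The import loop, with the bucket acting as the list of work still to be done, can be sketched in Python as below. The S3 and shell interactions are passed in as functions so the control flow is visible on its own; `run_import_cycle` and its parameters are hypothetical names. One design choice here is an assumption rather than something stated in the talk: a file is deleted from the bucket only after its import succeeds, and the cycle stops at the first failure so that later files are not applied out of order.

```python
from typing import Callable, Iterable, List


def import_command(path: str) -> List[str]:
    """Build an `influx -import` call for one gzip-compressed export file."""
    return ["influx", "-import", f"-path={path}", "-compressed"]


def run_import_cycle(list_files: Callable[[], Iterable[str]],
                     fetch: Callable[[str], str],
                     run: Callable[[List[str]], bool],
                     delete: Callable[[str], None]) -> List[str]:
    """Import every pending file once and return the keys that were imported.

    list_files: returns the object keys currently in the DR-side bucket
    fetch:      downloads a key and returns the local file path
    run:        executes a command and returns True on success
    delete:     removes a key from the bucket, marking it as done
    """
    imported = []
    # Assumes key names sort chronologically, so older windows go first.
    for key in sorted(list_files()):
        local_path = fetch(key)
        if not run(import_command(local_path)):
            break  # leave this file and any later ones for the next cycle
        delete(key)
        imported.append(key)
    return imported
```

Since an object remaining in the bucket means "not yet imported", a crashed or failed cycle needs no separate bookkeeping; the next run simply picks up whatever is still there.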

Karl Daman 00:33:18.899 So, another thing we want to talk about is the performance monitoring. I’m moving to the next slide. Okay? So we use Influx for collecting performance metrics on a number of different things. Here we see an example of the metrics on itself, okay? So we have Chronograf collecting metrics on Influx, but going to a different Influx so we can monitor it without-if it has an issue, we can still monitor it. So in this example you just see something in Chronograf that would show us a graph. You can use Grafana, as well, it’s just Influx data. But, you’ve got Cardinality, HTTP Requests, Query Requests, Client Failures, and these are per minute. So this works very, very well for monitoring performance metrics. So you can have this Telegraf agent installed on many, many client servers and collect this data. So this is just one example, but we have other graphs on other things, such as the other topics we covered. Okay. So I will pass it back to Krisha, and-

Saravanan Krisha Krishnaraju 00:34:45.779 Yup. Thank you, Karl, that was very good and [inaudible] the session we have so far. Chris, with that, we can up for Q&A at this time.

Chris Churilo 00:34:56.192 That sounds great. So, if anybody on the call has a question, raise your hand or you can go into the Zoom app, there should be a Q&A button or a chat button, the icons in the application, just tap on that and then you can type in your questions or you can raise your hand. And I think we have enough time where I can also unmute lines and have you speak your questions directly to our friends here at Capital One. So, in the meantime, I really love that you are monitoring your monitoring solution, Karl. I think that’s pretty important for people to understand that, you want to make sure that, not only is it going to be resilient with the DR infrastructure that you’ve put in place but you also want to make sure that your monitor is also going to be, active as well. So I appreciate that you shared that with us. So, if you guys were to just, kind of, take a step back and review what you guys just presented, and if you could go back magically to the beginning of the project, what advice would you give to yourself that you wish you knew from the start that might have sped things along a little bit faster or maybe would’ve helped you take a different approach?

Saravanan Krisha Krishnaraju 00:36:13.890 This is Krisha, I’ll take that. I’ll take a stab at it. Definitely the backup restore. If we had known we’d have the challenges up front, if we had that knowledge we would have definitely, honestly, quite quickly and went with something else. We had a little bit of time crunch where we had to-what we learned, right, like I mentioned before, on the surface it looked like not a bad solution, right? It’s still not a bad solution, it works for a smaller shop or a dev, or a non-product environment, right? But, even for our volume, we were able to fully do it. With the dry run and everything it seemed to be checked out for us, but for the period of time, all right, when we had more data, then it didn’t work well. But, trust me, we did plan that and we did size and we did plan all those things, even with archiving and everything. But still, it didn’t help. So I believe If we had known about the [inaudible] with the backup would have definitely helped us quickly. Luckily we didn’t have any incidents. So we were lucky, so I wouldn’t brag too much on the luck. Rajeev, or Karl, want to add any other stuff?

Rajeev Tomer 00:37:34.465 I think you can fill needs definitely around backup and restore. It’s just I think that there are limitations around using backup and restore for the DR. That’s kind of a main thing, right. We invest in [inaudible] area capability. We need the capability. But at the same time we also-what we didn’t know is that the backup and restore is probably a good pick for a primary, not for the DR. We have to restore all the data under the SLA, right? And how do you manage SLA, over kind of value [inaudible] loss? So I think it’s just the capability perspective, where it really fits, right? And then again, I think some of the things was with what was surrounding technology capabilities you have with you. And being in the AWS, we have different capabilities where we could support something like that Influx export/import data by using S3 and other kinds of things. So the whole technology solution is more available to Influx import/export for DR versus backup and restore. And Influx Inspect took us a while, actually, so we followed the same data center kind of approach, backup and restore, and all those things. Then we soon realized, well, maybe we have a better option. So, yeah, that’s about it, please.

Karl Daman 00:39:21.707 I wanted to add another learning thing. So, like I mentioned, starting from the beginning, make sure that you monitor the thing and keep the monitors, not on the Influx that you’re monitoring. So, absolutely critical. We had a few failures that we were monitoring on the same cluster and that was not good. So just make sure you keep an open source one. And that one is simply for monitoring the cluster that you have the enterprise version on. So we got-I see a question in the chat there. I want to take that one. Do you query-?

Chris Churilo 00:40:03.361 Yeah, do you want to-please read the question out loud, thank you.

Karl Daman 00:40:08.911 Yeah, so it says, “Do you query all your data 400-day retention in one query in InfluxDB, or do you do that in AWS?” Okay, so we have a few clients that have 400-day retention policies. Generally, that is just to keep the data. They don’t query it all. Nobody queries it all at once, because that could actually put too much load on Influx if you do that. So you don’t ever want to query huge windows like that. So what they do is they query windows like a week at a time, and then they move the window and query another week. And then they do that in their machine learning program. They can query windows. Another team also does query in AWS, and the one lake thing that we mentioned before is the data is all exported into the one lake, permanently. So, if it ages out of Influx, or they need a great amount of data all at once, then they can get it from AWS. And that would be queried with a different tool completely.
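
[Editor’s sketch: the week-at-a-time pattern Karl describes could look roughly like the Python below. This is an illustration only, not Capital One’s code. The measurement name "cpu", the helper names weekly_windows and build_query, and the choice of SELECT * are all assumptions made for the example; the only things taken from the talk are the 400-day retention and the one-week window. It only builds time-bounded InfluxQL query strings and leaves sending them to whatever client is in use.]

```python
from datetime import datetime, timedelta, timezone


def weekly_windows(end, retention_days=400, window_days=7):
    """Yield contiguous (start, stop) pairs covering the retention period, oldest first.

    Hypothetical helper: the last window is shortened so it never runs past `end`.
    """
    start = end - timedelta(days=retention_days)
    step = timedelta(days=window_days)
    while start < end:
        stop = min(start + step, end)
        yield start, stop
        start = stop


def build_query(measurement, start, stop):
    """Build one time-bounded InfluxQL query for a single window.

    Hypothetical helper: the measurement name comes from the caller and
    SELECT * is used only to keep the example short.
    """
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return (
        f'SELECT * FROM "{measurement}" '
        f"WHERE time >= '{start.strftime(fmt)}' AND time < '{stop.strftime(fmt)}'"
    )


if __name__ == "__main__":
    # Assumed end of the range; in practice this would be "now".
    end = datetime(2018, 10, 23, tzinfo=timezone.utc)
    for start, stop in weekly_windows(end):
        # Each query touches at most one week of data, so no single
        # request asks InfluxDB to scan the whole 400-day retention.
        print(build_query("cpu", start, stop))
```

[Each window is half-open (start inclusive, stop exclusive), so consecutive queries neither overlap nor skip points.]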

Rajeev Tomer 00:41:24.286 Yes. I would like to add to it. I mean it’s a good question here, right. As a service provider, right, we cannot control, right, what they will query. Yes, a client can query for 400 days, right? They can issue a query for 400 days, right? The problem between first [inaudible], right, when the client queries InfluxDB, they query as if it were a relational database, right. So you do not use the time on that one. So you are querying something different, right? You’re using some parameter other than the time parameter, which is very much taxing to InfluxDB. And that’s where our lake solution helps us, right, because it’s so good. If you don’t want to query this data by time, and want to use some of the information, right. And you want to use the kind of heavy long retention, right, and kind of that data, right? Then go to lake. We have the capability in lake where you can query this data. So that’s helped us, right? We have helped people. Hey, we have data lake as well, you can query this kind of information from it.

Chris Churilo 00:42:37.153 I think that really makes sense. I love that idea because yeah, instead of hampering your users, let’s just try to accommodate. What are they trying to accomplish? And then just point them to the right way to do that. If you want to look at all that data over the 400-day period, just go query against the lake. If you can come up with a skinnier, time-bound query, then yeah. Go crazy in InfluxDB. So I think that’s a really good approach for your users for sure.

Rajeev Tomer 00:43:07.871 Yeah. One thing we have in common in Capital One culture is we are very much on a fit-for-purpose thing, right. So the solution should be fit for a purpose. If you want to do this kind of thing, now InfluxDB is the best for it, right. And if you want to ask this kind of question, probably lake or a particular host is a better place, right? And so we go and get the [inaudible] around that.

Chris Churilo 00:43:29.962 Cool. And then Manujit agrees. He says, “Awesome. Thanks. Lake is a great idea.” So, I think he appreciates that as well. All right. So we still have a few minutes left. If you do have any other questions, please feel free to post them in the chat or the Q&A. Or if you want to speak out loud, just raise your hand. I can unmute the lines right now and sometimes doing that we can open up the conversation further. All right. So we do have a question from Bruce. And Bruce asked, “What version of Influx are you using?”

Karl Daman 00:44:05.923 We’re using the Enterprise Influx, 1.6.2 on all the nodes. Generally, we want to keep it up to date as it comes out. So I think there’s a newer version of [inaudible].

Chris Churilo 00:44:20.743 All right, Bruce. Let us know if that answers your question. And maybe you can talk a little bit more about your machine learning framework. So, what are you using underneath?

Rajeev Tomer 00:44:37.660 So, I think I said that we are the service provider. We manage time series data. Our client needs that data and there’s a different growth site inside Capital One. But for a different kind of analytics, right. Using different channels, whether what they do is kind of operational analysis, or advanced statistical analysis, or analysis using machine learning as a mechanism. So it’s primarily in that area, but I can speak a little bit about it, right. Currently, Spark MLlib, the Spark Machine Learning (ML) library, and [inaudible], the primary technology which people use to perform machine learning.

Chris Churilo 00:45:31.984 Excellent. We’ll keep the lines open for just a few more minutes. So I just want to know, what is the stance of Capital One when it comes to open source software?

Rajeev Tomer 00:45:46.988 We are open source first organization. That’s the first thing we see if, okay, if we ever need can be met by open source. I mean, across the organization we use open source, right? We use open source and at times we open DB backups. But it is-

Chris Churilo 00:46:13.237 Awesome.

Rajeev Tomer 00:46:13.734 -very clear strategy for the last three or four years. And so it’s kind of [inaudible].

Chris Churilo 00:46:19.971 That’s great to hear because obviously we have-

Rajeev Tomer 00:46:21.575 We have the means and we have found that it has very good advantages, right. So with using open source, because then we do not have to wait for new capabilities when the vendor provides them. We are an engineering organization. We would like to, if the capability is not there in the market, if the tool doesn’t provide, we would like to engineer our solutions, right. On top of the tool, and open source really helps us do that.

Chris Churilo 00:46:57.234 I think that’s a great answer and I think people who, especially, are potentially looking for a job and may be interested in coming to Capital One, I think those are probably words that they want to hear. That engineering-led organization and that open source is pretty important. I think-

Rajeev Tomer 00:47:12.684 Pardon?

Chris Churilo 00:47:13.518 -more and more potential employees are looking for companies with that kind of a mindset.

Rajeev Tomer 00:47:19.846 Yes, we are hiring. My group is hiring. Really means if you like to innovate, right. If you like freedom to do things right and new ways, then, yeah, Capital One is a great place, right. We are technology company first and bank later.

Saravanan Krisha Krishnaraju 00:47:44.039 Yes, absolutely, I want to add to that, Rajeev. We are a technology company. We embrace all kinds of technology. You have like a Google or Apple-type environment here. Not just the physical appearance but the tool types are the openness. You have an idea? Yeah, you can come and-you can directly influence if you have a technical idea-you can influence and you can get it done. It’s a very good place to work if you’re a technical guy or girl [laughter].

Chris Churilo 00:48:15.717 Excellent. All right, any last words to share with our listeners today?

Rajeev Tomer 00:48:24.687 Yeah, [inaudible]. Thank you for this meeting and we wanted to share our experience, our journey, with you folks. As we face a few problems in the beginning how we solve it. And if you are on the same boat you don’t have to repeat the same things and come to the final solution after some hiccups, so. [Handy?], yes.

Chris Churilo 00:48:50.355 All right, so if you guys do have any questions afterward don’t worry. Just send me an email. You guys have my email address, so. And I know I’ve done this in the past. So I will be forwarding any of your questions over. I will be editing this video and then I’ll post it so we can take another listen to it. And you will get an automated email with that information tomorrow morning. But you can also use the link that you have today for registration and once I get that edited, then we’ll post it up there. I want to thank our friends at Capital One today. I think it was very informative. I love that you shared your journey and talked about the ups and downs of the technology. I think that it’s important that we share these things with each other so that we can make sure that the solutions that we’re working on are going to be more robust. And, really, it really kind of points to the spirit of the open source community about collaboration and making sure that we all work together to make the best solution out there. All right, with that.

Rajeev Tomer 00:49:51.647 And thank you [crosstalk].

Chris Churilo 00:49:52.910 I want to thank you guys once again and thanks to our audience. I look forward to speaking with everyone again and I want to hear about all those really great projects that you guys are working on. Thanks, everyone and have a fabulous day.

Rajeev Tomer 00:50:09.557 Thank you.

[/et_pb_toggle]
