Spacecrafts Made Simple: How Loft Orbital Delivers Unparalleled Speed-to-Space and Reliability Using Telegraf, InfluxDB Cloud and Google Cloud
Webinar Date: 2021-07-27 08:00:00 (Pacific Time)
Today, access to space requires custom engineering, driving up costs, causing unpredictable schedule delays, and increasing risk. Loft Orbital is changing that.
Loft Orbital flies and operates customer payloads on its microsatellites as a service. Companies turn to Loft Orbital when they want to focus on their end use, with Loft Orbital operating their satellites using its mission-agnostic, flexible operating system and interfacing technology. Loft Orbital’s Payload Hub technology provides clients with a modular payload adapter which can fly any payload on identical, commodity satellite buses it keeps in inventory, while Cockpit, its mission control system, is used to operate all customer missions as a single constellation. By standardizing this technology, Loft Orbital has been able to deliver unparalleled speed-to-space without sacrificing reliability. Discover how Loft Orbital uses Telegraf, InfluxDB Cloud and Google Cloud to collect and store IoT sensor data from their equipment, including spacecraft!
In this webinar, Caleb MacLachlan will dive into:
- Loft Orbital’s approach to QA-ing their code and enabling better performance monitoring
- Their methodology for monitoring their infrastructure, including servers and containers
- How a time series platform empowers long-term trend analysis
Watch the Webinar
Watch the webinar “Spacecrafts Made Simple: How Loft Orbital Delivers Unparalleled Speed-to-Space and Reliability Using Telegraf, InfluxDB Cloud and Google Cloud” by filling out the form and clicking on the Watch Webinar button on the right. This will open the recording.
Here is an unedited transcript of the webinar “Spacecrafts Made Simple: How Loft Orbital Delivers Unparalleled Speed-to-Space and Reliability Using Telegraf, InfluxDB Cloud and Google Cloud”. This is provided for those who prefer to read rather than watch the webinar. Please note that the transcript is raw. We apologize for any transcription errors.
- Caitlin Croft: Customer Marketing Manager, InfluxData
- Caleb MacLachlan: Senior Spacecraft Operations Software Engineer, Loft Orbital
Caitlin Croft: 00:00:01.091 Once again, hello, everyone, and welcome to today’s webinar. My name’s Caitlin Croft. I work here at InfluxData, and I’m very excited to have Caleb joining us from Loft Orbital. He will be talking about how they are using InfluxDB. And as a bit of a space nerd myself, it’s kind of fun seeing companies like Loft Orbital using InfluxDB. Once again, please pose any questions you may have for Caleb in the chat or the Q&A. We will be monitoring both. The session is being recorded and will be made available later today. All right, without further ado, I’m going to hand things off to Caleb. Are you ready?
Caleb MacLachlan: 00:00:45.013 I am ready. Can you hear me okay?
Caitlin Croft: 00:00:47.280 You sound great.
Caleb MacLachlan: 00:00:48.681 Perfect. Thanks, Caitlin. My name is Caleb MacLachlan, and I’m a software and spacecraft operations engineer at Loft Orbital. I’ve been writing software to control satellites and rockets for most of my career, and recent organizations that I’ve been a part of have made key use of InfluxDB to do so. So today, I’m going to talk about how we use Influx at Loft. I’ll start with an introduction on us, who we are, and then cover why we chose Influx, how we implemented it for our time series data, as well as some of the challenges that we had as we were implementing it. And at the end, as Caitlin said, we should have plenty of time for questions and discussion. So before I get into who we are at Loft, I want to give a little background on the space industry. Traditionally, when you’re building a satellite — and by the way, you’ll hear me use the word spacecraft today. We’re using those interchangeably — it’s an incredibly expensive proposition. You need to design a bus, which is the part of the satellite that keeps it alive in space. And then you need to find a launch, which by itself is going to cost millions of dollars. And then once you get to space, that’s just kind of the start of the cost of having a satellite, because you need to have a 24-hour-a-day crew to support the spacecraft operations, and you need experts in attitude control, remote ground stations, orbital dynamics, thermodynamics. It’s just a really big support group that you need once you have a satellite on orbit. What this means is that access to space is generally reserved for large, well-funded companies and national governments. But we’re starting to see a massive change in the industry.
Caleb MacLachlan: 00:02:35.297 Launches are getting cheaper thanks to innovators like Rocket Lab and SpaceX. And at the same time, technological advances are enabling the creation of exciting new payloads, doing things like climate research, crypto and space laser communications, and computer vision on orbit. So what we’re doing, our mission, is to take all the complexity out of that equation and provide space as a service. A startup or a multinational corporation or government can put all their resources and energy into what they really care about, which is their payload and their mission, be it a camera or a laser communication device or whatever it is. They can focus on that and let us focus on what we’re good at, which is the space side of things. And so, yeah, as it says here, we’re a Series A company. So we’re still pretty young. We’ve raised $20 million in capital, so we’re really kind of ramping at this point. So what we do to make this possible is take payloads from several customers, mix and match them based on what they need, and put them into what we call our payload hub. That’s our hardware and software platform that allows us to share spacecraft resources between customers. So before, a customer would have been required to build their own bus. The bus, again, you can think of it like, if it was a car, it would be the power train, the engine and the fuel tank and all that. It’s basically the solar panel, the way that it steers, the communications antennas. Before, they would have had to have their own bus. We share a bus between several customers by plugging them all into this payload hub, which is kind of like an adapter. We match our customers based on their needs.
So maybe one customer needs a lot of power, and another customer doesn’t need that much power but needs to be on all the time, so we can match what those different customers need and package them together into one box that we can integrate with the spacecraft bus.
Caleb MacLachlan: 00:04:41.057 And we have multiple different providers that sell us these buses, we don’t build them ourselves, which kind of cuts down on our cost. But you can imagine, maybe if a bus is a power train, we could choose the power train from a Tesla Model 3, or we could choose the power train from a Ford F-150, depending on the needs, right. So we have multiple bus providers that we put our standard payload hub onto, depending on what the individual payloads need. And payloads being the cameras, antennas, whatever it is, crypto mining in space. Whatever it is, it all lives in that payload hub. And a big part of what we do is we operate it for them as well, so they don’t need to know anything about orbital dynamics or ground stations or any of that kind of stuff. We just handle all of that for them, and that means they can have a really short timeline, because we have scheduled launches and pre-procured buses. They can have a really short timeline from saying, “Hey, I want to go to space,” to actually having a satellite on orbit with their payload, doing their thing, in a matter of just a couple of months instead of a couple of years, with a tiny fraction of the capital expenditure they would have had to have before. So, for one example, this is our first — the first mission that we put together. It’s called YAM2. YAM stands for Yet Another Mission, which comes from the idea that we hope that we can just crank these out at a really rapid clip. As time goes on, we just want to get faster and faster with cranking these out, and they all have different things on them. But the idea is it should be mostly abstracted away from our perspective due to the standard spacecraft buses.
Caleb MacLachlan: 00:06:22.430 And you can kind of see that there’s two halves to this spacecraft. There’s this giant gold cube, that’s the payload hub. And then that silver part that you see on the bottom there is the spacecraft bus. You can’t see it from this angle, but on the other side of that would be the solar panel as well, which is how this spacecraft is generally powered. So similar, but different, is the YAM3 mission, which we developed in parallel. And you can see it also has that gold cube as part of it, but it looks a little different because it has different custom payloads on it. And the bus part, that lower silver box, is also totally different. So this is a completely different bus. The satellite is totally different, but our part of it is very similar. And actually, one of the payloads is shared between them; there’s two copies of it, one on each of our missions. And from the customer perspective, they don’t even know that — there’s nothing for them to see which bus their satellite is running on, even though those buses may use completely different protocols and operate in totally different ways. We’ve abstracted away all of that from the customer perspective.
Caleb MacLachlan: 00:07:32.053 So where are these missions now? Well, something pretty cool happened a little less than a month ago. They are in space, on SpaceX’s Transporter-2 mission, which is what you saw in the first slide on the launch pad. You can see YAM2 on the right and YAM3 on the left in those blue boxes. These images were taken shortly before they were both deployed off the rocket. So the spacecraft side is a really awesome part of what makes our company special, but that’s not all. A lot of our secret sauce is actually in our mission operations software. We’re pretty unique as a space company in that we don’t have a dedicated spacecraft operations team. And that’s because we rely really heavily on automation. There’s no way to scale to the level that we need to while maintaining completely manual operations. We also couldn’t manually interact with all of our customers, because there’s just too many customers with too many needs; if we had to have people handling every single customer need, we would just have to have a giant team. So we let them directly control their payload, to include maneuvering the spacecraft in certain situations within safety envelopes. So that’s the product that I work on and that’s what we’ll be talking about today. Cockpit is what we call our mission control system. Here’s our team. We’re the ones responsible for building the software we use to fly the spacecraft and giving our customers access as well. We’re also the primary ones responsible for actually flying the spacecraft in situations where we do need manual operations. And commissioning the spacecraft is one of those, like the initial checking it out on orbit, making sure it’s okay, turning everything on gradually and carefully, right. So we’re based in San Francisco, but we also have big offices in Colorado and France. This picture was taken while we were waiting to make contact with our first two spacecraft.
You can see our France team dialed in on Zoom. We all look pretty happy, but I have to say that’s some of the most nervous I’ve ever been. Fortunately, things went really well that day. I think one thing that we’re really proud of as well is that I don’t think a company has ever launched a custom satellite of our class with lower funding than we’ve raised. And we put two up in just a couple of years of operations, so we’re really proud of how far we’ve come. And I think the future looks pretty bright for us.
Caleb MacLachlan: 00:10:01.820 So now that you know a little about who we are, I want to go into why we chose InfluxDB. I’m going to start by talking about our needs, what we needed from a time series database, and the problems that we were hoping to solve. So the core technical challenge that we are trying to solve is the need for handling time series data for safely flying the spacecraft; that’s the number one thing. You can think of our satellites as fancy flying servers. So the need isn’t all that different from what a data center operator would need, except that the value of a satellite is measured in millions, they’re flying eight times faster than a bullet, and there’s no way to go up there and fix it if something goes wrong. So because of that, our spacecraft are just covered in sensors and streaming down telemetry, which is what we call the measurements from all those sensors and what the flight computer is telling us, at a super high rate. We need to be able to handle on the order of several hundred million measurements per day within the next few years to cover our expected fleet growth. In addition to that, we also need to be able to visualize this data in near real time in order to safely fly them and make critical decisions. We also need the data to be easily accessible to engineers and customers who are not software engineers. In addition to that, we also need to be able to zoom out and look at long-term data. So maybe I’m starting to see a temperature sensor is getting especially hot, or hotter than I’ve ever seen it before. I might want to be able to zoom out and see the last year of data all at once and see, is this a trend that’s periodic because of the seasons? Because winter is actually warmer in space; the northern hemisphere winter is warmer in space than the summer is. So I need to be able to trend on that long of a time scale without sacrificing performance.
I need to be able to do that kind of thing quickly, and as I said, there’s a huge amount of data. So we needed a solution that would let us easily do that.
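As a rough illustration of the long-term trending just described, zooming out over a year of telemetry relies on downsampling. This is a sketch, not Loft's actual pipeline — in practice a time series database does this server-side — but the windowed aggregation it performs looks like:

```python
from collections import defaultdict

def downsample(points, window_s):
    """Collapse (timestamp, value) telemetry into per-window means,
    the kind of rollup a time series database performs when you zoom
    out to months or years of data."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % window_s].append(value)  # key = window start
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}

# One day of per-second samples collapses to 24 hourly points.
raw = [(t, 20.0 + (t // 3600) % 2) for t in range(86400)]
hourly = downsample(raw, 3600)
```

Precomputing rollups like this is why querying a year of data can stay fast without scanning every raw sample.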
Caleb MacLachlan: 00:12:08.674 We also needed a solution that would make it super easy for our engineers to access the data, because basically, as I said, there’s just such a huge volume of it. And the easier it is for people who are involved with the spacecraft to see that data and get the data that they need quickly, the more safely we can operate. So one of the key pieces was availability and accessibility of the data. Finally, we needed a solution that would let us handle limit checking and alerting pretty easily. As I said, we’re an automated shop here, so we can’t have someone looking at every piece of telemetry all the time. We want to be able to get paged if something goes out of its range or is doing something unexpected, rather than having to have folks laser focused on actually looking at it. Another kind of unique need that we have is the ability to share our data securely with the people that need to see it. So as I said, on each of our payload hubs (this was our YAM2 payload hub here, I believe) we have several customers, and all of those customers need access to their telemetry. So usually their payload has its own little computer attached to it that is going to be recording how hot things are, how much current their payload’s using. It’s going to be reporting how many pictures it’s taken or packets sent or whatever, right. Whatever important metrics about their payload exist, they’re going to want to be able to see that. In some cases, they will need to see that in near real time as well, just like we do. But in any case, they need easy access to it, so we needed a way to make sure that we could efficiently and safely share that data with customers without compromising any other customers’ data on the spacecraft. We don’t want one customer seeing any data from another customer.
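The limit checking and alerting mentioned above boils down to range checks on each incoming point, the kind of thing an alerting layer like Kapacitor evaluates continuously. A minimal sketch (channel names and limits here are invented for illustration):

```python
def check_limits(sample, limits):
    """Return alert messages for any telemetry channel outside its
    (low, high) range. Channels with no configured limits pass."""
    alerts = []
    for channel, value in sample.items():
        low, high = limits.get(channel, (float("-inf"), float("inf")))
        if not (low <= value <= high):
            alerts.append(f"{channel}={value} outside [{low}, {high}]")
    return alerts

# Hypothetical limits table; a real one would cover thousands of channels.
limits = {"battery_temp_c": (-10.0, 45.0), "bus_voltage_v": (26.0, 33.6)}
alerts = check_limits({"battery_temp_c": 51.2, "bus_voltage_v": 28.1}, limits)
```

An out-of-range result would then be routed to a paging system rather than shown to a human watching a screen.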
Caleb MacLachlan: 00:14:13.414 In addition to that, as I said, we buy these spacecraft buses off the shelf from bus providers. And so those bus providers need to also be able to see the data that is produced by their half of the satellite if we need support from them, right. So we needed a solution that let us bring the bus providers in without them seeing any of our customers’ payload data, right. So that was one of our key use cases that we needed to solve, and we needed to do it in a way that is equally performant to how we get to see the data ourselves. A third need that we have, and use case, is monitoring the performance of our software. And part of why we need that is we’re flying our satellites in what’s called a low Earth orbit. That means they’re going around the Earth roughly every 90 minutes at seven kilometers per second. If you’re using satellite internet or, well, it depends on the satellite internet, but definitely for satellite TV, like DirecTV or something like that, you’ve got a satellite dish that is pointed in one spot, right. It’s mounted and aimed, usually a bit south, pointed at what’s called GEO, which stands for geostationary orbit. And those satellites orbit the Earth every 24 hours, so they don’t move relative to you. You can just have a dish pointed at one spot and you’ll always have a signal. For us, we only get to talk to our satellites about 10 minutes out of every 90 minutes, because they’re just moving really fast relative to us. And so we have antennas in Norway and Antarctica that we can use to talk to them, but we just have those limited windows. So in that 10-minute period, we need to download everything that happened over the last 90 minutes. And operators need to be able to see that data in near real time so that they can make decisions on what they need to do to keep the spacecraft safe for the next 90 minutes.
Caleb MacLachlan: 00:16:21.519 So it is kind of a high-pressure environment, and one where you don’t necessarily have a lot of time to think and react. And in that 10-minute period, we may be [inaudible] ingesting 10 million total measurements. So we need the system to be very low latency and performant. And our stack is built on Python and Django, and that kind of performance doesn’t necessarily come naturally to that combo. So this is something where we need to be very proactive with tuning our system for performance, finding bottlenecks, and making sure that it’s working as well as it possibly can. So we were looking for a solution that would let us monitor our performance, the performance of our code, specifically in production as well. So that’s one of our needs that we were trying to solve, and that we were looking to a time series database to help us tackle. So that’s kind of the summary of our needs for time series data. What we were using before: as I said, Django is kind of the backbone of our stack, and we were using PostgreSQL as our database. We didn’t really spend time doing optimizations for time series data in Postgres, so I’m sure that we could have tuned it to be a lot more efficient than it was. But we knew that even with tuning, it would be a very difficult task to make it perform to the level that we needed, and it would have required us to really change how we were doing things in Postgres. And it would have kind of taken us outside of what the Django object relational model could reasonably do.
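For context on what that ingestion traffic looks like: InfluxDB accepts writes as newline-delimited "line protocol" text. A simplified sketch of batching downlinked telemetry into that format (measurement, tag, and channel names are invented, and real line protocol also requires escaping and type suffixes this sketch omits):

```python
def to_line_protocol(measurement, tags, fields, ts_ns):
    """Format one point as simplified InfluxDB line protocol:
    measurement,tag=val field=val timestamp"""
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_str = ",".join(f"{k}={v}" for k, v in sorted(fields.items()))
    return f"{measurement},{tag_str} {field_str} {ts_ns}"

# A downlinked batch becomes one newline-delimited write payload,
# so millions of points arrive as a handful of bulk writes.
batch = [
    to_line_protocol("telemetry", {"satellite": "YAM-2", "channel": "battery_temp"},
                     {"value": 21.4}, 1627000000000000000),
    to_line_protocol("telemetry", {"satellite": "YAM-2", "channel": "bus_voltage"},
                     {"value": 28.1}, 1627000000000000000),
]
payload = "\n".join(batch)
```

Batching like this, rather than one insert per measurement, is what makes the 10-minute pass window workable.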
Caleb MacLachlan: 00:18:07.867 One of the things that I really like about our stack is that we use GraphQL for our API layer. Most software these days probably primarily relies on REST, and this is kind of an alternative. And basically what that does is allow the client side to craft the query that they want and basically decide what data they want, rather than having to pull from a few defined endpoints. So that’s really, really powerful, I think, for our customers to be able to explore the API and to be able to get exactly the data that they need and not a bunch of extra stuff that they don’t need. So we use that in our system for exposing to our customers, with the right authorization, exactly what they need, so that they can integrate with us at an API layer rather than using some web GUI or something. Most of our customers are savvy enough to be hooking up to us through our API. And then we use Graphene, which is a Python library that lets us connect our database layer to our API, so that it’s really easy to explore our Django models from our API.
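To make the "client picks its fields" idea concrete, here is a toy, pure-Python sketch of that pattern. The query text and field names are invented for illustration and are not Loft's actual schema; a real implementation would use Graphene resolvers:

```python
# A hypothetical GraphQL query: the client asks only for timestamp
# and value, even though the server stores more per record.
EXAMPLE_QUERY = """
{
  telemetry(channel: "battery_temp", last: 2) {
    timestamp
    value
  }
}
"""

def resolve(records, requested_fields, last):
    """Toy resolver: return only the fields the client asked for,
    which is the key contrast with a fixed REST endpoint."""
    return [{f: r[f] for f in requested_fields} for r in records[-last:]]

records = [
    {"timestamp": 1, "value": 20.1, "quality": "good"},
    {"timestamp": 2, "value": 20.3, "quality": "good"},
    {"timestamp": 3, "value": 20.6, "quality": "marginal"},
]
result = resolve(records, ["timestamp", "value"], last=2)
```

The point is that the response shape mirrors the query shape, so clients never receive fields they did not request.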
Caleb MacLachlan: 00:19:24.688 So when we realized, based on our needs and what we had, that we needed to do something for time series data, why did we choose Influx? One reason is just familiarity. I’ve used Influx for a few years at past jobs, so it’s something I’m personally aware of and know how to use. And one of my colleagues was also using it for other projects as well, so it kind of stood out from that perspective. Another reason is that it has a really large user base; a lot of people use Influx. You know, there’s things like this talk happening right now, right? There’s ways that you can learn from other people who are using it, and a good and active community. I personally like the shared Slack. I get on there and follow that, and it’s great to have a place where I can ask some questions. And then another really important thing: what you see here on the right is the Grafana query builder. We use Grafana super extensively, and I’ve used it in past companies, too. It has a ton of flexibility, which is really perfect for a company like ours in a use case like ours. So I knew that switching to Influx, we would instantly have this huge added value of being able to explore the data in a really autocomplete kind of way. So I should have had a gif here or something, but basically you can choose your measurements; it gives you a drop-down list, you start typing, it autocompletes it. You choose your fields and it autocompletes those too. We have thousands of channels of data coming down from the satellite, right, and I don’t know what all of them are by heart. So it’s really great to be able to just start typing and get, “Yeah, here are the 10 fields that match what you’re saying,” and just put those right in the dashboard. It’s extremely, extremely powerful.
And I’ve seen this in multiple instances really revolutionize the relationship our engineers have with the data that we get — being able to really quickly understand what’s going on, make decisions, and fly the spacecraft more safely. So knowing that this was an option was a big selling point of Influx.
Caleb MacLachlan: 00:21:40.933 And then another thing: one of my favorite features of Influx is subscriptions and how Kapacitor integrates with that. I was coming from a purely 1.x Influx setup before this, and we used that a lot. So being able to take action on measurements as they come in was something I knew would really help us with segregating data for individual customers and bus suppliers. And then the last thing is just that it was really quick to get started on this. I think it took me less than a day to have a proof of concept of, here’s Influx and Kapacitor, it’s all running in containers, and we can put data in it and show it in Grafana. That was a really quick process, and we’re moving on a really fast timeline in this company. So that was a big selling point as well.
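The subscription model means routing decisions like the data segregation described above can run as each measurement arrives. In Kapacitor this would live in a TICKscript; here is a hedged Python sketch of the same per-customer routing logic, with the payload tag, ownership table, and bucket names all invented for illustration:

```python
def route_point(point, customer_of_payload):
    """Decide which buckets a measurement is copied to: the shared
    all-data bucket that operators watch, plus the owning customer's
    bucket (if the point's payload tag maps to a customer)."""
    targets = ["all-data"]
    payload = point.get("tags", {}).get("payload")
    owner = customer_of_payload.get(payload)
    if owner:
        targets.append(f"customer-{owner}")
    return targets

# Hypothetical ownership mapping from payload tag to customer name.
owners = {"camera-1": "acme", "laser-comm": "globex"}
targets = route_point({"tags": {"payload": "camera-1"}, "fields": {"temp": 31.0}}, owners)
```

Because each customer only ever gets credentials for their own bucket, the isolation requirement falls out of the routing step rather than query-time filtering.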
Caleb MacLachlan: 00:22:32.972 So how did we implement this for Loft? Here’s a simplified view of what our architecture was before we moved to InfluxDB. One kind of quirk or need for us is that basically at this point, our spacecraft were not in space, right. Our spacecraft were in a clean room on the ground, and we had local servers basically physically hooked up to each one. The reason we need to have local servers, and not do it in the cloud like all the other cool people these days, is that we sometimes have to work in environments that don’t really have a good internet connection. And a good example of that would be at our launch base. So when we go down to Cape Canaveral and start to strap this thing onto the rocket, we need to do our final checkouts, but we don’t exactly have a fiber internet connection available to us. So we needed our whole stack to be portable on a server that we can move around. And this is a very simplified view, because otherwise my diagrams are going to get way too complicated. But basically, when we started out, our spacecraft control service was grabbing data from our spacecraft and then putting it into Postgres. And then we had Grafana hooked up through our GraphQL API to that Postgres instance, pulling the data. And it was not the best, basically, which was by design. When the system was designed, we were a very young startup. This was a total prototype to show that our concepts work. It was never intended to be the solution that we flew our satellites on. But the reason this didn’t really work is we couldn’t meet the performance requirements for actually flying satellites. Queries of more than on the order of an hour would take an extremely long time. And then we had a performance bottleneck on our data ingestion of writing to the database, and we optimized that as much as we could, but it was still too slow to handle the volume of data that we needed at the rates that we needed.
Caleb MacLachlan: 00:24:54.120 And then we would start to see our performance degrade after only a few days’ worth of data. And of course, we need to store years’ worth of data. So that was not going to work. And we also didn’t really have customer-specific data filtering at the time, which is definitely a big need for us. And we realized that with the solution as it was, there’s no way we could scale to having tens, hundreds of satellites. It just wasn’t possible. So that’s why we started to design a new architecture for storing our time series data, and why we chose to start to bring Influx in. So this was our initial design architecture for how we would use Influx. The idea being that on orbit, we no longer have to have a local server physically attached to the spacecraft. As I said, our ground sites are in Norway and Antarctica right now, so the satellites would be talking to those antennas. And then that data would be sent back to our spacecraft control service, which, as of this moment, our main one runs in Google Cloud. But we are able to run on other platforms as well. And then from that, we would publish data to InfluxDB Cloud 2.0 — we wanted to have all of our data in InfluxDB Cloud 2.0 and run it through some filter logic to put it into the different buckets for each customer and supplier. And then hook up an individual Grafana instance for each customer so that they could access their data just as easily as we could, and in just as timely a way as we can. And then we would just give ourselves access to that all-data bucket, which is the main entry point, so that we can see everything as soon as it happens.
Caleb MacLachlan: 00:26:48.951 And then the idea being that for future satellites that are still in development, still in the cleanroom, that still have to have a local server, we would have that data being published to a local install of an Influx database and then use Kapacitor to push that data into our main cloud instance, so that we could have that data archived long term and continuity of data when that satellite eventually launches. And I’m not showing our API or any of that other stuff on this diagram just for the sake of simplicity, but this was the general concept of, especially, how we’re handling that data filtering and what the general flow of time series data in our system would be. So we started to play around with this idea and started to have conversations with the InfluxData team. The InfluxData team was really helpful in understanding what we could and could not do with the architecture as it exists today. And we ran into some challenges with the architecture that meant that we couldn’t really implement it exactly the way we wanted to. One key piece was that subscriptions and streaming aren’t supported in Cloud 2.0 yet. I know that’s a feature the InfluxData team has been talking about, and they’re figuring out how they’re going to handle that. But as the Cloud 2.0 product matures, it just isn’t there quite yet. So that kind of caused a problem for us, because our data often comes in based on how and when we download it using those 10-minute windows; we’ll get late-arriving data pretty frequently. So a batched query doesn’t really do it for us. And as I said, most of our customers are going to want access to their data quickly. So we don’t really want to have a big delay between when we see the data and when they see the data. So that streaming capability of InfluxDB 1.x is something that was really important to our use case, and it’s just not quite yet supported on the cloud.
Caleb MacLachlan: 00:29:03.033 And then another issue that we had in some of our testing was with the core language that you should be using to query a 2.0 database, which is Flux. It’s faster, it can do more things, and it’s the native way to talk to a 2.0 database. But the good thing is there’s a compatibility layer that lets you use InfluxQL, which is the previous query language for 1.x, on a 2.x database. So that is what we notionally use, and we have to use it because that’s the thing that the Grafana query builder supports, for one thing. But sometimes we’ve run into performance issues or instability issues with that kind of transformation layer. And then the last part, which is kind of unrelated to the Influx architecture but is just a challenge that we face internally, is that we have this nice API that is accessible through GraphQL, that you can traverse like a tree. And we needed a way to plug our Influx data seamlessly into that, so that we weren’t telling people, “Hey, if you want to access this kind of data, you go to this API. And if you want to access this kind of data, you have to go to this other API.” Right. There aren’t any good connector libraries that we could find that would let us hook up InfluxDB to GraphQL, because that’s not a use case most people have. So those are some of the initial challenges that we faced.
Caleb Maclachlan: 00:30:45.186 So we made some modifications to our architecture and this is what we came up with. So the key difference here is that we now host our own 1.8 Influx open source database in our Google Cloud, and our control service publishes measurements directly to that. So as soon as we get the measurements, they’re sent into 1.8, and then Kapacitor takes those subscribes to that, 1.8 does support subscriptions, subscribes to that and pushes data multiple times to the cloud. So we actually write data to the all data bucket, which we use for our displays to control the spacecraft. But then we also push data additionally to individual customer buckets, into our supplier buckets. So we’ve kind of worked around the fact that the cloud internally doesn’t yet support the kind of data filtering that we need by doing it on our side, in our cloud and then just pushing it multiple buckets which are then available to our customers. So that’s really just the key design change that we had to make was to just do that externally on our cloud rather than internally on the InfluxDB Cloud. So I’ve been talking about kind of the big piece of this was, which was the spacecraft control and data filtering. As I said, we also have this other need, which was the ability to monitor our software performance and make sure that it stays tuned, especially in production. So just to recap some of the goals, we wanted to be able to capture, like a piece of our process. We wanted to be able to capture how long that took. And what we found was first I was trying to measure performance on averages like for this particular operation, how long is it taking on average? But when you’re doing operations like thousands or tens of thousands of times, sometimes you’ll have, if it’s taking normally 2 milliseconds, but then sometimes takes 500 milliseconds. That’s huge data point that you really need to have and taking averages wasn’t really cutting it. 
So we were taking averages and maxes and standard deviations, and it was really hard to understand what each part of the code was doing.
Caleb Maclachlan: 00:33:10.262 So the logging that we were trying to do wasn't really working. And as I said, we wanted something that would be safe in production, because it's really hard, in a dev environment, to emulate the load that multiple satellites are going to put on a spacecraft command and control system. We needed it to be really quick and easy to implement. We didn't want a bunch of extra code polluting our codebase, injecting a bunch of stuff that makes everything uglier, more fragile, etc. And then we needed to be able to handle thousands of operations per second. So what we did was design a simple Python object that aggregates metrics, use a context manager associated with that object to time the actions, and also create methods so that we could push non-timed metrics, such as a queue length, to this object. And the way this object works is, when you are measuring an action, it doesn't immediately write that to Influx. It just aggregates until it gets to a certain size. There are certain points in the code marked as safe places to publish data to Influx, and when it hits those checkpoints, it will check: do I have enough stuff to publish? If yes, it'll publish a big chunk of data to Influx. And one of the nice things about this is that the whole system can be enabled or disabled with one env var. So if we're debugging an issue, we can just turn it on, and suddenly we've got all these metrics about our system and how different parts of our code are performing. And then we can flip that off, and we're not spending any time at all on those performance metrics. But in practice, it's been performant enough that we've been able to just leave it on.
Caleb Maclachlan: 00:35:11.080 So that was our design and what we implemented. Now we get to the fun part where I actually get to show some data. I love visualizing data, so this is my favorite part. Here's an example dashboard that shows the kind of thing we would be looking at as operators when the spacecraft comes in for a pass. As I said, we get about 10 minutes to make decisions on what the spacecraft's doing. And you can see here, Grafana gives us a huge amount of flexibility, through all the plug-ins that we use, in showing the data that we need. I think that 3D globe plug-in is a pretty cool example. It's kind of hard to see on this small slide, but there's a red line that goes across there that is actually coming from GPS data on our satellites. Our GPS is saying, "Here's where I am," and we're plotting that as it goes over our ground station in Norway. So those are the kind of really cool things to have as an operator to see what is going on, as well as these status maps and switch lists. Being able to see at a glance that everything is healthy is super powerful, and these are exactly the kind of displays that we need as spacecraft operators. Another example dashboard: one of my favorite things here is this panel called Attitude. It might be a little hard to decipher, but basically I've got a little box that represents my satellite, and I can see three-dimensional vectors on that box. Three of those are just pointing out which axis is which, but then there are two special vectors. One is nadir; nadir basically means the ground. So that purple vector is pointing towards the ground, and then the yellow vector is my sun vector, telling me the direction of the sun relative to the spacecraft.
And as a spacecraft operator, I've never before had a tool like this to see how the spacecraft is oriented, in real time or, by moving the time slider back, at another point in time.
Caleb Maclachlan: 00:37:36.965 So at a glance, I can see this and say, "Okay, the nadir vector is aligned with the plus X axis of the spacecraft." I know that's good, because I know that plus X is where my antennas are, so I want those pointed towards the earth. And I can also see that the sun is mostly on the plus Y face of the spacecraft, which is where one of the two solar panels on the spacecraft is. So this thing is going to be charging, everything is good, I'm a happy operator. And being able to see that in real time in my displays, alongside the other telemetry that I get to look at, is just huge. So this is working really well for us and making it a lot easier for us to safely operate the satellite. Another use case that I mentioned is alerting. There are actually a lot of ways to do this and do it well. Influx has native alerting in the Cloud 2.0 product where we could set up alerts. We chose to go with Grafana alerting, partly because Grafana is already the go-to interface for operators: they know it, they use it a lot. So putting our alerting in the same place where we do all of the rest of our telemetry monitoring just made sense. The visualizations are really useful; I just created an example here, but you can see where your limits are relative to what your telemetry is doing. And there's a lot of flexibility in how you design the alerts. I could be looking at an average, I could be looking at maximums, I could be doing some fancy fit or a prediction on the data, and then alerting based on that, which is super cool.
Caleb Maclachlan: 00:39:22.132 And then one of the nice things about our stack: for viewing, we normally point to the Cloud 2.0 all-data bucket. We have a dropdown for the database at the top of all of our dashboards (you can actually see it on this one), and it points to either our local 1.8 instance in our Google Cloud or the Cloud 2.0 instance. So we always have the ability to switch to our 1.8 instance if we encounter stability issues on the InfluxQL API, which we haven't since launch, thankfully. So we can always switch back and forth. And the nice part is we can switch our alerting back and forth as well. So if we start to have an issue with the Cloud 2.0 side, we can basically flip a switch and we'll be back on our 1.8, and everything will be safe and secure. One kind of interesting piece of our implementation that I think worked out really well is our federation. As I said, one of the challenges that we faced early on was that we have this nice GraphQL API and there wasn't really a native way to bring Influx into that. It's probably going to be a little hard to see, but on the left-hand side of this is GraphiQL, which is a tool that is really easy to plug into any GraphQL API and lets you explore it and build queries for it, with autocomplete and those kinds of features, and then see the result on the right-hand side.
Caleb Maclachlan: 00:41:17.878 So I made a simple query that says, "Hey, give me all of the telemetry items that match this particular path, and then from those, give me their type and the most recent state that you have for the value and the timestamp." So I created a custom query. And the cool thing about this query is that a lot of what it's accessing is still in our Postgres database, because we have a lot of non-time-series data, right? Like the identity of our satellites, what channels of telemetry they produce, the metadata about them, the calibration factors, and those kinds of things. We don't keep all of that in Influx; we keep our time series data in Influx. So when I ran this query, it was pulling the metadata about these particular points from our Postgres database, but it was pulling the actual values from Influx. And from a user perspective, it's totally seamless; I don't see that it's coming from two databases. I could have run effectively this exact same query before we made the change to Influx, but the difference is I'm seeing a 10x speed improvement on how long a query like this takes, just because we're not accessing those really big, bloated time series tables in Postgres, and instead we're accessing Influx under the hood. So we were able to build something that looks like the rest of our API and supports Relay, which is a GraphQL standard that lets you do things like pagination. So I can do pagination on my Influx data using the same methods that I would on data coming out of Django. Of course, the main way we access our Influx data is by using Grafana to visualize it, but if I need to programmatically access it through the API, I have this as a tool, which is really key, and we were able to build it without a huge amount of additional effort.
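As a toy illustration of that seam (all names and values here are hypothetical), a resolver can stitch channel metadata from the relational store together with the latest value from the time series store, so the caller never sees that two databases are involved:

```python
# Stand-ins for the two backends: in the real system, POSTGRES would be
# Django/Postgres metadata tables and INFLUX a "last value" query per
# telemetry channel.
POSTGRES = {
    "eps.battery_v": {"type": "float", "calibration": 0.001},
    "adcs.sun_angle": {"type": "float", "calibration": 1.0},
}
INFLUX = {
    "eps.battery_v": (1627380000, 7912.0),
    "adcs.sun_angle": (1627380000, 43.2),
}

def resolve_telemetry(path_prefix: str) -> list:
    """One GraphQL-style resolver: metadata from the relational side,
    latest value and timestamp from the time series side."""
    items = []
    for path, meta in POSTGRES.items():
        if not path.startswith(path_prefix):
            continue
        ts, raw = INFLUX[path]
        items.append({
            "path": path,
            "type": meta["type"],
            "value": raw * meta["calibration"],  # apply calibration factor
            "timestamp": ts,
        })
    return items
```

The point of the sketch is the seam, not the stores: the shape of the returned object is uniform regardless of which backend each piece came from.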
Caleb Maclachlan: 00:43:28.904 So again, going back to the performance monitoring side, that piece that we implemented to carefully track the performance of our code, I just wanted to give a quick example of what that looks like. This is a little bit simplified, but this is effectively how it looks for us. We create a stats publisher object and give it a measurement name so it knows where to go in Influx. We can also add some more options here to help it decide how frequently it needs to publish, but this is the vanilla version; this is all that you would need to do. And then we just have a context manager, and each time you initialize a context manager, you give it a field name. That's also part of how you find that data in Influx later, and again, this will just autocomplete in Grafana, so it's really easy to access the data after it's in. You do a couple of operations that might be expensive or that you think could take a while, then exit your context manager, and it attaches how long that operation took to the stats publisher object, along with whatever else you've recorded. And then, when there's a safe place in your code, when you're not in the middle of something critical, you can call the publish_if_ready() method. publish_if_ready() then checks whether that stats publisher has aggregated enough data to hit the threshold where it says, "Okay, I'm ready to publish all this data." If it hasn't, it just won't do anything, but if it has, it'll send that data to Influx. One of the things we also always measure with this is how long the steps internal to the stats publisher itself are taking, so that we can make sure that the thing that is trying to help us speed up is not actually slowing us down. And so far, the actual writes to Influx are pretty performant, so that hasn't really held us back.
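The walkthrough above can be sketched in Python. This is a guess at the shape of the API based on the talk, not Loft's actual code: the class aggregates points, a context manager times blocks, a method records non-timed metrics, one environment variable toggles the whole thing, and publish_if_ready() flushes only at safe checkpoints. The real version writes to InfluxDB; here the sink is a plain callback so the sketch stays self-contained:

```python
import os
import time
from contextlib import contextmanager

class StatsPublisher:
    """Aggregate metrics in memory, publish in batches at safe points."""

    def __init__(self, measurement, threshold=100, sink=print):
        self.measurement = measurement  # where the data lands in Influx
        self.threshold = threshold      # points needed before a flush
        self.sink = sink                # stand-in for the Influx write
        self.points = []
        # One env var enables or disables all metric collection.
        self.enabled = os.environ.get("STATS_ENABLED", "1") == "1"

    @contextmanager
    def timer(self, field):
        """Time the enclosed block and record it under `field`."""
        if not self.enabled:
            yield
            return
        start = time.perf_counter()
        try:
            yield
        finally:
            self.record(field, time.perf_counter() - start)

    def record(self, field, value):
        """Also usable for non-timed metrics, e.g. a queue length."""
        if self.enabled:
            self.points.append((self.measurement, field, value))

    def publish_if_ready(self):
        """Call at safe checkpoints; flushes only once enough has accrued."""
        if len(self.points) >= self.threshold:
            self.sink(self.points)
            self.points = []
```

In use, you wrap suspect operations in `stats.timer("field_name")` and call `stats.publish_if_ready()` wherever the code is not in the middle of something critical.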
Caleb Maclachlan: 00:45:26.350 So here are some examples of what this looks like in practice, how we actually use this data. There were a couple of items that I have here, the write-telemetry-to-Postgres operation and the push-telemetry-item operation. And you can see in this handle-summary-packet breakdown that writing telemetry to Postgres is usually taking the majority of the time. The other one was not taking as much time. And then I can visualize this data in a bunch of different ways, because I have the raw data of every single time that operation was executed. I can do averages; I can do basically whatever I need to understand where performance bottlenecks might be. This also gives me a great view: if I do a software release, I can see, "Oh, now we're taking twice as much time to do this particular operation," and then I know something went wrong and can go back and try to fix and debug it. Or, ideally, I'm seeing that after we made a change, my overall time for packet handling is down 25%. That's the kind of change that we're looking for, and this lets us have a big heritage of data from actual production usage to compare any future changes against, which is super helpful to a software engineer. And I realize I'm starting to run out of time here, so I'll be pretty quick with the challenges and the next steps. Most of these I've already touched on. For the lack of subscriptions in Cloud 2.0, we're keeping a 1.8 instance in our cloud and then using Kapacitor to push to multiple buckets. As I said, we ran into some InfluxQL stability issues, so we're just keeping the 1.8.x as a failback that we can use if we need to.
Caleb Maclachlan: 00:47:19.986 And then another thing, kind of a small one, but one that other people may encounter too: right now auth tokens are tied to a user account in Cloud 2.0. So a bunch of us have user accounts to administer the Cloud 2.0 instance, but if we need to create a token for a customer, we have to have a special admin account, with a shared password, that we log in with and do it that way, because otherwise the token we're creating for a customer would be tied to one of our specific user accounts. Another issue that we're still in the process of solving is that in Kapacitor, you can't filter fields below the measurement level. You can't say, "Hey, I just want these particular fields of this measurement." To get around that, we reached out to the Influx support team, and they suggested we use Telegraf, where we can use wildcards in its drop-fields option, and that should let us do the filtering we need. So we're looking to switch to Telegraf for some of the filtering we're doing, rather than doing everything in Kapacitor. But the big takeaway here is to talk to the InfluxData team and leverage the community when you're trying to improve what you're building. Don't go it alone; there's a lot of help out there, and we definitely benefited from it as we developed ours.
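A hedged sketch of what that Telegraf configuration might look like; the URL, bucket name, and field patterns below are hypothetical, and `fielddrop` (Telegraf's field-exclusion filter) accepts glob-style wildcards:

```toml
# Sketch only: forward one customer's telemetry, stripping internal fields
[[outputs.influxdb_v2]]
  urls = ["https://cloud2.example.com"]
  bucket = "customer-a"
  ## Only pass measurements this customer is entitled to
  namepass = ["payload_*"]
  ## Drop internal engineering fields before they leave our cloud
  fielddrop = ["debug_*", "raw_*"]
```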
Caleb Maclachlan: 00:48:40.711 So what's next for us? The big one, obviously, is we're growing fast. We're going to need to scale this up, keep pushing our performance, and ingest a lot more data. We do still want to get to our original design, where we replace the 1.x instance in our cloud and do everything in the Influx 2.0 cloud. So that's still our goal, and we're going to continue to work with the Influx team as that becomes available. We would also like to switch to Flux instead of InfluxQL, but that means we need to wait for InfluxData, Grafana, or someone else in the community to build a query builder for the Flux language. We don't have the time to make that happen ourselves right now, or we probably would, but we're hoping that it happens, because that would unblock us from using Flux, which is what we really want to do. So that's the content that I have for today. If this sounds interesting or exciting, the good news is we are scaling up our company and starting to hire like crazy right now. We have a careers site, but you're also welcome to just email me your resume directly, and I can take a look and see if there's a good fit. And as Caitlin said, I'm also active on the InfluxDB community Slack, so you're welcome to send me a message or your resume there. I'd love to hear from you. This is a really exciting time to be a part of this company, so I'm definitely hoping that this talk might tempt someone to send us a resume. And then, Caitlin, you wanted to talk about InfluxDays?
Caitlin Croft: 00:50:32.658 Yes. Thank you, Caleb. So as I mentioned at the beginning of the webinar, we have InfluxDays North America coming up. On October 11th and 12th, we have the hands-on Flux training. It's run by a bunch of data scientists and professors in Italy; I think this will be the fourth or fifth time they're giving it. It's a great course, they tweak it every time, and of course, because they're Italian, the use case is about IoT sensors on a pizza oven. So it's kind of funny, you get to talk about melting cheese and all that. And then, it's not listed here, but on October 25th, the day before the conference, we have our popular Telegraf training. So if you're pretty new to Influx and you're just starting to get familiar with InfluxDB Cloud, it's a really great free course. There is a fee attached to the Flux training, just because we want to make sure people really enjoy it and to cover the cost of the trainers. And then the conference itself is on October 26th to 27th. It is completely free, it's two half-days, and it's virtual. There will be lots of sessions from our engineers as well as amazing community members, so it's definitely worth checking out. All the sessions will be recorded and available, but we'd love to see you all there. Caleb, I think you kind of stunned everyone; everyone is probably just as impressed as I was. So we'll give everyone just a couple of minutes. If you have any questions for Caleb, please post them in the Q&A. I have a couple of questions. Caleb, you mentioned that you've used InfluxDB at a couple of different companies now. What are some tips that you could share for someone who is pretty new to Influx? What are some things that you've learned along the way, tips or tricks that you'd like to share for the initial implementation?
Caleb Maclachlan: 00:52:36.393 Yeah, for sure. I think it's really easy to get started with Influx, but you want to think carefully about how you structure your data. You'll hear "series cardinality" thrown around a lot when people run into issues with how they've structured their data. That's definitely a concept I would try to understand before you dive in headfirst, because a lot of what you use is tags, and you don't want to have tons and tons of totally unique tag values. For example, we could theoretically tag each piece of data with the unique contact ID that we use each time we talk to our spacecraft. That is not a good idea, because we're going to have way too many tag values, and Influx has to index across all of those, so our series cardinality goes up like crazy, which is what really kills the performance of an Influx database. So that's the number one thing I would watch out for when starting to build out an Influx database: series cardinality. Just make sure you understand that and you'll be okay.
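A quick back-of-envelope check makes the point. Series cardinality grows roughly as the product of the distinct values of each tag key, so one unbounded tag (like a per-pass contact ID) multiplies everything. The numbers below are made up for illustration:

```python
from math import prod

def series_cardinality(tag_value_counts: dict) -> int:
    """Rough upper bound: product of distinct values per tag key."""
    return prod(tag_value_counts.values())

# A bounded tag set stays manageable:
bounded = {"satellite": 10, "subsystem": 8, "channel": 500}
assert series_cardinality(bounded) == 40_000

# Add a unique contact ID (say 20 passes/day for a year) and it explodes:
unbounded = dict(bounded, contact_id=20 * 365)
assert series_cardinality(unbounded) == 292_000_000
```

This is why identifiers that grow without bound belong in fields, not tags: tags are indexed, and the index has to cover every series.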
Caitlin Croft: 00:53:50.630 Yes, I definitely heard that before. People get a little stumped around cardinality, so be sure — if you have any questions around that, we have lots of blogs around it, and I know a lot of people talk about cardinality in the community. So be sure to check that out. Let’s see what else. So when you were implementing it, you had mentioned you and someone else at Loft had used InfluxDB prior, was it hard to get adoption across the organization or was it pretty straightforward?
Caleb Maclachlan: 00:54:27.288 I think at first there was some skepticism, like, is this something we really need to do? And the reality is, just showing the Grafana query builder and the immediate, easy access to the data that it gave added enough value to have an instant impact and silence any questions about whether this was something we really wanted to do. So it was easier to just spend a day building it and then say, "Look," rather than trying to have long conversations about whether it's really valuable. Because once someone actually looked at what they could do, the value was so apparent that there was no question, right? So I think that was the key for adoption. And it was a similar deal at a previous company, where there was all this, "Oh, should we really do that? It's going to be all this change." And then you show it, and it's just silence; everyone loves it. People go off and build things that I could never have imagined; they just go figure out how to build it. So, yeah, it's really cool like that.
Caitlin Croft: 00:55:39.137 That’s awesome. And it sounds like you were pretty familiar and comfortable with Time series data, but after using it at Loft, were there any things that you discovered that weren’t expected?
Caleb Maclachlan: 00:55:56.149 I think one of the things that was slightly different about how we did it here is that we have really big measurements. Some of our measurements have well over 1,000 fields in them, and I wasn't sure how that was going to work. But the logic is that those fields generally come down together in one package, so those 1,000 fields are always together. We weren't sure if it would be the best approach to just have these giant measurements, but we found that it worked really well. So that was an interesting, kind of Influx-specific thing to learn: that that was a viable approach, and one that seems to have worked quite well.
Caitlin Croft: 00:56:43.645 Cool, and another question I had, downsampling. Usually when you look at downsampling time series data, when you start off, you’re collecting it, let’s say every millisecond, every second, whatever the case may be. But after a while, you can downsample your data and maybe realize you don’t need to be collecting your data at that kind of granularity. Do you think that will be the case with Loft or do you think you always want it pretty granular, given different missions, different projects?
Caleb Maclachlan: 00:57:17.483 Yeah, I haven't been part of a space company yet where we have been brave enough to throw away data, because these things are so expensive, and we're always going to be scared that we'll throw away the one most important measurement. But it is something that we've talked about, and as we scale, I think we will eventually need to. One thing that I like is that we could do it in Flux or in InfluxQL: we can copy data to another bucket that is downsampled. What we would probably do for our particular use case is at least three downsamplings on each series. So let's say the data normally comes in at one hertz, once every second. Maybe we bucket it to every 30 seconds and take the average, the minimum, and the maximum, and store all three of those as separate series. With that, we wouldn't really lose the most important parts of our data, which is, "Hey, what is the hottest that this thing has ever gotten?" I'd still know the maximum absolute measurement that I've ever gotten on that sensor, but I can still get a 10x reduction in the total data volume that we're storing. Now, the nice thing is we don't have to do that yet, because Grafana and Influx play really well together in terms of displaying a downsampled version of the data on the fly. So I can zoom out and see a year's worth of data seamlessly, because under the hood the query is already downsampling it into a manageable amount of data to display. So that's kind of the strategy.
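In production this would be a Flux or InfluxQL task writing into a downsampled bucket; the windowing itself, with illustrative numbers, looks like this in plain Python:

```python
from collections import defaultdict

def downsample(samples, window=30):
    """Bucket (unix_time, value) samples into `window`-second windows,
    keeping mean, min, and max per window (three parallel series).

    Returns {window_start: {"mean": m, "min": lo, "max": hi}}."""
    buckets = defaultdict(list)
    for t, v in samples:
        buckets[t - t % window].append(v)
    return {
        start: {"mean": sum(vs) / len(vs), "min": min(vs), "max": max(vs)}
        for start, vs in buckets.items()
    }
```

Thirty 1 Hz samples collapse to one row in each of three series, which is roughly the 10x storage reduction mentioned above, while the per-window max still preserves "the hottest this thing has ever gotten."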
Caitlin Croft: 00:59:02.890 Okay, yeah, that makes sense. I was sort of curious just because, yeah, downsampling this kind of data doesn’t exactly seem like something you guys would want to do. My one final question. I could probably ask you questions all day. What about anomaly detection? I’m sure those are probably the most important. I know you kind of covered it a little bit obviously with all the alerts that you guys have set up. But just kind of curious if there’s anything specific that you learned along the way with setting up anomaly detection or any interesting anomalies that you discovered along the way?
Caleb Maclachlan: 00:59:41.027 Yeah, I think that that’s a big kind of greenfield area for us, that we’re going to be spending a lot of time investigating as we launch more satellites — excuse me — is automatically detecting more advanced anomalies using ML kind of techniques, that are incorporating multiple points together rather than just monitoring one channel. There’s some space companies that have done some really interesting things around that. We haven’t gone to the level of that complexity yet, but I think that’s something that we’re really excited about as we grow our company to get more into.
Caitlin Croft: 01:00:16.799 Awesome. Well, thank you, Caleb, so much. It's always fun chatting with you. Like I said before, I'm a little bit of a space nerd, so these sorts of webinars are always really fun for me. Thank you, everyone, for joining today's webinar. Once again, it will be made available for replay later today, and the slides will be made available for review as well. And yes, if you're looking for a job, reach out to Caleb. I'm sure it's a fun company to work for. Thank you, everyone, and I hope you have a good day.
Caleb Maclachlan: 01:00:50.844 All right. Thanks, everyone.
Senior Spacecraft Operations Software Engineer, Loft Orbital
Caleb is a space junkie and a veteran of five space startups. His expertise is in designing and building high performance command and control systems for labs, satellites, and launch vehicles, primarily in Python. He has been using InfluxDB as a key building block for these systems for the last few years. Caleb is a graduate of the Astronautics program at the United States Air Force Academy.
Caleb lives in San Francisco, and enjoys snowboarding and sea kayaking when off duty.