How a Particle Accelerator Monitors Scientific Experiments Using InfluxDB
Webinar Date: 2021-09-28 08:00:00 (Pacific Time)
European XFEL are the creators of the strongest x-ray beam in the world. Their 3.4-km long X-ray free-electron laser underground tunnel is used by researchers from around the world. Scientists use their facilities to map atomic details of viruses, film chemical reactions, and study the processes in the interior of planets. Discover how European XFEL uses InfluxDB to monitor their scientific experiments and research.
In this webinar, Alessandro Silenzi will dive into:
- European XFEL’s approach to empowering the worldwide community to push the boundaries of science
- The evolution of their data management solution — from homegrown to InfluxDB
- How a time series platform is used to analyze and validate experiment data
Here is an unedited transcript of the webinar “How a Particle Accelerator Monitors Scientific Experiments Using InfluxDB”. This is provided for those who prefer to read rather than watch the webinar. Please note that the transcript is raw. We apologize for any transcription errors.
- Caitlin Croft: Customer Marketing Manager, InfluxData
- Dr. Alessandro Silenzi: Team Leader, Controls Development Team, European XFEL GmbH
Caitlin Croft: 00:00:00.493 All right. I think we will get started here. Hello, everyone, and welcome again to today’s webinar. My name is Caitlin Croft. I work here at InfluxData, and I’m very excited to have you joining us today for a webinar about how European XFEL, who have created a particle accelerator, are using InfluxDB to monitor their scientific experiments. Once again, please feel free to post any questions you may have for Alessandro in the chat or the Q&A. And this session is being recorded. So without further ado, I’m going to hand things off to Alessandro.
Dr. Alessandro Silenzi: 00:00:41.134 Hi. So my name is Alessandro Silenzi, and I work at the European XFEL as team leader of the development team of the Control group of the European XFEL. The title has been chosen by Caitlin, and I actually like it a lot, although we are not a particle accelerator, but we do monitor scientific experiments using InfluxDB. So this talk will talk about X-rays in general, and more quickly, about the European XFEL and the control system that we developed for it, which is called Karabo, and our data logging system, as well as how we transition that to use InfluxDB. After having done this, we will talk a bit about the implementation details — because I understood that you are interested in exactly that — go through some of the design choices that we made and also share some gotchas that might be useful to some of you, and also, at the end, as tradition, an outlook of what we want in the future.
Dr. Alessandro Silenzi: 00:01:57.114 So X-rays. So what are X-rays? X-rays, essentially, are electromagnetic radiation. Your dentist has probably something like this. This picture on the right is very close to the first actual X-ray slice of [inaudible] that has ever been done by Conrad Röntgen 125 years ago. Frau Röntgen was apparently a very trusting soul. But what your dentists have is actually very close to what we have, which is there is a small electron gun which is a tiny particle accelerator, so a 50 kilovolt of voltage which will net around that energy on electrons. There is a way to stop the electrons to make them generate X-ray radiation, a vacuum system — all this X-ray tube is in a vacuum — a sample, which in this case is a hand, and a detector, which, in this case, is a film.
Dr. Alessandro Silenzi: 00:03:05.774 What do we do with that scientifically? As in, we can see here two, I think, historically meaningful examples of X-rays. One is Photo 51, which is an X-ray diffraction picture of a paracrystalline DNA sample, and that helped — this particular picture that helped confirm the theory that the DNA has a double-helix. And on the right, you see the CheMin instrument, which is Chemistry and Mineralogy, if I’m not mistaken, which is a powder diffraction, X-ray powder diffraction instrument which is currently housed in the Curiosity Rover. So we are doing X-rays on another planet now. But of course, we have those technologies. We are even doing it on another planet. Why do we need an XFEL? And what is the European XFEL?
Dr. Alessandro Silenzi: 00:04:16.597 The European XFEL is a research facility which is funded by 12 countries. All of them you can see here on the top-left. A nice family picture. If you actually zoom in, we are located in the city of Hamburg in North Germany, and it’s actually on the border between the German federal state city of Hamburg and the German federal state of Schleswig-Holstein. You can find us on OpenStreetMap.org, and actually, scaringly accurately, you can even see the tunnels in that website. We’re operating since September 2017 as a user operation, but before that, of course, we had some months of commissioning.
Dr. Alessandro Silenzi: 00:05:14.176 How do we generate X-rays at the European XFEL? The European XFEL, by the way, means European X-ray Free-Electron Laser. And you have, still, all the elements that you saw on the first slide. So you have an electron gun, which in this case is an electron accelerator which starts from DESY and goes toward us. DESY is our partner institute, and it stands for Deutsches Elektronen-Synchrotron. It’s a campus really nearby, a few kilometers away, and it has a LINAC which accelerates electrons to 14 to 17 GeV. The X-rays are then generated through a system of undulators, which we’ll talk about very soon.
Dr. Alessandro Silenzi: 00:06:20.316 How do we make X-rays instead of ramming all the electrons against a target of tungsten like the one in the first slide? We are letting the electrons — we guide the electrons — or after accelerating them, of course, we guide them through an undulator system which has alternating magnetic fields. And this makes the electron wiggle, and which will generate spontaneously radiation from that. The fact that the trajectory is curved makes the electrons actually accelerate, and acceleration and deceleration go from there. All these relativity reasons are similar, and they can generate radiation.
Dr. Alessandro Silenzi: 00:07:14.179 If this trajectory is particularly tuned, you can generate a phenomenon called SASE, which is Self-Amplified Spontaneous Emission, and this will — it’s a process that builds intense, laser-like flashes. This would mean that all the electrons can, sort of, seed their energy into a very powerful X-ray beam which looks like a laser. And it actually looks like a laser if you pass it through a four-blade slit. This happened before our user operations on the 30th of June 2017. This particular picture is from then, and you can see the Fresnel pattern. And this is a sort of a proof that we do have a coherent light.
Dr. Alessandro Silenzi: 00:08:12.416 So going back to our initial system, we have an electron beam which is led through a target that make it generate X-rays, and you see that in these striped sections in our tunnel. The electrons can be, of course, through magnetic means, tuned left and right and guided through the tunnels. And once the X-ray radiation has been blue and the photon path is in orange, at that point, the light has been tuned and can be guided through with an optical system in a vacuum. The list of the instruments that are commissioned and in operation at the European — housed in the same experimental hall, and they also are targeting different types of physics. Physics or biology. Okay.
Dr. Alessandro Silenzi: 00:09:25.236 So on the top is — yeah, it’s a type of instrument that I mentioned, so the one using crystallography. Actually, I will go back to that. I will not go to the details of all the different physics for a separate reason, one of them is it’s not my forte. But just as an example of how the environment looks like in experimental hutch is essentially this jumble of wires and equipment. And you can see more or less the size of the instrument and with the people inside it. So you can more or less see the scale. Going back to crystallography, we have a beam, a photon beam which contains X-rays, which is made by X-rays, is driven, and it needs to be aligned. And that we do with particular mirrors, a mirror system which can only reflect at grazing angles. I’ll dig deeper on that. But you essentially need to align the beam and the sample delivery. So we have a sample. So not the hands of Frau Röntgen, but that.
Dr. Alessandro Silenzi: 00:10:41.101 But in the case of serial crystallography, it’s a jet of fluid with tiny crystals of interesting biological molecules, as you can see here in this detailed view. This can also be excited with some optical laser light, depending on how you want to do the measurement. You should measure the intensity of the X-ray beam, monitor the environment temperatures, the vacuum stage, the vacuum gauges, and also, you need to configure the detectors, right? So we don’t have photographic film, but we have silicon-based detectors that are read out at 10 hertz, the train rate, and can acquire a burst of up to several hundred, up to 200, images per train. You need to start up the data acquisition. You need to synchronize all these things to work together, and you also need to monitor data quality. How do you do that? How we chose to do that at the European XFEL is by developing our own control system, which is called Karabo. Karabo is a word from some southern African languages, such as Sepedi and Sesotho, which means “the answer”. It’s a very optimistic name. It’s a control system which is actor-model based and exchanges messages through a central broker for control and slow data and for topological changes.
Dr. Alessandro Silenzi: 00:12:40.142 Currently, the broker is implemented with OpenMQ, which, for those who don’t know, is the Java messaging system, so the free version of the Java messaging system, and soon it will be interchangeable with other protocols: MQTT, AMQP with RabbitMQ, of course, and Redis. It’s event-driven. Data propagates through the system only when a value changes. So think of pushing, not polling. Of course, there is a qualifying asterisk, which is that you, of course, would love to poll any time you want. It’s message-driven, so we borrowed the Qt signal-slot paradigm, and it uses asynchronous calls with synchronous convenience middleware. Its main APIs are in Python and C++. C++, of course, relies on Boost heavily. On top of this slow data, we have peer-to-peer connections where there are several topologies allowed: scatter and gather, copy and distribute, and the load, and you can wait or drop on traffic congestion or consumer congestion. The peer-to-peer connection uses TCP as the transport layer, and the same is used between the GUI server and the GUI client, and more on that later. It’s completely capable. It’s fast enough to saturate a 10-gigabit client. And one key element of the system is the so-called GUI server, which essentially allows interaction with the system through one single TCP port, which is good.
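The "pushing, not polling" idea can be illustrated with a minimal sketch. This is not the Karabo API: the `Property` class and its `connect` and `set` methods are hypothetical names chosen for illustration, loosely following the signal-slot paradigm mentioned in the talk.

```python
from typing import Any, Callable, List


class Property:
    """A value that notifies its subscribers only when it actually changes."""

    def __init__(self, name: str, value: Any) -> None:
        self.name = name
        self._value = value
        self._slots: List[Callable[[str, Any], None]] = []

    def connect(self, slot: Callable[[str, Any], None]) -> None:
        """Register a callback (a "slot") for value changes."""
        self._slots.append(slot)

    def set(self, value: Any) -> None:
        if value == self._value:
            return  # unchanged value: nothing is pushed
        self._value = value
        for slot in self._slots:
            slot(self.name, value)


received = []
position = Property("actualPosition", 0.0)
position.connect(lambda name, value: received.append((name, value)))
position.set(0.0)  # same value, no message
position.set(1.5)  # changed value, one message
position.set(1.5)  # same value again, no message
```

Only the middle call produces a message, which is why a logger listening to such a system sees sparse records rather than periodic samples.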
Dr. Alessandro Silenzi: 00:14:37.016 Where dynamic and discoverable are topologies, there is no central database instance that defines what should start where, and the sort of how to discovery will come in detail specifically when we will talk about the requirements on the logging database. So what you can see on the right is essentially a schema of more or — a scheme of how everything comes together. So the data logging in our case, which monitors performance and diagnostic directly connects to the broker. So we do not directly connect to larger or fast data like cameras and large acquisition equipment.
Dr. Alessandro Silenzi: 00:15:34.300 More in detail. So you can see a little bit more in detail on Karabo. You can see here is the view of our Karabo GUI client where you can see the topology, which is, again, self-described by the GUI server which is served to the GUI client. As I mentioned before, there is an actor model where you have one thing that defines one service. So in Karabo, this actor is called the device. And ideally, a device provides one service, for example, interact with the hardware. In this case of the picture, you see, for example, one device which is named motor — you can imagine what it does — and it’s addressable through its device ID. In this case, there’s an SA1_XTD9, etc. And it’s isolated in a broker topic. This topic name comes actually — is a nomenclature from OpenMQ, and that means essentially, imagine a chatroom where all the messages are isolated and shared. You can see this topic name on the bottom-right where you define that SA1. SA1 is the denominated SASE one.
Dr. Alessandro Silenzi: 00:17:09.495 Each device provides a schema which is a self-description of the format of its configuration and its finite state machine. So, for example, each slot — it’s a command which is visible here in the GUI in the detailed view — will declare in which state of the device it’s allowed to be issued. The same is true, for example, for some property. The schema contains a self-description like which type of value is contained in a specific key, the default value this key should have, and also some helper tools like warning levels and units, and aliases for some definition interfacing on some hardware level. It can generate configuration on either request or value update from the hardware, for example, and these updates are sent to the broker through a tool called the subscriber. So what you see there, for example, is a mixed view of schema and configuration data, so something is not supposed to change that often like the type. And some data changes continuously like the value itself, and its timestamp, and its trainID which is an event identifier which increases.
Dr. Alessandro Silenzi: 00:18:52.643 One feature that was implemented long ago in Karabo was the ability to retrieve, on user request, the history of a specific property. We could actually retrieve the full configuration of the device, but this particular feature, the retrieval of a single property, is very useful. So how did we do that? So now you can see here a nice trendline in our QT GUI client, and you see that the property is the actual position of this specific motor. This motor, by the way, moves — is the actual motor that can move one of the main distribution mirror of the [inaudible]. How do we do that? We do that through a device which is called — a system of devices which is called data loggers. And these continuously are connected to the broker and listen to changes in the broker data. It’s done by default. All devices will be listened to unless, of course, some flag is specified that the device should not be listened to. And it’s an internal data product which is mainly addressed for maintenance and performance monitoring. This is in contrast, of course, to the data acquisition, which is intentional. So it’s explicitly started. There’s a run base, and the user — lets user — we use the term “user”, but I’ll come back to that. And a system user needs to define which type of data it needs to acquire.
Dr. Alessandro Silenzi: 00:20:40.508 This, of course, will include large and fast data, data like cameras or X-ray detectors, or whatever goes in peer-to-peer connections. And it also can include only a subselection of slow data. This data that is collected is then offered as a data product to the facility users. How did we do the file logging? Now, since Karabo is a SCADA system, I realized while doing research for this presentation that the nomenclature for this product is actually an operational historian. And this was introduced in 2014 as a temporary solution. So there, the warning triangle: watch out for temporary solutions because they might stay around for a while, and that was true. So it stayed around until 2020 when we substituted it with Influx, and this talk is about that. It is still in use because it’s still available for smaller systems. It uses an ASCII backend which has, of course, limits on scaling and performance and, yeah, size. All that good stuff.
Dr. Alessandro Silenzi: 00:22:05.078 Through the Karabo, though, we were able to — through the Karabo GUI though, we were able to retrieve trendlines and configuration, which is great. So this is an example of data that is written out since — yeah, so one of the last data that we collected for this model. And you see here, for example, the — yeah, the data where some command, the command “stepDown” has been issued, and then the actual position actually changes. What is the format in this ASCII file? Now, this is going a bit into detail. So we have a pipe-separated line where we have a timestamp defined in ISO 8601, then a timestamp in seconds, UTC, of course, trainID, which is, again, this progressive accelerator [inaudible] ID, the name of the property itself, then the type, which in this case is floating-point, “FLOAT”, all uppercases, “FLOAT” — in this case, it means a 32-bit floating-point — and a user field. Namely, this is sort of an internal messaging key, an internal messaging property.
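A minimal sketch of parsing one such pipe-separated line follows. The talk names the ISO 8601 timestamp, the timestamp in seconds, the trainID, the property name, the type and a user field; the exact field order, the presence of a value column, and the sample line itself are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class LogRecord:
    iso_timestamp: str    # ISO 8601 timestamp, kept as text
    epoch_seconds: float  # same instant as seconds since the epoch, UTC
    train_id: int         # event identifier that increases over time
    property_name: str
    type_name: str        # for example "FLOAT" for a 32-bit floating point
    value: str            # assumed column: raw text, interpreted via type_name
    user: str             # the user field mentioned in the talk


def parse_log_line(line: str) -> LogRecord:
    """Split one pipe-separated line into its seven assumed fields."""
    parts = line.rstrip("\n").split("|")
    if len(parts) != 7:
        raise ValueError(f"expected 7 fields, got {len(parts)}")
    iso, epoch, train, name, type_name, value, user = parts
    return LogRecord(iso, float(epoch), int(train), name, type_name, value, user)


# Hypothetical sample line, not taken from a real log file.
sample = "20200101T120000.000000Z|1577880000.000000|123456789|actualPosition|FLOAT|1.25|operator"
record = parse_log_line(sample)
```

Retrieving the history of one property from files like this means scanning and filtering lines, which hints at why indexing machinery was needed on top of the ASCII backend.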
Dr. Alessandro Silenzi: 00:23:31.421 So this gets acquired, gets written out, and there is a separate mechanism that will parse the file of the historical data, downsample it, and send the data to the requester via the broker. So we are still essentially sharing historical views with — the historical view of the system is sharing the same communication infrastructure as data. So actually, things that are transferring, so like stopping commands, current changes, and so on. This, of course, is a warning triangle. Of course, we didn’t like that. We didn’t like the idea that we were using an ASCII file that had clear limits, and we looked for a replacement. There had to be several requirements. It had to be schema-less because the Karabo system topology is dynamic. They should be schema evolution tolerant so a Karabo device’s schema can evolve. So you could have a property that for some unfortunate reason is defined as an integer and in the future is defined as a string, or vice versa. And you could have, for example, you try a different change. You can try a different version of your device, then something doesn’t work and you want to roll back. So you can even have all the possible interaction stages.
Dr. Alessandro Silenzi: 00:25:02.815 This replacement needs to handle sparse records. So Karabo, again, is an event-driven system. Only changed data is sent, so not the full configuration is sent on every single change. And we need to be able to retrieve the history of a single property, ideally. Yeah, this was available in the file-based logging system with some internal machinery to do indexing. You can imagine all the fun we had. [laughter] And also, we should be able to retrieve the full configuration of the device, which is also a feature that the file-based logging system offered. We, let’s say, looked into time-series databases because they all looked like a good solution, and we needed something that was scalable and possibly open source, because Karabo will be released as an open-source product, and Karabo scales down nicely. So it can serve a full facility like the European XFEL. But it can also serve small lab systems, which is, of course, a positive feature to have.
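The two retrieval requirements can be sketched on top of sparse change records. The following is an illustrative sketch only: the `(timestamp, property, value)` tuple layout and the function names are assumptions, not the Karabo data model.

```python
from typing import Any, Dict, Iterable, List, Tuple

Event = Tuple[float, str, Any]  # (timestamp in seconds, property name, new value)


def property_history(events: Iterable[Event], name: str) -> List[Tuple[float, Any]]:
    """History of a single property: every recorded change, in time order."""
    return sorted((ts, value) for ts, prop, value in events if prop == name)


def configuration_at(events: Iterable[Event], when: float) -> Dict[str, Any]:
    """Full configuration at time `when`: the latest value of each property."""
    config: Dict[str, Any] = {}
    for ts, prop, value in sorted(events, key=lambda event: event[0]):
        if ts > when:
            break
        config[prop] = value  # later changes overwrite earlier ones
    return config


events = [
    (10.0, "actualPosition", 0.0),
    (10.0, "state", "ON"),
    (20.0, "actualPosition", 1.5),
    (30.0, "state", "MOVING"),
]
snapshot = configuration_at(events, 25.0)
history = property_history(events, "actualPosition")
```

Because only changes are stored, the full configuration has to be rebuilt from the most recent change of every property rather than read from a single row.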
Dr. Alessandro Silenzi: 00:26:31.202 We considered several databases: InfluxDB, Prometheus, Timescale, and also some homebrew SQL solutions briefly. But very quickly, we honed in on InfluxDB. It was meeting all of the requirements, it was widely accepted, it was well-documented and easily integratable, and was also open for expansion. So we ended up collecting the now currently 250 billion metrics in our InfluxDB nodes, and there’s an increase per month of 10 billion. But before we get to the point of how we got there, we had a prototype in 2018 that was actually offered by a colleague from an instrument. This proof of concept was available as a replacement for the data logger, but it was not really catching all the corner cases that we would like. Based on this concept, though, we put up a prototype system where we were offering the first view to a limited set of user groups as beta testers. The data was essentially migrated from the ASCII file to InfluxDB hourly, and a small Grafana installation was made available to them to browse the data.
Dr. Alessandro Silenzi: 00:27:54.553 This response for that system was positive. This, for example, I mentioned the vacuum system of the control system. This, for example, is one case in which long-term trends are very useful to look at, and they were able to monitor long-term behavior and also help with some preventative maintenance during the winter shutdown. This prototype, of course, was not sufficient to scale up to the full facility and was intentionally reducing scope, and of course, the data availability was there. So with this sort of test under our experience, we came up with — so we chose to have an enterprise solution support with InfluxData where this is more or less the topology that we have. You have devices on the left that connect to the broker and the data logger is listening to it. Then there is a Telegraf node which acts as a load balancer and outputs data to an InfluxDB open-source node which is connected to a Grafana instance and an InfluxDB cluster which is an enterprise solution.
Dr. Alessandro Silenzi: 00:29:15.322 The Grafana system, of course, will allow the users, so our internal facility, to browse historical data just by communicating between Grafana and the free and open-source node, while backward compatibility is kept in our system, where the data log reader, which is the mechanism that retrieves historical data, will directly fetch data from the InfluxDB cluster node using InfluxQL queries.
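As a rough illustration of the kind of InfluxQL query such a reader could issue for one property, here is a sketch that builds the query text only. The function name, the measurement and field names, and the idea of deriving the bucket width from a maximum number of points are assumptions for illustration, not the actual Karabo implementation.

```python
import math
from datetime import datetime, timezone


def history_query(measurement: str, field: str,
                  start: datetime, end: datetime, max_points: int) -> str:
    """Build an InfluxQL query returning at most about `max_points` mean values."""
    start = start.astimezone(timezone.utc)
    end = end.astimezone(timezone.utc)
    span_seconds = (end - start).total_seconds()
    # Widen the time buckets until the whole range fits in max_points buckets.
    interval = max(1, math.ceil(span_seconds / max_points))
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return (
        f'SELECT MEAN("{field}") FROM "{measurement}" '
        f"WHERE time >= '{start.strftime(fmt)}' AND time < '{end.strftime(fmt)}' "
        f"GROUP BY time({interval}s) fill(none)"
    )


# Hypothetical device and field names.
query = history_query(
    "EXAMPLE/MOTOR/1", "actualPosition-FLOAT",
    datetime(2021, 1, 1, tzinfo=timezone.utc),
    datetime(2021, 1, 2, tzinfo=timezone.utc),
    max_points=1440,
)
```

A real client would also need to escape quotes inside identifiers and send the query to the database over HTTP; both are left out to keep the sketch short.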
Dr. Alessandro Silenzi: 00:29:59.914 More. Now we are going into the deep technical details of our implementation. So we have, let’s say, a Karabo-to-InfluxDB translation dictionary, and you see, for example, that the broker topic now matches the database name in InfluxDB. So every topic gets a single database. And the reason for this is that we enforce device ID uniqueness per topic. So we wanted to avoid conflicts on the backend. And on top of that, there is some user segregation, sorry, user isolation to be enforced. Every device, which is our actor in the actor model, correlates with three measurements, so three, say, traditional database tables. So there are three measurements in InfluxDB: one contains the configuration, where each property name gets a field name; one measurement is for the schema, which is saved as a Base64-encoded string; and one is for the events, where we define events like when the device is instantiated, is shut down, gets its schema updated, and so on. I don’t know if that link is available to you. I sure hope so.
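The mapping described here can be sketched as a small lookup. The talk states that each topic gets one database and each device gets three measurements; the measurement name suffixes and the device ID below are hypothetical, as is the serialized schema content.

```python
import base64
from typing import Dict


def influx_layout(topic: str, device_id: str) -> Dict[str, str]:
    """Map one Karabo device to its database and its three measurements."""
    return {
        "database": topic,                  # one database per broker topic
        "configuration": device_id,         # property values, one field per property
        "schema": device_id + "__SCHEMA",   # hypothetical suffix
        "events": device_id + "__EVENTS",   # hypothetical suffix
    }


def encode_schema(serialized_schema: bytes) -> str:
    """Store the serialized schema as a Base64 string field."""
    return base64.b64encode(serialized_schema).decode("ascii")


layout = influx_layout("SA1", "EXAMPLE/MOTOR/1")
schema_field = encode_schema(b"<schema/>")
```

Keeping the schema as an opaque Base64 string avoids having to escape its content for the write protocol, at the cost of not being able to query inside it.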
Dr. Alessandro Silenzi: 00:31:34.460 More in detail on that. Well, actually, let me roll back one slide if I can. Yeah, I can. For example, if you look at the property called propertyName, we’ll get a field name called propertyName-TYPENAME. So in the example I gave you a few slides ago where there is an actualPosition with the floating-point, so with the float type, it means that the property actualPosition is matched with the actualPosition minus uppercase “FLOAT”. And this saves us from schema conflict because if a type changes, then the field key changes.
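The naming rule from this part of the talk is small enough to show directly. The function name is an assumption; the `actualPosition` and `FLOAT` names come from the talk, while `STRING` is used here as a hypothetical second type name.

```python
def field_key(property_name: str, type_name: str) -> str:
    """Field key for a property: the property name, a minus sign, the type in uppercase."""
    return f"{property_name}-{type_name.upper()}"


old_key = field_key("actualPosition", "FLOAT")
# If a later device version redefines the property with another type,
# the data lands in a different field instead of clashing with the old one.
new_key = field_key("actualPosition", "STRING")
```

Since the two keys differ, values of the old and new type never share a field, so a schema change or a rollback does not produce a type conflict on write.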
Dr. Alessandro Silenzi: 00:32:22.540 How do we translate the types? Because as we all know, I guess, as InfluxDB experts, there are four types that we can save in InfluxDB: the Boolean, floating-points, integers, and strings. So Boolean, guess what, is Boolean. Float and double get both serialized through floating-point. All integers become an integer. The one case where we are doing something a bit fishy is the unsigned int64 which does not match exactly a type, but we interpret it as an int64. This is a bit of a stretch, but let’s say in most of the cases, it’s fine. We just know that, for example, applying a mean could be not reliable on that property because when you query a mean on the database, you will get the mean of the integer version of it. Vectors of integers, vector of something, they all become a string, so the usual comma separated. A string becomes a string with some mangling escape characters and commas. A lot of escape commas, by the way, now that I think about it. Vector-hash is another special type as well as the vector string. We chose to simply convert it to the base — to keep it as a binary content encoded using Base64 algorithm and save it as a string. This saves us from mangling and all the use cases, all the quotes, and all that jazz.
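A sketch of this type translation, producing field values in InfluxDB line protocol syntax, could look as follows. The type names other than `FLOAT` are hypothetical labels, the two's complement reinterpretation of unsigned 64-bit values is one possible reading of "we interpret it as an int64", and the string handling shows only line protocol quoting, not the additional mangling the speaker mentions.

```python
import base64

INTEGER_TYPES = {"INT8", "UINT8", "INT16", "UINT16", "INT32", "UINT32", "INT64"}
BINARY_TYPES = {"VECTOR_STRING", "VECTOR_HASH"}  # value passed as serialized bytes


def _quote(text: str) -> str:
    """Quote a string field value: escape backslashes and double quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_field_value(type_name: str, value) -> str:
    """Format a value as a line protocol field value of one of the four types."""
    if type_name == "BOOL":
        return "true" if value else "false"
    if type_name in ("FLOAT", "DOUBLE"):
        return repr(float(value))
    if type_name == "UINT64":
        number = int(value)
        if number >= 2**63:  # does not fit a signed 64-bit integer
            number -= 2**64  # assumed: reinterpret the same bits as signed
        return f"{number}i"
    if type_name in INTEGER_TYPES:
        return f"{int(value)}i"
    if type_name in BINARY_TYPES:
        return _quote(base64.b64encode(value).decode("ascii"))
    if type_name.startswith("VECTOR_"):
        return _quote(",".join(str(item) for item in value))
    if type_name == "STRING":
        return _quote(value)
    raise ValueError(f"unsupported type: {type_name}")


largest_unsigned = to_field_value("UINT64", 2**64 - 1)
```

Under this assumption the largest unsigned value is stored as `-1`, which shows why an aggregate such as a mean over that field may not be reliable.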
Dr. Alessandro Silenzi: 00:34:22.505 We made some design choices, and I’d like to share some gotchas in this slide. One main detail is that our timestamps are tagged on the device side, which helps, for example, to subtract out the transfer time of information on the broker. One hidden requirement of our old file-based system was that the file-based logging system was actually preserving the insertion order. We could have kept track of this, of course, in some way, but essentially, we decided to drop this requirement. It’s fine. Keep the timestamp only. That would mean two times fewer metrics actually saved in the database. And the same thing was true for this event marker, the trainID, which in the file-based system is recorded per property. Now it’s recorded per device, which means that we have essentially roughly two times fewer metrics in the system. So this is sort of a design choice, but it was sort of hidden. And it was fine, and it only hit us in some corner cases that we needed to diagnose. There were just very few corner cases where this information was missing.
Dr. Alessandro Silenzi: 00:36:01.556 Lesson learned: please, if you implement something similar, keep in mind to add timestamp filtering to your insertion mechanism. One detail I forgot to mention is that we insert using the line protocol of Influx, and the timestamp always comes from the device itself, not from the insertion. Since we are relying on the device to have a sensible timestamp, what we exposed ourselves to without knowing is that we had some device emitting data wrongly far ahead in the future, and this was essentially causing a denial of service to the data flow because this was continuously moving data across shard boundaries.
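A minimal sketch of such a filter follows, assuming timestamps in seconds since the epoch. The function name and the two-minute tolerance are hypothetical; the talk does not say which threshold the production system uses.

```python
import time
from typing import Optional

MAX_FUTURE_SECONDS = 120.0  # hypothetical tolerance for clock drift


def accept_timestamp(device_timestamp: float,
                     now: Optional[float] = None,
                     max_future: float = MAX_FUTURE_SECONDS) -> bool:
    """Return False for points stamped too far ahead of the logger's clock."""
    if now is None:
        now = time.time()
    return device_timestamp <= now + max_future


now = 1_600_000_000.0
ok = accept_timestamp(now - 5.0, now=now)
two_weeks_ahead = accept_timestamp(now + 14 * 86400, now=now)
```

Rejected points could be dropped, counted, or logged for diagnosis; which of these is appropriate depends on how often faulty device clocks are expected.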
Dr. Alessandro Silenzi: 00:36:57.358 So we have the device that had an error, or had a bug, and was inserting data two weeks in the future. And of course, Telegraf was busy figuring out how to close an open shard, and that was sort of — we backed up the system. It was very difficult to figure out that was the case, and ultimately, we ended up sort of patching the system. Okay, now we have a filtering system that allows us to do that. We need to add a throttling property to this. And one thing that we found very useful, at least for — it’s a wealth of data that we had that we needed to migrate from file-based systems to InfluxData. So if you have this, I suggest you implement a speed tuning feature in your migration tool. This will allow you to, sort of, see in more detail the limits of your system.
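The speed tuning suggestion can be sketched as a rate-limited batch generator for a migration tool. The function names and parameters are assumptions; `sleep` and `clock` are injectable only so that the pacing can be tested without waiting.

```python
import time
from typing import Callable, Iterable, Iterator, List


def chunked(lines: Iterable[str], size: int) -> Iterator[List[str]]:
    """Group lines into lists of at most `size` items."""
    batch: List[str] = []
    for line in lines:
        batch.append(line)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def throttled_batches(lines: Iterable[str], batch_size: int,
                      max_lines_per_second: float,
                      sleep: Callable[[float], None] = time.sleep,
                      clock: Callable[[], float] = time.monotonic,
                      ) -> Iterator[List[str]]:
    """Yield batches no faster than the configured line rate."""
    min_interval = batch_size / max_lines_per_second
    last_sent = None
    for batch in chunked(lines, batch_size):
        if last_sent is not None:
            wait = min_interval - (clock() - last_sent)
            if wait > 0:
                sleep(wait)  # pause so the target database is not overloaded
        last_sent = clock()
        yield batch
```

Each yielded batch would then be written to the database by the caller; raising `max_lines_per_second` step by step is one way to find where the system starts to struggle.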
Dr. Alessandro Silenzi: 00:38:05.299 Okay. So what did we do with this system? We essentially have changed backend. The user, for the first few months, was barely able to notice the difference because it was using Karabo as its main — so the control system as its main interface to the backend. But the development was possible now through using Grafana as a secondary monitoring system and a historical diagnosis system using that. And this came at the beginning of this year when we moved to have a data operation center where engineers and, okay, reformed physicists like me could do a full-time shift of support where we were essentially supporting the experimental — the experiments in the instruments. And all this was mainly performed using a Grafana panel, which was a new skill we learned in the data department which my group is a member of. And we built several panels of that. There are more examples of this in my backup slides. So you should be able to see that when they go online. And these are built using Flux and InfluxQL. Flux, of course, is a more powerful solution but slightly more difficult to learn.
Dr. Alessandro Silenzi: 00:39:56.454 That was a positive change. It really made my life easier having a tool that is more powerful to diagnose historical data. And what do we want to do in the future? Currently, in production, we have 1.8 as a version of InfluxDB. We want to upgrade to 2.0.X, so depending on which version we will migrate. We actually have it in test to evaluate the time series index as a possible choice. And we need to collect user statistics and identify possible bottlenecks. Of course, this specific task is never done. One thing that I value as very important is exploring downsampling algorithms for display, for example, the Largest-Triangle-Three-Buckets algorithm. One thing that changed when we went from the file-based backend to Influx is that, when you were requesting a trendline, we were essentially downsampling in a random manner when extracting historical data. That, of course, led to a usability problem that triggered the inner scientist of our colleagues: if you request the same period twice, you might get different data. That, of course, was not the best user experience, and we got a lot of complaints for that.
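For readers unfamiliar with Largest-Triangle-Three-Buckets, here is a compact sketch of the algorithm on a list of `(x, y)` points. It is an illustrative implementation of the published algorithm, not code from Karabo or InfluxDB, and the spike data set is made up for the example.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def lttb(points: List[Point], threshold: int) -> List[Point]:
    """Downsample to `threshold` points, keeping visually significant ones."""
    n = len(points)
    if threshold >= n or threshold < 3:
        return list(points)
    every = (n - 2) / (threshold - 2)  # bucket width, first and last point excluded
    sampled = [points[0]]
    a = 0  # index of the most recently selected point
    for i in range(threshold - 2):
        # Average of the next bucket acts as the third corner of the triangle.
        nxt_start = int((i + 1) * every) + 1
        nxt_end = min(int((i + 2) * every) + 1, n)
        count = nxt_end - nxt_start
        avg_x = sum(p[0] for p in points[nxt_start:nxt_end]) / count
        avg_y = sum(p[1] for p in points[nxt_start:nxt_end]) / count
        # In the current bucket, pick the point forming the largest triangle.
        start = int(i * every) + 1
        end = int((i + 1) * every) + 1
        ax, ay = points[a]
        best, best_area = start, -1.0
        for j in range(start, end):
            px, py = points[j]
            area = abs((ax - avg_x) * (py - ay) - (ax - px) * (avg_y - ay)) / 2
            if area > best_area:
                best, best_area = j, area
        sampled.append(points[best])
        a = best
    sampled.append(points[-1])
    return sampled


# A flat signal with a single spike: averaging would flatten it.
signal = [(float(i), 0.0) for i in range(100)]
signal[50] = (50.0, 10.0)
reduced = lttb(signal, 10)
```

The result is deterministic for a given input, so requesting the same period twice gives the same trendline, and the spike survives because it forms the largest triangle in its bucket.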
Dr. Alessandro Silenzi: 00:41:47.080 So now we are downsampling using the mean, so the average, which is better because at least we get a consistent view of the historical data. But one thing that we lose by doing that is that we do not preserve features. So ideally, we should look into different downsampling algorithms. And if you have ideas, please bring them all to me. We need, of course, to downsample the data that we took past our internal retention policy which is three years. And since we started in — let’s say the data retention policy has been started, so this system started to inject data since the 1st of January of 2020. So we have to quickly get to that point. And one thing that will be useful is how to migrate this process data, this slow data in InfluxDB, so, of course, slow data, into the scientific data that we offer to the user as our main data product. On top of that, we will probably enable user log-in and authentication, and how that is handled in Influx in a GDPR compliant way would be interesting.
Dr. Alessandro Silenzi: 00:43:21.425 On top of that, of course, you saw on the previous slide how we monitor the facility behavior using Grafana. One alternative that would be, of course, very interesting is using Kapacitor as a sort of continuous monitoring system and alert system. We could, in principle, directly pre-alert some experts provided some conditional triggers. Okay. So this sort of closes the frontal version of this. I would like to thank you for your attention, but before doing that, I really want to stress that what I presented here would not be possible without the support from the group, my [inaudible] Control, and the Control NITDM group at the European XFEL, as well as the contribution from all over the European XFEL. If you want more information of the European XFEL, visit our website, XFEL.eu. And before going to the questions, I think there is one reminder that —
Caitlin Croft: 00:44:37.380 Yes.
Dr. Alessandro Silenzi: 00:44:39.628 So Caitlin, you want to take it away from here?
Caitlin Croft: 00:44:41.868 Yes. Thank you, Alessandro. That was fantastic. So InfluxDays is coming up in October. I can’t believe that it’s already that time of the year again. We have the Hands-On Flux Training on October 11th and 12th. I think there’s just a couple of seats left, so if you’re interested in that, please be sure to sign up. And then on October 26th to 27th is the actual conference. The conference is completely free. So it doesn’t matter where you are in the world; we’d love to see you there. So super excited to have that coming.
Caitlin Croft: 00:45:16.107 All right. So there’s lots of questions. So the first one I’m going to pick is, what does metrics in InfluxDB mean? You mentioned 240 billion of what? Measurements? Megabytes? What are the metrics?
Dr. Alessandro Silenzi: 00:45:33.715 A field value that would be, right? Does that answer?
Caitlin Croft: 00:45:40.995 Yeah, I think it’s the number of data points in InfluxDB, and it all depends on how many — the number is incredible, but it depends on the cardinality of it, right, of how much data it takes up?
Dr. Alessandro Silenzi: 00:45:55.939 Yeah. Okay, am I able to scroll through? Ah, backup. Yeah, look at that. Yeah, so if you look at the broker data rate, it’s more or less this. Through all broker topics, you see around this fluctuation. So say data rates of around 80 kilo-messages a second, and a message contains — okay, it could be a message that is not configuration change, so you need to remove some factors, but — and by the way, this is a nice Grafana panel. Yeah, so you can see that the system sort of is continuously acquiring around 8,000, around, messages a second, which is okay. So I think it’s more or less getting there. I hope that answered that.
Caitlin Croft: 00:47:02.962 You indicated that you study the structure of matter. Are you also studying the structure of anti-matter?
Dr. Alessandro Silenzi: 00:47:09.378 No. Next question. [laughter] No.
Caitlin Croft: 00:47:13.981 That’s easy enough. Is this data —?
Dr. Alessandro Silenzi: 00:47:16.124 Yeah. I can get more into that, but maybe at the end.
Caitlin Croft: 00:47:20.135 Okay. Yeah. We got lots of questions here. So does the data logger look like a serious-size storage problem?
Dr. Alessandro Silenzi: 00:47:31.526 Yeah. Ish. I mean, it’s not really that bad. So the thing that I’m honestly impressed by, if I look back, is how we got away with some ASCII system for years. Of course, there were limiting factors. For example, we were limiting the history to three months for logical data storage issues. The problem we were having more significantly was the fact that it was not really a plannable size because, being an event-driven system, you don’t really know the rate at which all the properties update. So we can count how many properties we have very well, which is around a million. Well, almost two, actually, nowadays, on average. But it’s hard to pick this number and multiply by a multiplier factor. So you need to really have a system commissioned and then you have an idea. But just to [inaudible] for my colleague at the Data Management Group, this is peanuts with respect to what they manage with large data coming from X-ray cameras acquiring at 270 frames every 10 hertz. They can have a system acquiring for a week, and we acquire around 500 terabytes. And that I see as more of a serious issue. So that challenge is part of the core problems that we solve at XFEL.
Caitlin Croft: 00:49:19.370 Cool. Let’s see. Do timestamps cause serious problems in the case of storage? If so, wouldn’t it be easier if there was one timestamp and the rest of the stream in regular time series? So all time-series data has a timestamp regardless of whether it’s regular or irregular time series data. I’m not sure if you want to speak more to the importance of having timestamp data.
Dr. Alessandro Silenzi: 00:49:51.644 So what we did, so I don’t know if I understood the question. Maybe the requester can ask the question again. But for example, we could have a configuration message coming through the data logger with data tagged on different timestamps. And what we do is essentially, every — so this would belong to one timestamp, one measurement, and all the properties that belong to that timestamp. Then when a new timestamp within the same message comes, this would go onto a different line and will also be inserted parallel. So this is not a problem. So going back and forward a few milliseconds really was not a problem for InfluxDB or Telegraf in the injection side. What was the problem was when we really had a bug on a device that was extrapolating some number and said, “No, I think this data comes from the future,” and then you had it exploding the sharding boundaries. That was dramatic. But multiple timestamps per message, that was not a problem.
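The handling described in this answer, one line per timestamp carrying all properties that share it, can be sketched in Python as follows. This is a hypothetical illustration, not the facility's logger: the function `to_lines`, the measurement name `gauge`, the property names, and the 60-second tolerance for future timestamps are assumptions. It emits InfluxDB line protocol in its simplest form (measurement, float fields, nanosecond timestamp, no tags) and assumes names without spaces, commas, or equals signs, which would otherwise need escaping. It also sets aside samples stamped too far in the future, the failure mode mentioned above.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Sample = Tuple[str, float, int]  # (property name, value, timestamp in ns)


def to_lines(measurement: str, samples: List[Sample], now_ns: int,
             max_future_ns: int = 60 * 10**9) -> Tuple[List[str], List[Sample]]:
    """Group one message's samples by timestamp into line-protocol lines.

    Samples stamped later than now_ns + max_future_ns are returned
    separately instead of being written.
    """
    by_ts: Dict[int, Dict[str, float]] = defaultdict(dict)
    rejected: List[Sample] = []
    for name, value, ts in samples:
        if ts > now_ns + max_future_ns:
            rejected.append((name, value, ts))  # likely a device clock bug
            continue
        by_ts[ts][name] = value

    lines = []
    for ts in sorted(by_ts):
        fields = ",".join(f"{k}={float(v)!r}" for k, v in sorted(by_ts[ts].items()))
        lines.append(f"{measurement} {fields} {ts}")
    return lines, rejected


# Illustrative message: two timestamps a few milliseconds apart,
# plus one sample stamped roughly ten years in the future.
now = 1_600_000_000 * 10**9
message = [
    ("pressure", 1.5, now),
    ("temperature", 20.0, now),
    ("pressure", 1.6, now + 2_000_000),
    ("pressure", 9.9, now + 10 * 365 * 86400 * 10**9),
]
lines, rejected = to_lines("gauge", message, now)
```

Because InfluxDB partitions shards by time range, filtering implausible future timestamps before writing is one way to avoid creating shards far outside the expected window. Whether to drop, clamp, or quarantine such samples is a design choice this sketch does not settle.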
Caitlin Croft: 00:51:09.094 And are you primarily processing regular or irregular time series data?
Dr. Alessandro Silenzi: 00:51:15.276 Oh, that goes into the data. I was not prepared for this very computer science question. So the not regular time series data means that, let’s say, the interval is irregular, or does it mean that the order is not regular?
Caitlin Croft: 00:51:36.079 I think yeah, it’s when the time interval is irregular.
Dr. Alessandro Silenzi: 00:51:39.702 Okay. That is not regular because by default, we — so for example, let’s say one offender, we could have, is — for example, you have vacuum gauges that are generally not polled, and they will send us new data when there is a vacuum change, so there is a change in the condition of the vacuum. So this happens rarely, hopefully. And, for example, if you have a venting or an evacuating process, you will see a lot of data vary in very compact time, and then essentially flat. So I hope that answers.
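To make the irregular case concrete, here is a small Python sketch of event-driven data like the vacuum gauge example, resampled onto a regular grid by holding the last reported value. The function `sample_and_hold`, the event times, and the pressure values are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right
from typing import List, Optional, Sequence, Tuple

Event = Tuple[int, float]  # (timestamp in seconds, value reported on change)


def sample_and_hold(events: List[Event], grid: Sequence[int]) -> List[Tuple[int, Optional[float]]]:
    """For each grid time, return the most recent event value (None before the first)."""
    times = [t for t, _ in events]
    result = []
    for g in grid:
        i = bisect_right(times, g)  # number of events at or before g
        result.append((g, events[i - 1][1] if i else None))
    return result


# Illustrative gauge: long quiet period, then a burst of changes around t = 100.
events = [(0, 1e-9), (100, 1e-3), (101, 5e-4), (102, 1e-6)]
regular = sample_and_hold(events, range(0, 201, 50))
```

The regular view shows the quiet period and the state after the burst, but the intermediate reading at t = 101 falls between grid points and disappears. This is why event-driven storage keeps the irregular timestamps rather than forcing a fixed interval.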
Caitlin Croft: 00:52:22.412 All right. We’re going to take one more question. Does the bandwidth that hard drives use in data acquisition require uninterrupted transmission? If there were hard disks enabling uninterrupted data transport by the only controller, the second one to read and modify, would it be used in the construction of the data acquisition system for the accelerator?
Dr. Alessandro Silenzi: 00:52:50.592 Oh, I do not — I’m not sure how to answer that. So if there were hard disks enabling uninterrupted data transport — so yeah, I’m not sure I understood the question, to be honest. But the —
Caitlin Croft: 00:53:16.773 Michael, I know you asked this question. You should be able to un-mute yourself right now if you want to expand upon your question.
Dr. Alessandro Silenzi: 00:53:24.064 Yeah, that would be great.
Caitlin Croft: 00:53:26.459 Not to put you on the spot. If you feel comfortable speaking, go for it.
Attendee: 00:53:30.969 Hi. Okay —
Caitlin Croft: 00:53:32.009 Hi.
Dr. Alessandro Silenzi: 00:53:32.510 Hi.
Attendee: 00:53:33.008 I’m Michael.
Dr. Alessandro Silenzi: 00:53:34.129 Thanks for — hi, Michael.
Attendee: 00:53:36.549 Nice to hear you, Alessandro. The basis of the question is that I was working 10 years for WD, and I have a peer with a kind of idea of construction of hard drives with a dual actuator and a dual controller. I felt that in such a use case as an accelerator, there could be a situation where hard drives wouldn’t be some kind of bottleneck, and there could appear some kind of drive that will provide an uninterrupted stream of data tied to the disk. As you know, when you are writing something to disk, a strange situation happens and sometimes data buffers, especially for SSD drives, when you’re trying to put data on SSD drives and you are not able to after 20 megabytes or something like that because your buffers are full. In case of hard drives, you can create some kind of wide pipe for such data throughput. My question is, if such drives appear on the market, would they be usable, or are you not interested in such a construction?
Dr. Alessandro Silenzi: 00:55:18.297 Okay. So this sounds like a promising feature, to be honest —
Attendee: 00:55:23.286 That’s my idea, actually.
Dr. Alessandro Silenzi: 00:55:25.746 Yeah. Okay, then good idea. So one thing we — but maybe not for — so really the level we have with the standard hard drives for time-series database, it’s sufficient. Maybe that’s some experimental support and that for data acquisition really for the larger volume of data that we have might be interesting in the sense that you could have data acquisition coming in one go and some monitoring done on the live data on the similar issue. But maybe we get away with just looking at the data before it gets written out because anyway, sort of all the system is distributed, so maybe the bottlenecks are somewhere else. I don’t know. But I’ll send this idea through because that sounds interesting for sure.
Attendee: 00:56:33.072 Thanks a lot.
Dr. Alessandro Silenzi: 00:56:34.411 Thank you.
Caitlin Croft: 00:56:36.502 Awesome. Well, thank you, everyone, for joining today’s webinar. There were a lot of fantastic questions. Thank you, Alessandro. I think you did a fantastic job. If you guys have any more questions for him, please feel free to email me. I’m happy to connect you with him. Once again, this talk will be made available for replay, and you can review the slides by tomorrow morning. Thank you, everyone, for joining. Thank you, Alessandro.
Dr. Alessandro Silenzi: 00:57:06.828 Bye.
Dr. Alessandro Silenzi
Team Leader, Controls Development Team, European XFEL GmbH
Alessandro Silenzi is the Team Leader of Control System Development at the European XFEL GmbH. In his role, he is responsible for the maintenance and development of the Karabo Control System. Since the winter of 2019, the European XFEL control system has used InfluxDB to store the experimental conditions.