Automated Stock Trading Model Using InfluxDB
This article was written by Tyler Nguyen in collaboration with Alex Wang, Andrew Depke, Weston Tuescher, and Neema Sadry.
Every year, the go-getter folks over at InfluxDB organize a friendly hackathon for the interns. Interns from every department come together to brainstorm ideas that can be built in the short span of one week. For the 2022 intern hackathon, the eleven interns were split into two teams. To start the flow of ideas, there was a spreadsheet of possible projects, which detailed several brilliant creations and reflected the creativity of the intern team.
Ultimately, the two projects chosen were a three-dimensional interactive globe visualization for InfluxDB and a stock market sentiment bot. I was assigned to the latter, on a team with four other members: user interface team intern Alex Wang, automations and compute team intern Andrew Depke, Flux team intern Weston Tuescher, and e-commerce team intern Neema Sadry.
After receiving our assignments, the team and I hopped into a Zoom meeting, and our synergy was so strong and effortless that we wasted no time and immediately started working. Our initial vision for the project went as follows: scrape stock data at a fixed frequency using available APIs, funnel that information into InfluxDB using the Python Client library, downsample the data in the buckets using Flux queries, run a machine learning model to predict each stock's trajectory, and graph all the relevant information on a simple website interface.
To kick off our discussion, we laid out a rough plan, documenting all the integral parts of the project, the specifications and requirements of each part, and even each part's estimated time of completion. From there, the conversation turned to our individual strengths and weaknesses, experiences with different technologies, and past projects. Helpfully, Andrew drew up a visual representation of how the project would work:
This rough overview of the project allowed us to easily break each part into smaller tasks that could be given to each team member. For ease of use, we chose a Google Doc to organize and maintain an up-to-date log of our progress in a table; in this document, we also included sections for helpful resources, contact information, and a list of technologies we were going to use. Now that we had a platform to organize the logistics of the project, we needed a place to organize the project itself: GitHub was another easy pick because of its popularity and the small scale of our project.
Source: Screenshot of our Google Doc for organizing the tasks of the projects
Hacking away: My perspective
As we all went our separate ways to tackle our tasks, my first job was to familiarize myself with the InfluxDB Python Client library so that I could store the stock data in InfluxDB Cloud for querying. InfluxDB makes the documentation readily available in its documentation section and in the dedicated Python section of the Cloud platform. Even though the documentation provided a helpful starter block of code, I still ran into some errors, specifically an SSL verification error that should not have occurred, considering I was only trying to run the given code. I tried many fixes with no success, from consulting the very engaging InfluxDB community to using pip to reinstall the SSL packages that come with Python. Finally, I reached out to my mentors, Zoe Steinkamp and Anais Georgiou, who quickly helped me come up with a workaround: replacing my organization email with my user ID in the “org” variable fixed the issue.
Now that I was able to send data to InfluxDB from Python, I could start planning how to write Python scripts to get the web-scraped data into a format that InfluxDB Cloud can understand. I coordinated with Neema Sadry, our team member who was in charge of scraping the stock market data from the APIs, and we agreed that outputting the scraped data to a CSV file, with each line containing the ticker symbol, current price, and company name, would be the easiest way to package and parse the information within Python.
Due to the way we assigned our tasks, I could not work with the real stock data until Neema finished the scraping tool, so for testing purposes, I temporarily made my own CSV in the format we agreed on. My program was very simple: open the CSV, loop through each line, split the line into a list on every comma, then index into the list and feed each element into the appropriate field or tag in InfluxDB Cloud. After running the program with real stock data, this is what the bucket looked like with all the stock data in it:
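The parsing loop described above can be sketched roughly as follows. This is a minimal illustration, not our original script: it uses only the Python standard library and builds InfluxDB line protocol strings by hand instead of calling the client library. The measurement name "stock_price" and the choice of ticker and company name as tags with price as a field are assumptions made for the example.

```python
import csv
import io

MEASUREMENT = "stock_price"  # hypothetical measurement name


def escape_tag(value):
    """Escape commas, equals signs, and spaces, which are special in tag values."""
    for ch in (",", "=", " "):
        value = value.replace(ch, "\\" + ch)
    return value


def csv_to_line_protocol(csv_text):
    """Turn 'ticker,price,company' rows into line protocol strings."""
    lines = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) != 3:
            continue  # skip blank or malformed rows
        ticker, price, company = (cell.strip() for cell in row)
        try:
            price_value = float(price)
        except ValueError:
            continue  # skip header rows or unparsable prices
        # Assumed schema: ticker and company are tags, price is a float field.
        # No timestamp is given, so the server would assign the write time.
        lines.append(
            f"{MEASUREMENT},ticker={escape_tag(ticker)},"
            f"company={escape_tag(company)} price={price_value}"
        )
    return lines


sample = "AAPL,150.25,Apple Inc.\nMSFT,280.5,Microsoft Corporation\n"
lines = csv_to_line_protocol(sample)
for line in lines:
    print(line)
```

Using the csv module instead of splitting on commas by hand keeps company names that contain a quoted comma intact. The resulting strings could then be handed to the client library's write API, which accepts line protocol records.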
Source: Data Explorer featuring our bucket that contains relevant stock data
Throughout the arduous week of problem solving, we slowly uncovered a host of obstacles that we had not anticipated. For example, we quickly found out that most market simulations are not free, meaning that we had to roughly estimate the behavior of our model and that some accuracy would have to be left on the table.
Another challenge we faced was that machine learning with time-series databases is not like traditional machine learning, in the sense that you need to architect models around relative periods of time instead of bulk feeding timestamp-labeled data. Lastly, it was apparent that creating an efficient way to process and send large volumes of stock data between the client and server was inherently difficult because of the sheer load and processing time it would take. Two possible solutions we thought about were caching the stock data so that we did not have to fetch the results each time, and loading the stocks dynamically rather than in bulk.
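The caching idea mentioned above was never built during the hackathon, but a minimal sketch of it might look like the following. Everything here is hypothetical: the TTLCache class, the thirty-second lifetime, and the fake_fetch function that stands in for a real query against the server.

```python
import time


class TTLCache:
    """Remember fetched stock data for a fixed number of seconds."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch          # function that loads data for one ticker
        self._ttl = ttl_seconds      # how long a cached entry stays fresh
        self._clock = clock          # injectable clock, handy for testing
        self._entries = {}           # ticker -> (expires_at, value)

    def get(self, ticker):
        now = self._clock()
        entry = self._entries.get(ticker)
        if entry is not None and entry[0] > now:
            return entry[1]          # still fresh: no fetch needed
        value = self._fetch(ticker)  # missing or expired: fetch again
        self._entries[ticker] = (now + self._ttl, value)
        return value


calls = []


def fake_fetch(ticker):
    """Stand-in for a real query; records each call so reuse is visible."""
    calls.append(ticker)
    return {"ticker": ticker, "price": 100.0}


cache = TTLCache(fake_fetch, ttl_seconds=30)
cache.get("AAPL")
cache.get("AAPL")  # served from the cache, so fake_fetch runs only once
print(calls)
```

The lifetime is the main design choice: a longer one saves more round trips but shows staler prices, so it would have to be tuned against how often the underlying data actually changes.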
Though we had our fair share of challenges, there were a number of concepts that we deployed successfully and implemented smoothly.
Source: Yahoo Finance and Web Scraping Code
Our web scraping utility worked exactly as expected, which was imperative to our project as a whole. The data was scraped, cleaned, and sanitized without any problems and was easily sent to InfluxDB.
Source: Resulting model and illustration for how we trained our model
Our model ran at consistent intervals: it pulled the latest stock data from our cloud instance, predicted how each stock would perform one minute in the future, and decided whether to buy, sell, or hold the stock. The model did not behave as expected: in the illustration, the green line shows what the output was supposed to look like, and the red dotted line shows what it actually looked like. We concluded that this was due to not having enough time to train the model with enough data, as opposed to a technical failure. With more time, we could refine the model so that reality matches expectations.
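To illustrate the final step of that loop, here is a minimal sketch of how a one-minute-ahead prediction could be turned into a buy, sell, or hold decision. The rule and the 0.1% threshold are assumptions chosen for the example; they are not the logic our model actually used.

```python
def decide(current_price, predicted_price, threshold=0.001):
    """Map a predicted price to 'buy', 'sell', or 'hold'.

    threshold is the minimum relative change (0.001 = 0.1%) that is
    treated as worth acting on; it is a hypothetical value.
    """
    if current_price <= 0:
        raise ValueError("current_price must be positive")
    change = (predicted_price - current_price) / current_price
    if change > threshold:
        return "buy"    # predicted to rise by more than the threshold
    if change < -threshold:
        return "sell"   # predicted to fall by more than the threshold
    return "hold"       # predicted move is too small to act on


print(decide(100.0, 101.0))
print(decide(100.0, 99.0))
print(decide(100.0, 100.05))
```

A dead band around zero like this keeps the bot from trading on tiny predicted moves that are more likely noise than signal.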
Source: Web UI to display stock data
Our simple web application, which displayed stock data in a dashboard, was linked to our InfluxDB Cloud instance. As frequently as the Cloud updated, a Python script updated the website: it downloaded new data from the bucket in a JSON format understood by our interface and appended it to the data already displayed on the website.
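The append step can be sketched as below. The JSON shape (a list of objects with "ticker", "time", and "price" keys) is an assumption made for the example, as is the append_new_points helper; the format our interface actually used is not shown here.

```python
import json


def append_new_points(current, new_json):
    """Append points from a JSON payload to the displayed data, skipping repeats."""
    # Track which (ticker, time) pairs are already on screen.
    seen = {(point["ticker"], point["time"]) for point in current}
    for point in json.loads(new_json):
        key = (point["ticker"], point["time"])
        if key not in seen:
            current.append(point)
            seen.add(key)
    return current


displayed = [{"ticker": "AAPL", "time": "2022-07-01T14:00:00Z", "price": 150.0}]
payload = json.dumps([
    {"ticker": "AAPL", "time": "2022-07-01T14:00:00Z", "price": 150.0},
    {"ticker": "AAPL", "time": "2022-07-01T14:00:30Z", "price": 150.4},
])
append_new_points(displayed, payload)
print(len(displayed))
```

Skipping points that are already displayed means the script can safely re-download an overlapping time window without drawing duplicates.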
As the project came to an end, we reflected on the long week of coding. There are definitely things we could have done differently and improved on. To start, a brand new trading model built to work natively with multiple stocks, highly varied data, and timestamped data would be much more effective in our opinion. It is also possible that better integration with Telegraf, as a way to run our model automatically when new data is present, would be more productive than our current automation of running a script every thirty seconds. If we wanted to scale the project to work with more than just a handful of stocks, we would need to consider the effect of the load size on the client and server. Having efficient ways to store and retrieve data is crucial, because large amounts of data can quickly slow the server down as the size of the project increases. The final major change we would make is extracting sentiment data in addition to ticker data, because it can lead to much more profitable models, despite how hard it is to obtain.
Overall, the InfluxDB 2022 Intern Hackathon was a rich and insightful experience. Not only did it allow us to experience the learning curve that comes with using new technologies, but we were also able to brush up on old skills that had not been useful in recent times and reinforce relevant skill sets, which is always beneficial as a programmer. Within the short time of one week, we were able to build a relatively complete program that is versatile enough to be used in larger-scale applications, and it is only a hint of what future interns will build at InfluxDB.