<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>InfluxData Blog - Rohan Sreerama</title>
    <description>Posts by Rohan Sreerama on the InfluxData Blog</description>
    <link>https://www.influxdata.com/blog/author/rohan-sreerama/</link>
    <language>en-us</language>
    <lastBuildDate>Wed, 09 Sep 2020 06:00:54 -0700</lastBuildDate>
    <pubDate>Wed, 09 Sep 2020 06:00:54 -0700</pubDate>
    <ttl>1800</ttl>
    <item>
      <title>A Deep Dive into Machine Learning in Flux: Naive Bayes Classification</title>
      <description>&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Machine_learning"&gt;Machine learning&lt;/a&gt; – the practice of writing algorithms that improve automatically through experience – has become a buzzword nowadays that connotes to something otherworldly and on the bleeding edge of technology. I’m here to tell you while that may be true, getting started with machine learning doesn’t have to be hard!&lt;/p&gt;

&lt;p&gt;InfluxData annually hosts a Hackathon for interns. As an intern myself, I joined a team of three to take a stab at implementing a Slack-incident classifier to improve the daily workflow for the Compute Team. Although &lt;a href="https://scikit-learn.org/stable/"&gt;Scikit-learn&lt;/a&gt; (a machine learning library for Python) can easily classify Slack incidents, we decided to write a classifier from scratch in &lt;a href="https://docs.influxdata.com/flux/v0.65/introduction/getting-started/"&gt;Flux&lt;/a&gt; (InfluxData’s data scripting and query language). At times it was frustrating, at times fun, but overall it was exhilarating. We chose Flux for its unique data-intensive capabilities, which let us operate succinctly on time series data.&lt;/p&gt;

&lt;p&gt;Our goal: Classify data using a Naive Bayes Classifier written in Flux.&lt;/p&gt;
&lt;h2&gt;What is a Naive Bayes Classifier?&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Naive_Bayes_classifier"&gt;Naive Bayes classifiers&lt;/a&gt; are a family of simple probabilistic classifiers based on applying &lt;a href="https://en.wikipedia.org/wiki/Bayes%27_theorem"&gt;Bayes’ theorem&lt;/a&gt; with strong (naïve) independence assumptions between the features. A &lt;a href="https://en.wikipedia.org/wiki/Probabilistic_classification#:~:text=In%20machine%20learning%2C%20a%20probabilistic,the%20observation%20should%20belong%20to."&gt;probabilistic classifier&lt;/a&gt; is a classifier that is able to predict, given an observation of an input, a probability distribution over a set of classes, rather than only outputting the most likely class that the observation should belong to. After a probabilistic classifier like Naive Bayes has been trained, predictions are made by determining the class with the highest probability.&lt;/p&gt;

&lt;p&gt;&lt;img class="wp-image-250161 size-full" src="/images/legacy-uploads/bayes-theorem.png" alt="bayes theorem" width="937" height="426" /&gt;&lt;/p&gt;
&lt;figcaption&gt;&lt;a href="https://en.wikipedia.org/wiki/Bayes%27_theorem#/media/File:Bayes'_Theorem_MMB_01.jpg"&gt;Bayes Theorem&lt;/a&gt;&lt;/figcaption&gt;
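&lt;p&gt;To make the theorem concrete, here is a quick numeric sketch in Python (the probabilities below are made up for illustration and are not from our dataset):&lt;/p&gt;

```python
# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
# Toy numbers, assumed purely for illustration:
p_airborne = 0.25                # prior P(A): 25% of animals are airborne
p_aquatic_given_airborne = 0.10  # likelihood P(B|A): airborne animals are aquatic 10% of the time
p_aquatic = 0.40                 # evidence P(B): 40% of all animals are aquatic

# Posterior P(A|B): probability an animal is airborne, given it is aquatic
p_airborne_given_aquatic = p_aquatic_given_airborne * p_airborne / p_aquatic
print(p_airborne_given_aquatic)  # 0.0625
```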

&lt;p&gt;We embarked on a daunting task by learning and writing code in Flux, a functional programming language, in a span of 2 days.&lt;/p&gt;
&lt;h2&gt;A couple of things about Flux&lt;/h2&gt;
&lt;ol&gt;
  &lt;li&gt;Functional programming is different! Flux takes advantage of this by eliminating typical code constructs and using its own special functions which efficiently execute data-intensive tasks.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;code class="language-javascript"&gt;map(fn: (r) =&amp;gt; ({ _value: r._value * r._value}))&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://v2.docs.influxdata.com/v2.0/reference/flux/stdlib/built-in/transformations/map/"&gt;Flux map function&lt;/a&gt; iterates through every record applying a specified operation.&lt;/p&gt;

&lt;ol start="2"&gt;
  &lt;li&gt;Working with data is mostly done in a tabular fashion. In other words, you pass in and return enormous tables of data in functions, which makes performing complex calculations very intuitive and friendly. Take the &lt;a href="https://v2.docs.influxdata.com/v2.0/reference/flux/stdlib/built-in/transformations/aggregates/reduce/"&gt;Flux reduce function&lt;/a&gt; for example - it computes aggregate data for entire columns using a reducer function:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class="line-numbers"&gt;&lt;code class="language-javascript"&gt;reduce(
     fn: (r, accumulator) =&amp;gt; ({ sum: r._value + accumulator.sum }), 
     identity: {sum: 0.0}
)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With that in mind, let’s get to the meat of this thing!&lt;/p&gt;

&lt;h2&gt;What our demo does&lt;/h2&gt;

&lt;p&gt;What we’ve got for you &lt;a href="https://github.com/RohanSreerama5/Naive-Bayes-Classifier-Flux"&gt;on GitHub&lt;/a&gt; is a Naive Bayes classifier implementation that currently predicts the following:&lt;/p&gt;

&lt;p&gt;&lt;code class="language-markup"&gt;P(Class | Field)&lt;/code&gt;          (Probability of a class given a field)&lt;/p&gt;

&lt;p&gt;Our dataset contains binary information about zoo animals. For instance, a buffalo has numerous fields such as backbone, feathers, and eggs. Each field is assigned a binary value (1 or 0) depending on whether that trait is present in the animal.&lt;/p&gt;

&lt;p&gt;&lt;img class="wp-image-250164 size-full" src="/images/legacy-uploads/naive-bayes-classification-flux.png" alt="naive bayes classification flux" width="977" height="441" /&gt;&lt;/p&gt;
&lt;figcaption&gt;Zoo dataset (binary)&lt;/figcaption&gt;

&lt;p&gt;Our implementation currently makes predictions based on a single input field, aquatic. Our classifier predicts whether an animal is airborne or not. We used a Python script to write data into an InfluxDB &lt;a href="https://v2.docs.influxdata.com/v2.0/organizations/buckets/create-bucket/"&gt;bucket&lt;/a&gt;. In doing so, &lt;code class="language-javascript"&gt;'airborne'&lt;/code&gt; is set up as a &lt;a href="https://v2.docs.influxdata.com/v2.0/reference/key-concepts/data-elements/#tags"&gt;tag&lt;/a&gt; in the Pandas DataFrame to initialize it as a Class, and the rest of the attributes are defaulted to an InfluxDB &lt;a href="https://v2.docs.influxdata.com/v2.0/reference/glossary/#field"&gt;field&lt;/a&gt; type.&lt;/p&gt;
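&lt;p&gt;As a rough sketch of that write step, each DataFrame row ends up as an InfluxDB line-protocol point, with the tag in the series key and the remaining attributes as integer fields. The helper and the column values below are illustrative, not our actual ingestion script:&lt;/p&gt;

```python
# Build an InfluxDB line-protocol string for one zoo animal.
# Tags (here "airborne" and the animal's name) become part of the series key;
# everything else is written as an integer field (trailing "i").
def to_line_protocol(measurement, tags, fields, timestamp_ns):
    tag_str = ",".join(f"{k}={v}" for k, v in tags.items())
    field_str = ",".join(f"{k}={v}i" for k, v in fields.items())
    return f"{measurement},{tag_str} {field_str} {timestamp_ns}"

line = to_line_protocol(
    "zoo-data",
    {"airborne": 0, "Animal_name": "buffalo"},
    {"aquatic": 0, "backbone": 1, "feathers": 0, "eggs": 0},
    1577923200000000000,  # 2020-01-02T00:00:00Z in nanoseconds
)
print(line)
# zoo-data,airborne=0,Animal_name=buffalo aquatic=0i,backbone=1i,feathers=0i,eggs=0i 1577923200000000000
```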

&lt;p&gt;The following Flux code is the beginning of our Naive Bayes function. It allows you to define a prediction Class, bucket, prediction field, and a measurement. It then filters your dataset to divide it into training and test data.&lt;/p&gt;
&lt;pre class="line-numbers"&gt;&lt;code class="language-javascript"&gt;naiveBayes = (myClass, myBucket, myField, myMeasurement) =&amp;gt; {

training_data = 
     from(bucket: myBucket)
     |&amp;gt; range(start: 2020-01-02T00:00:00Z, stop: 2020-01-06T23:00:00Z)     // data for 3 days
     |&amp;gt; filter(fn: (r) =&amp;gt; r["_measurement"] == "zoo-data" and r["_field"] == myField) 
     |&amp;gt; group() 

test_data = 
     from(bucket: myBucket) 
     |&amp;gt; range(start: 2020-01-01T00:00:00Z, stop: 2020-01-01T23:00:00Z)    // data for 1 day 
     |&amp;gt; filter(fn: (r) =&amp;gt; r["_measurement"] == "zoo-data" and r["_field"] == myField) 
     |&amp;gt; group() 
...&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can predict the following:&lt;/p&gt;

&lt;p&gt;&lt;code class="language-markup"&gt;P(airborne | aquatic)&lt;/code&gt;        (Probability that a given animal is airborne provided whether it is aquatic or not)&lt;/p&gt;

&lt;p&gt;For instance, the probability that an antelope is airborne, given that it is not aquatic, is about 69%.&lt;/p&gt;

&lt;p&gt;&lt;code class="language-markup"&gt;P(antelope airborne | !aquatic) = 0.688&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;img class="wp-image-250165 size-full" src="/images/legacy-uploads/bayes-theorem-probability-flux.png" alt="bayes theorem probability flux" width="945" height="790" /&gt;&lt;/p&gt;
&lt;figcaption&gt;Predictions for the Probability that an animal is airborne given whether it is aquatic or not&lt;/figcaption&gt;
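&lt;p&gt;Under the hood, each of these predictions is just Bayes’ theorem applied to ratios of counts over the training window. A minimal Python sketch with a handful of invented rows (not our real zoo data):&lt;/p&gt;

```python
# Estimate P(airborne | not aquatic) with Bayes' theorem from binary rows.
# These five rows are invented for illustration.
training = [
    {"airborne": 1, "aquatic": 0},
    {"airborne": 1, "aquatic": 0},
    {"airborne": 0, "aquatic": 0},
    {"airborne": 0, "aquatic": 1},
    {"airborne": 0, "aquatic": 1},
]

n = len(training)
p_class = sum(r["airborne"] for r in training) / n        # P(airborne)
airborne = [r for r in training if r["airborne"] == 1]
p_x_given_class = sum(
    1 for r in airborne if r["aquatic"] == 0
) / len(airborne)                                         # P(not aquatic | airborne)
p_x = sum(1 for r in training if r["aquatic"] == 0) / n   # P(not aquatic)

p_class_given_x = p_x_given_class * p_class / p_x         # Bayes' theorem
print(round(p_class_given_x, 3))                          # 0.667
```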

&lt;h2&gt;How does all this work?&lt;/h2&gt;

&lt;p&gt;We’ve essentially divided our dataset based on time: 3 days for training and 1 day for testing. After some data preparation, we create a probability table that is calculated using only our training data. At this point, our model is trained and ready to be tested. Finally, in &lt;code class="language-javascript"&gt;predictOverall()&lt;/code&gt; we perform an inner join of this table with our test data to compute an overall probability table that contains predictions for animal characteristics.&lt;/p&gt;
&lt;pre class="line-numbers"&gt;&lt;code class="language-javascript"&gt;...
// calculated probabilities for training data
Probability_table = join(tables: {P_k_x_class: P_k_x_class, P_value_x: P_value_x},
    on: ["_value", "_field"], method: "inner") 
    |&amp;gt; map(fn: (r) =&amp;gt; 
    	({r with Probability: r.P_x_k * r.p_k / r.p_x}))

// predictions for test data computed 
predictOverall = (tables=&amp;lt;-) =&amp;gt; {
    r = tables
        |&amp;gt; keep(columns: ["_value", "Animal_name", "_field"])
    output = join(tables: {Probability_table: Probability_table, r: r},
        on: ["_value"], method: "inner")
    return output
}

test_data |&amp;gt; predictOverall() |&amp;gt; yield(name: "MAIN") 
...&lt;/code&gt;&lt;/pre&gt;
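&lt;p&gt;Conceptually, that final inner join pairs each test row with the matching row of the trained probability table, much like a pandas merge. The tables below are simplified stand-ins for the ones our Flux code builds:&lt;/p&gt;

```python
import pandas as pd

# Trained probability table: P(airborne | field value), one row per (field, value).
# The numbers mirror the antelope example above; the rest are illustrative.
prob_table = pd.DataFrame({
    "_field": ["aquatic", "aquatic"],
    "_value": [0, 1],
    "Probability": [0.688, 0.312],
})

# Test data: the observed field value for each animal.
test_data = pd.DataFrame({
    "Animal_name": ["antelope", "seal"],
    "_field": ["aquatic", "aquatic"],
    "_value": [0, 1],
})

# Inner join on the observed value attaches a prediction to each animal.
predictions = test_data.merge(prob_table, on=["_field", "_value"], how="inner")
print(predictions[["Animal_name", "Probability"]])
```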
&lt;p&gt;In the future, we plan to support multiple prediction fields and to leverage the power of &lt;a href="https://en.wikipedia.org/wiki/Kernel_density_estimation"&gt;density functions&lt;/a&gt; to make more interesting predictions. We also hope to make this project production-ready by saving trained models externally in a &lt;a href="https://www.influxdata.com/glossary/sql/"&gt;SQL&lt;/a&gt; table. This project will soon join the &lt;a href="https://v2.docs.influxdata.com/v2.0/reference/flux/stdlib/contrib/"&gt;InfluxData Flux open-source contribution library&lt;/a&gt;, so stay in the loop!&lt;/p&gt;

&lt;p&gt;You can run this demo with zero Flux experience! So what are you waiting for? Get started with &lt;a href="https://github.com/RohanSreerama5/Naive-Bayes-Classifier-Flux"&gt;ML in Flux right here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img class="size-full wp-image-250167" src="/images/legacy-uploads/naive-bayes-flux-influxdb.png" alt="naive bayes flux influxdb" width="937" height="734" /&gt;&lt;/p&gt;
&lt;figcaption&gt;An overview of Naive Bayes Classification in InfluxDB Data Explorer&lt;/figcaption&gt;

&lt;p&gt;Huge thanks to Adam Anthony and Anais Dotis-Georgiou for their invaluable guidance and support during this project. And much love to Team Magic: Mansi Gandhi, Rose Parker, and me. Be sure to follow the &lt;a href="/blog/"&gt;InfluxData blog&lt;/a&gt; for more cool demos!&lt;/p&gt;

&lt;h3&gt;Relevant links:&lt;/h3&gt;

&lt;ul&gt;
 	&lt;li&gt;InfluxDB: &lt;a href="https://github.com/influxdata/influxdb"&gt;https://github.com/influxdata/influxdb&lt;/a&gt;&lt;/li&gt;
 	&lt;li&gt;ML Datasets: &lt;a href="https://archive.ics.uci.edu/ml/datasets.php"&gt;https://archive.ics.uci.edu/ml/datasets.php&lt;/a&gt;&lt;/li&gt;
 	&lt;li&gt;More great demos: &lt;a href="/blog/"&gt;https://www.influxdata.com/blog/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
      <pubDate>Wed, 09 Sep 2020 06:00:54 -0700</pubDate>
      <link>https://www.influxdata.com/blog/deep-dive-into-machine-learning-in-flux-naive-bayes-classification</link>
      <guid isPermaLink="true">https://www.influxdata.com/blog/deep-dive-into-machine-learning-in-flux-naive-bayes-classification</guid>
      <category>Product</category>
      <category>Use Cases</category>
      <category>Developer</category>
      <author>Rohan Sreerama (InfluxData)</author>
    </item>
  </channel>
</rss>
