Have you seen this when using the Flux code editor?
Pretty neat to be able to get that much help from an editor while writing code.
Have you ever wondered how that worked? Today, I’ll take us on a bit of a deep dive on the behind-the-scenes that enable these autocompletion features in the editor.
How does a code editor know enough about the code you write in order to help you out?
The answer is something called static analysis through which we can analyze a Flux script and learn lots of information about what the code does without actually running the code. It is important that we do not have to run the code to learn about it because — as the author of the code — you may not be ready to run the code yet as it may edit a file or write data to a database. That is why the technique is called static analysis; the analysis is performed on only static inputs, i.e. the Flux code.
What do we want to learn about the code when we analyze it?
The primary information we are trying to learn when we analyze the code are the types of the expressions in the code. For example, if you have this line in the code
threshold = 42, we learn that there is a variable named
threshold and that its type is an integer. In the animation above, you can see that the editor was able to suggest the arguments to the
aggregateWindow function. This is possible because we know the type of the function
aggregateWindow and the type includes the information about what arguments the function accepts.
Types are a way of specifying which values make sense in what contexts. So when you type
aggregateWindow we know that the name
aggregateWindow is a variable in the program scope that has the function type, so we can make suggestions for how to complete the rest of the code using that type.
How does static analysis algorithm learn the types of the expressions and variables in the code?
We use a static analysis technique called type inference, which is the idea that looking at the code you can infer the types based on how you see the code used. For example given this code, what do you think the type of the
m variable is?
m = "Hello World" print(m)
m is a string. It is clear from the code even though the word
string is not present in the code. Contrast that code with the Go code equivalent:
var string m = "Hello World" print(m)
string keyword tells the Go compiler that the variable
m has the string type.
Type inference is the idea that Flux can infer the type of the values within the program based on the context that those values are used. This means that authors of the code do not need to explicitly state the types of variables in their code.
If you are familiar with Go, you may be thinking that you can use the
:= syntax instead. That is correct and is in fact a limited case of type inference within Go as well.
Flux does not require type annotations in the code because it can infer the type of any value without them.
While the above examples are simple to infer the types, in general it’s not easy to determine what the types are for an arbitrary Flux script by just looking at it. In order to consistently and accurately determine the types in any Flux program, we need a set of rules (i.e. an algorithm) to define how types can be inferred. The Flux team uses a specific algorithm for type inference known as Algorithm W.
The specifics of Algorithm W are very nuanced, so we will hand-wave away most of those details. Instead, we will cover here just the raw structure of the algorithm.
To understand how Algorithm W works, let’s first look at another problem that may be familiar to you. Remember algebra and solving a system of equations? It’s those problems where they gave two equations that had two variables, and you had to solve for both variables. Those problems are solved in a very similar manner to how Flux solves for the types in a program using Algorithm W. For example, given these two equations with variables
y, you can solve for the variables.
2x + y = 5 y = 2 + x
The process of solving these variables involves substituting one variable equation into the other equations. We have an equation of what
y is equal to, so we can substitute all values of
y in the first equation with the right-hand side of the
y = equation like this:
2x + (2 + x) = 5
Now that we have only a single variable
x in the equation, we solve for
x using the rules of algebra. Here are the steps explicitly:
2x + (2 + x) = 5 3x + 2 = 5 3x = 3 x = 1
We have solved for
x and learned its value is
1. Now we can use the equation for
y to determine its value.
y = 2 + x y = 3
Using this process, we have solved for the values of
Type inference is similar except that instead of solving a system of algebraic equations, we solve equations about types. We use substitution and the rules of the types to solve for the type of every expression in a Flux program.
Here is an overview of the process:
- Assign a type variable to each Flux expression in the program.
- Generate all the equations about the types.
- Solve that set of equations.
First, for each expression in a Flux script, we assign it a new type variable. (We call them type variables because they represent types; we do not know which type until we solve for them.)
Our goal is to solve for those type variables. To do so, we need some equations. We get those equations by looking at how each type variable is used within the Flux program.
Once we have collected all the equations, we can solve them. We do this the same way we solve an algebraic equation, by rearranging the equations according to the rules and using substitution to replace variables in the equations until we have a simple
X = Type equation for each type variable.
Here is a simple hello world Flux program that will print “hello world” when run via the REPL.
a = "hello " b = "world" a + b
We can follow the process that type inference takes to know the types of all of the expressions in this program. I have chosen this example because all expressions in this program have the type string, and this makes it easy to see the solution we are working towards as we take each step.
The first step is to assign the type variables to the expressions. This program has three expressions: one for each line. Let’s give these type variables the names
Z for each line in order.
a = "hello " // X b = "world" // Y a + b // Z
Notice that we are talking about two different kinds of variables:
- Variables in the Flux program itself like
- Type variables like
To help differentiate between the two, I will use upper case letters for type variables and lowercase letters for program variables.
Next, we need to gather all the equations we know about those type variables.
First, we have the type variable
X with the expression:
a = "hello "
Since we see that only a string literal is used, we know that the expression must have a string type. We can write down an equation for
X stating that
X must be equal to the type
X = string
We also see that we assigned a value to the program variable
a, so we write down the type variable that is associated with the normal variable like this:
a -> X
We track the association of the program variables to their corresponding type variables so we can look them up whenever they are used in other expressions.
The next expression for type variable
b = "world"
This is the same as
X from a type perspective; therefore, we get the following equation and type variable association.
Y = string b -> Y
Finally, we need to write down an equation for
Z, remember its expression:
a + b
This expression doesn’t have any hints as to what type it is because only program variables are used. However, we can still learn something useful from the expression.
We know that both sides of the
+ operator must be the same type. This is what we mean by using the rules of types: certain operators must be used in specific ways, and we write down equations to capture those constraints.
Since we don’t know the type yet, we create a new type variable
W as a placeholder for now. Now, we can write down some more equations using
W and the
lookup function to indicate we need to lookup the association of a program variable to its type variable.
W = lookup(a) W = lookup(b) Z = W
We can lookup the type variable for
b and update the equations.
W = X W = Y Z = W
Now we have examined all the expressions in the program, and we can move onto the next step to solve those equations. This part proceeds in the same way as solving the algebraic equations. We start substituting variables for their equation until we learn the type of every variable.
Here are all the equations we know at this point:
X = string Y = string W = X W = Y Z = W
We can substitute the type variables
Y and get these simplified equations.
W = string W = string Z = W
Now we know a value for
W and can substitute it in the equation for
Z like so:
Z = string
We now have a set of equations and have solved for all type variables.
X = string Y = string Z = string W = string
We can now say that all expressions in the Flux script are strings as we expected when we first saw the program.
Taking a step back, we can start to see how to systematically solve for the types of expressions within a Flux program. We can also see how we can find errors in programs. For example, take this slightly modified example program:
a = "hello " b = 1 a + b
When creating the type equations for this program, we would get the following:
X = string Y = integer W = X W = Y Z = W
Using substitution, we would get:
W = string W = integer
At this point, we have discovered a type error because
W cannot be both equal to a string and an integer. When there is not a valid solution to the type equations, then there is a type error in the program. This error is valuable because it tells us about a mistake in our program before we ran it.
That’s type inference in a nutshell; generate a bunch of equations based on how the expressions are used within the Flux program and then solve that set of equations for all the type variables. Hopefully, this has shed some light on how we perform static analysis (i.e. type inference) in order to learn the types in a Flux program. Armed with the knowledge of types, an editor can then use those types to provide helpful suggestions and error messages when developing Flux scripts.
Want to learn more about type inference?
Naturally, type inference is more complex in practice since even simple Flux programs have lots of expressions, and therefore, lots of equations to solve. Solving a large number of equations can be very difficult to do efficiently and accurately. Imagine trying to solve a system of hundreds of algebra equations. That would be very tedious and easy to make a mistake.
Additionally, as you may remember from algebra class, picking which equations to substitute into which other equations can make solving the problem either easy or extremely difficult. For algebra this becomes a bit of an intuitive art that students get good at with practice. Unfortunately, computers are not good at intuition. That is where Algorithm W comes in. It is a specific algorithm for how to generate and solve type equations efficiently and correctly.
If you want to learn more, there are decades of research and literature on these topics. Here are a few pointers to get you started.
- Algorithm W is a class of type inference algorithms known as Hindley-Milner.
- We also use an extension to Algorithm W for extensible records see the paper.
- Try out the language SML it has a very simple type system based on Hindley-Milner and is a quick language to learn. It’s similar to Haskel and OCaml but drastically easier to learn and simpler to understand.
- There are several implementations of various type inference algorithms along with their research in a GitHub repo.
I personally found no shortcuts in trying to learn type inference; most of the literature assumes some basic understanding of the concepts of types and computational theory. I would start with lambda calculus as all literature about Hindley-Milner assumes familiarity with lambda calculus. Then follow up with experimentation and practice on your part.
Additionally, I haven’t mentioned yet the Flux language server which is responsible for using the type information learned from static analysis and translating that into suggestions the editor can then present. It’s the critical piece that makes this all work. Check out the Flux LSP implementation on GitHub.
That’s all for today, thanks for following along.