I do all my coding in Rust. I’ll try to explain how other languages handle similar concepts, but may get those explanations/details wrong. If I messed up, let me know and I will correct them.
I’ve become very enamoured with neural networks, and believe they will raise the bar for baseball analytics in the public space. To aid my learning process, I decided it was time to build a neural network from the ground up, and today, I’m going to share with you the core code, which is of course written in Rust. I’m still tinkering with the code a lot, so will hold off on publishing the full code base until it’s ready.
My Intro Article on the Subject:
Get Started with Rust
Let’s Dive into the Code!
Dependencies
In Rust, if you’re using Cargo (you should be), you’ll have a Cargo.toml file where you’ll list the crates you depend on. At the time of writing I’m relying on 5 crates to make my project run. I’ll discuss them as we get to relevant parts of the code.
Step 1 Reading CSVs
Here’s the full reading CSV code as one large block. I’ll break it down piece by piece with copy-pastable code.
The Function Signature
pub fn read_csv_regression(file_name: &str, csv_columns: Vec<&str>) -> Vec<Vec<Option<f32>>> {
The keyword pub means that the function is public, i.e. available to be called by other parts of the code. Rust makes everything private by default. It can be annoying at times, but generally speaking, you want to be explicit about the parts of your code that are allowed to be used by others (including yourself).
The way to read the function signature in English goes something like this:
read_csv_regression is a public function which receives as input a &str and a Vec<&str> and returns a Vec<Vec<Option<f32>>>
You’re probably wondering what a lot of those things are, so let’s break them down.
&str is a string slice (read: text). There is a TON of nuance when it comes to strings, so for now just think of it as something that takes a text field. The & symbol means we’re borrowing the string, which means what it sounds like. When we borrow something in Rust, we must be able to return it to the original owner in exactly the same condition we received it. This means that we are going to receive a string, but we can’t change that string at all. We have to use the string as is. We can read it, but can’t change it.
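To make the borrowing idea a bit more concrete, here’s a minimal sketch. The count_chars function and the file name are made up for illustration; they aren’t part of my project.

```rust
// Hypothetical helper: borrows the text, reads it, and hands it back untouched.
fn count_chars(text: &str) -> usize {
    text.chars().count()
}

fn main() {
    let file_name = String::from("batting.csv");
    // `&file_name` lends the String out as a &str for the duration of the call.
    let n = count_chars(&file_name);
    // The owner still has the string afterwards, exactly as it was.
    println!("{file_name} has {n} characters");
}
```

The function can read the text as much as it likes, but nothing in its body is allowed to modify it.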
Vec<&str> is a vector (read: list/array) of &str. In JavaScript you’ll see this as an Array type, in Python this is a List. In Rust, when we use a Vec or similar, each item in it must be of the same type. In R, this maps pretty cleanly, as a Vector requires a list of items of the same type, whereas a List can mix numerical and string types together.
Option<f32> is our way of telling Rust that the value is either Some value or None. In other words, Option is how potentially Null values are handled in Rust. This forces us to make decisions about how we handle null/empty values. Some languages will just make those decisions for you (JavaScript, for example, treats null as 0 in arithmetic); I’m not overly familiar with how other languages deal with this. The important concept to take away here is that we must wrap the type in an Option if it might be null. If you try to multiply an Option<f32> directly, the code won’t compile; you have to get the value out first, and if you unwrap a None, Rust will panic and kill your program. It will never silently convert your types without you explicitly telling it to. We’ll discuss how this contrasts to other languages in part 2.
Vec<Vec<Option<f32>>> is a Vector of Vectors. The first (outer) Vector is a list of all the rows. The inner Vector is the data for each row, which lists all the column values, which are all 32-bit floating point numbers (f32) that could be null/empty. In C/C++ this is the float type. If we wanted more precision, we could use the f64 type, which in C/C++ would be called a double. In R, Python and JavaScript, floating point numbers will be represented by double precision floating point values.
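Here’s a small sketch of what that nested type looks like in practice, and how Option forces a decision about missing cells. The column_mean helper and the sample numbers are my own illustration, not code from the project.

```rust
// Hypothetical helper: average one column, skipping missing (None) cells.
fn column_mean(rows: &[Vec<Option<f32>>], col: usize) -> Option<f32> {
    // Keep only the cells in this column that actually hold a value.
    let values: Vec<f32> = rows.iter().filter_map(|row| row[col]).collect();
    if values.is_empty() {
        None // nothing to average, so the answer is "missing" too
    } else {
        Some(values.iter().sum::<f32>() / values.len() as f32)
    }
}

fn main() {
    // Outer Vec = rows, inner Vec = columns; None marks an empty cell.
    let table: Vec<Vec<Option<f32>>> = vec![
        vec![Some(1.0), Some(10.0)],
        vec![Some(3.0), None],
    ];
    println!("{:?}", column_mean(&table, 0));
    println!("{:?}", column_mean(&table, 1));
}
```

Skipping the None cells is just one possible choice here. The point is that Rust makes you write that choice down instead of guessing for you.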
That was way more words than I expected to write about a simple function signature, and I’m not even done yet! There’s one more very important concept with Rust functions:
Rust functions MUST receive exactly the types you specify in the signature and will ALWAYS return exactly the types you specify in the signature. In this case, we’re receiving a &str and a Vec<&str>. If we try to call the function and feed it anything else, the compiler will yell at us. This means that we can never accidentally feed a function too little or too much. If we change a function signature, the compiler will be able to show you all the places in your code that you need to update. This is one of Rust’s best features. If you refactor/change your function, you don’t need to worry about all the other places in your code you’re calling the function, the compiler will just surface a bunch of errors for you.
Future versions of Rust may allow you to create multiple versions of the same function, which take different inputs, but for now, that is not possible.
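As a rough illustration of that strictness, here’s a toy function with the same input types as read_csv_regression. The describe function and its body are hypothetical; only the shape of the signature comes from the code above.

```rust
// Hypothetical function with the same input types as read_csv_regression.
fn describe(file_name: &str, csv_columns: Vec<&str>) -> String {
    format!("{file_name}: {} columns", csv_columns.len())
}

fn main() {
    // This matches the signature exactly, so it compiles.
    println!("{}", describe("batting.csv", vec!["HR", "RBI"]));

    // Neither of these would compile if uncommented:
    // describe("batting.csv", "HR"); // second argument is a &str, not a Vec<&str>
    // describe("batting.csv");       // one argument too few
}
```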
Using the CSV crate
let mut csv_reader = csv::Reader::from_path(file_name).expect("Couldn't open the csv");
Let’s explain what’s happening:
let mut csv_reader is defining a variable called csv_reader which is mutable. This means the value of csv_reader is allowed to change. By default, all variables in Rust are immutable, meaning they can’t be changed. There are good reasons for this, which we’ll discuss when we get into the weeds of performance optimizations.
csv::Reader::from_path(file_name) uses the CSV crate. It calls from_path, a function associated with the crate’s Reader type. In Rust, these paths are specified with :: syntax. The from_path function accepts anything that can be treated as a file path, which includes an &str, in this case the same &str we are feeding into our function. This is how Rust composes all your code together. It knows exactly what each function is doing, even if you have 1000 dependencies. This is extremely important information for the compiler to produce code that is highly optimized (fast).
.expect("Couldn't open the csv") is actually doing two things. First, the from_path function returns a Result. A Result is something that is either Ok (i.e. it didn’t return an error) or Err (it did). In this case we are expecting that the function will be able to return something that isn’t an error, and in the case that it doesn’t we want the program to crash and give us “Couldn’t open the csv” as an error message. The second thing it is doing is unwrapping the Result and returning the value inside it. If you’re new to Rust, that’s not going to make a whole lot of sense right now. Don’t worry about it, we’ll cover it in much greater detail. What you should be taking away is that Rust is forcing us to be explicit about what we want to happen in case something goes wrong. In this case, we’re saying we want the program to crash (called a panic in Rust parlance) and give us a handy error message.
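If Result and .expect() still feel abstract, here’s a minimal sketch using only the standard library, so it doesn’t need the csv crate or a file on disk. The parse_year function is hypothetical and just stands in for any function that can fail.

```rust
use std::num::ParseIntError;

// Hypothetical fallible function: parsing text can fail, so it returns a Result.
fn parse_year(text: &str) -> Result<u32, ParseIntError> {
    text.trim().parse::<u32>()
}

fn main() {
    // Option 1: handle both outcomes yourself.
    match parse_year("20x3") {
        Ok(year) => println!("parsed {year}"),
        Err(e) => println!("could not parse: {e}"),
    }

    // Option 2: take the Ok value, or panic with this message if it is an Err.
    let year = parse_year(" 2023 ").expect("Couldn't parse the year");
    println!("{year}");
}
```

In the CSV code I’m using the second style: if the file can’t be opened there is nothing useful left to do, so crashing with a clear message is fine.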
let column_headers = csv_reader.headers().expect("Couldn't get headers");
Let’s break this down as well:
let column_headers we’re defining a variable called column_headers. In this case, we don’t need it to be mutable (changeable) so we can leave out the mut keyword.
csv_reader.headers().expect("Couldn't get headers") - use the headers function and unwrap its Result, and crash if there’s an error.
Getting The Columns We Want
let col_nums: Vec<usize>
We’re creating a variable called col_nums, which will be a Vector of usize. usize is an unsigned integer (an integer that’s never negative), and will typically be either 32 bits or 64 bits depending on the target the code is compiled for. What’s important to know is that when we index into a Vector in Rust, it expects a usize, so we want our indices to all be usizes.
= csv_columns.iter()
Here we’re taking the csv_columns that we’ve fed into our function (a list of column names) and creating an iterator over it. Iterators are extremely powerful and allow the compiler to perform a host of optimizations. They also make parallel programming super easy, as we’ll see in a later post.
.map(|col| {
We now want to chain a bunch of operations on each item that we are iterating over. The .map() adapter (often loosely called a functor; don’t go down the rabbit hole of category theory) will operate on every single item. The |col| is the parameter list of a closure: it names the item we are currently iterating over “col”. We then start an opening brace { to signify that we want to write a multi-line expression. Rust is an expression-based language, which I can’t really describe succinctly. I’ll try to explain it as we go.
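A tiny sketch may help with both the closure syntax and the “expression-based” idea. The name_lengths function and the sample names are invented for this example.

```rust
// Hypothetical helper: map every name to its trimmed length using a closure.
fn name_lengths(names: &[&str]) -> Vec<usize> {
    names
        .iter()
        .map(|name| {
            // A block is an expression: its last line (no `;`) is its value.
            let trimmed = name.trim();
            trimmed.len()
        })
        .collect()
}

fn main() {
    let lengths = name_lengths(&["HR", " RBI "]);
    // `if` is an expression too, so it can sit on the right-hand side of `let`.
    let label = if lengths.len() > 1 { "many" } else { "one" };
    println!("{lengths:?} {label}");
}
```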
Now we begin our expression, which will map the col to an index. We’ll start by creating a new iterator. Yes, it’s iterators all the way down.
let col_num: Vec<usize> = column_headers.iter()
Now we iterate through the column_headers we got from reading our CSV file.
.enumerate()
This changes the value from an &str to a tuple of (item_number, &str). This will allow us to keep track of the column number associated with the column name.
.filter(|(col_num, col_name)| col == col_name)
We only want the column index if the column name in the CSV file matches one of the names we fed our function. So we use the .filter() adapter with a closure over the tuple that .enumerate() created, keeping the tuple only when col (the parameter of the outer .map() closure, all the way above) is equal to the current col_name.
.map(|(col_num, _)| col_num)
.collect();
Map the tuple value, discard the column name and keep the index. In Rust, when we have an unused variable, the compiler expects you to use the _ to indicate that. We then collect it into the Vec<usize> above.
That’s the technical explanation for what’s going on. In simpler terms here’s what’s happening:
Loop over each column name that you fed the function
Loop over each column name that is in the CSV file and add an index to it
If the names match, keep the column index, column name pair
Keep the indices, and discard the names
Collect the indices into a Vector
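The inner part of those steps (everything that happens for a single column name) can be sketched on its own. I’m using a plain slice of &str as a stand-in for the headers the csv crate gives us, and matching_indices is a name I made up for this example.

```rust
// Hypothetical helper: every position at which `wanted` appears in `headers`.
fn matching_indices(headers: &[&str], wanted: &str) -> Vec<usize> {
    headers
        .iter()
        .enumerate() // attach an index: (index, name) pairs
        .filter(|(_, name)| **name == wanted) // keep the pairs whose name matches
        .map(|(index, _)| index) // keep the index, discard the name
        .collect() // gather the indices into a Vec<usize>
}

fn main() {
    let headers = ["name", "hr", "rbi", "hr"];
    println!("{:?}", matching_indices(&headers, "hr"));
    println!("{:?}", matching_indices(&headers, "sb"));
}
```

A duplicated header produces two indices and a missing header produces none, which is exactly what the next checks look for.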
Now there are two possible things that could go wrong. The source file could have the same column header twice. It also could simply not have the column you think it should have. In my opinion, the code should crash in these cases, so let’s add a couple of checks to protect against these problems.
if col_num.len() > 1 {panic!("Found {col} twice!")}
The previous snippet collected all the times it matched the column names. This should have a length of 1, but if it’s more than 1, we have a problem. This code will panic and crash the program, spitting out an error message telling you which column name it found twice.
*col_num.get(0).expect(&format!("Couldn't find {col}!"))
The * means we are dereferencing the value. This has to do with ownership and borrowing, and we’ll gloss over it for now. The key is to understand it as “we want the actual thing, not a reference to the thing”.
We then use the .get() function to access the first item. Because the Vector might be empty, .get() returns an Option rather than the value itself. We unwrap the Option with the .expect() function, which panics with our error message if it is None. format! is a macro which returns a String.
Because this line is the last line of the closure and doesn’t have a ; Rust understands that you want the expression to implicitly return this. It’s equivalent to
return *col_num.get(0).expect(&format!("Couldn't find {col}!"));
… which is probably what you’re used to seeing in most other languages. Once you get used to it, it becomes second nature. Just remember to leave the ; off at the end of the last line, and it becomes an implicit return.
})
At this point we’ve mapped the value to one usize and handled the two errors we could have come across.
.collect();
We now collect the singular indices into one Vector of usize.
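Putting the pieces of this section together, the whole lookup might look roughly like the sketch below. It takes the headers as a plain slice of &str instead of reading them with the csv crate, and the column_indices name is my own, so treat it as an approximation of the real code rather than a copy of it.

```rust
// Sketch of the column lookup, with a plain slice standing in for the CSV headers.
fn column_indices(headers: &[&str], wanted: &[&str]) -> Vec<usize> {
    wanted
        .iter()
        .map(|col| {
            // Every position at which this column name appears in the headers.
            let col_num: Vec<usize> = headers
                .iter()
                .enumerate()
                .filter(|(_, col_name)| *col_name == col)
                .map(|(index, _)| index)
                .collect();
            // The same header appearing twice is treated as a fatal error.
            if col_num.len() > 1 {
                panic!("Found {col} twice!")
            }
            // A missing header is fatal too; otherwise this is the closure's value.
            *col_num.get(0).expect(&format!("Couldn't find {col}!"))
        })
        .collect()
}

fn main() {
    let headers = ["name", "hr", "rbi"];
    // Indices come back in the order the columns were requested.
    println!("{:?}", column_indices(&headers, &["rbi", "hr"]));
}
```

Note that the result follows the order of the requested names, not the order of the columns in the file.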
Reading the CSV Records
That was a lot of words just to get the relevant indices of a csv file. But it also was about 10-15 lines of Rust code, which demonstrates how much complexity Rust can encapsulate in a small amount of code.
Here’s the full block of code to read the CSV. Note how it is all one giant expression that we leave as the last thing in the function, which Rust understands to use as the return value for the function.
Let’s step through what’s going on and we’ll dive into the new syntax separately.
Read the CSV into records
Iterate over the records (into_iter is an “owned” iterator, don’t worry about the difference between .iter() and .into_iter() right now)
Filter out the bad records and keep the good records
Collect it all into a new Vector (annoying that we have to do an intermediate collection, but we need it for *reasons*).
Now that we have only good records, map each record to the Option<f32> we discussed above.
Collect all the Option<f32>s into a Vector.
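Here’s a rough, standard-library-only sketch of those steps. I’m faking the CSV records with a RawRecord type alias (each record is either a list of cells or an error message), because the real csv crate has its own record and error types. The to_numbers name is also made up.

```rust
// Stand-in for a record read from a CSV file: the row's cells, or an error message.
// This is an assumption for the sketch, not the csv crate's actual types.
type RawRecord = Result<Vec<String>, String>;

fn to_numbers(records: Vec<RawRecord>) -> Vec<Vec<Option<f32>>> {
    records
        .into_iter() // owned iterator: consumes `records`
        .filter(|record| record.is_ok()) // drop the rows that failed to read
        .map(|record| record.unwrap()) // safe here: only Ok rows are left
        .collect::<Vec<Vec<String>>>() // the intermediate collection
        .iter()
        // Turn each cell into Some(number), or None if it can't be parsed.
        .map(|cells| cells.iter().map(|cell| cell.parse::<f32>().ok()).collect())
        .collect()
}

fn main() {
    let records: Vec<RawRecord> = vec![
        Ok(vec!["1.5".to_string(), "".to_string()]),
        Err("bad row".to_string()),
        Ok(vec!["x".to_string(), "2".to_string()]),
    ];
    println!("{:?}", to_numbers(records));
}
```

This sketch converts every cell in a row. The real code only keeps the columns we asked for, which is what the next part walks through.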
Let’s zoom in on the code in the .map() section:
Let’s go line by line again:
let mut row: Vec<Option<f32>> = vec![];
We start by allocating a Vector which we’ll need to be mutable so that we can add items to it. The vec![] macro returns an empty Vector.
for col in col_nums.iter() {
This is syntax that will likely be familiar to a lot of you. It loops through the col_nums, allowing us to run operations on each value.
let val = record[*col].to_string();
Each record is essentially a Vector of column values. We simply index into the record using the column number we got earlier. We then convert it to a String. There’s probably a performance optimization here to convert directly into an f32, but with Rust, it’s often unnecessary.
let val = val.replace(",", "");
Here we shadow the val variable, essentially replacing the old version with this version. This is the Rusty way of doing things. val starts as a String, but we’re going to change the type to an Option<f32>, so we need to explicitly specify how we are changing it at every step.
let val = if val.is_empty() {None} else {
Here we check if the val String is empty. If it is, we want it to be None, which is the value an Option<T> takes when there is nothing in it (Null, in other languages). Option<T> is how we express a generic Option. Don’t worry about what that means right now, but the terminology will be relevant later.
match val.parse::<f32>() {
Ok(v) => Some(v),
_ => None,
}
};
Now that we have the value, we’re going to use the amazing match expression in Rust. I’ll have a lot more to say about match, but for now, you can think of it like a Case or Switch statement, with the caveat that you must give match all the possible situations it can encounter. In this case, if our attempt to parse the value as an f32 didn’t return an error, we return Some(v), which is how an Option<T> holds a value, as opposed to None.
_ => None,
This means “ELSE Return None”. The underscore means “anything else”.
row.push(val);
Push the Option<f32> into our row Vector. This should be analogous to push operations in most languages.
}
row
Implicitly return the row vector.
.collect()
Collect the row vectors into another vector and then implicitly return that.
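To see the row-building logic in one place, here’s a self-contained sketch. A record is represented as a plain slice of &str rather than the csv crate’s record type, and the parse_cell and build_row names are mine, so this approximates the code above rather than reproducing it.

```rust
// Hypothetical helper: clean up one cell and try to turn it into a number.
fn parse_cell(raw: &str) -> Option<f32> {
    let val = raw.replace(",", ""); // "1,234" becomes "1234"
    if val.is_empty() {
        None
    } else {
        match val.parse::<f32>() {
            Ok(v) => Some(v),
            _ => None, // anything that isn't a number counts as missing
        }
    }
}

// Hypothetical helper: build one output row from the columns we care about.
fn build_row(record: &[&str], col_nums: &[usize]) -> Vec<Option<f32>> {
    let mut row: Vec<Option<f32>> = vec![];
    for col in col_nums.iter() {
        row.push(parse_cell(record[*col]));
    }
    row // no `;`, so this is the return value
}

fn main() {
    let record = ["Trout", "1,234", "", "abc"];
    println!("{:?}", build_row(&record, &[1, 2, 3]));
}
```

If you prefer something shorter, val.parse::<f32>().ok() does the same job as the match, since .ok() turns a Result into an Option.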
Conclusion
That was much longer than I expected it to be when I sat down to write this. I thought I would be able to fit the entire code-base into one article, but after the first section I realized there was no way that was ever going to happen. I hope this will help you understand Rust code a little bit more. In Part 2, we’ll discuss how we take the raw data and convert it into a useable data set we can use for building models.