Getting Started with agnes


This guide for using the agnes data wrangling library will walk you through a fairly typical, if mundane, data science preprocessing task: loading some source data files, joining them together on the appropriate fields, selecting the specific data you’re looking for, and performing some basic transformations.

This guide is intended for people who have some familiarity with Rust.

You may also be interested in the API documentation.

Data

For this guide, we will be working with data from The World Bank to examine the link between life expectancy and GDP for various countries. We’ll be working with three individual data files (LICENSE):

Initializing our Project

We’ll start by creating a new cargo package for this guide. At a command prompt, navigate to the directory you want to create your project in, and type:

$ cargo new --bin gdp_life

This creates a new cargo package in the gdp_life directory. We specified --bin to indicate that we are making a binary (runnable) application instead of a library.

Next, we need to modify the [dependencies] section of that package’s Cargo.toml file to load agnes:

[dependencies]
agnes = "0.3"

With the appropriate library loaded, we’re ready to move on to actual code!

Defining Tables

The first bit of code we need to write is the data table definition for the GDP table. agnes handles data loading in a statically-typed manner, and thus we need to tell agnes what data types we are loading and how we want to label them.

In agnes, fields (i.e. columns in your data set) are labeled by unit-like structs by which we can refer to an individual field at the type level, as opposed to using a runtime value (for example, a field name as a str or a usize field index or enum variant). This provides us with valuable compile-time field type checking and prevents us from running into some avoidable runtime errors.
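
To make the idea of type-level labels concrete, here is a minimal standalone sketch that uses only the standard library. The FieldLabel trait and the two hand-written structs are assumptions made for illustration; they are not the types agnes generates, which carry more machinery.

```rust
// Hypothetical sketch: unit-like structs used as type-level field labels.
// `FieldLabel` is an illustrative trait, not part of agnes.

/// Associates a label type with a column name and a data type.
trait FieldLabel {
    /// The Rust type stored in this field.
    type DType;
    /// Human-readable name of the field.
    const NAME: &'static str;
}

// Unit-like structs: they carry no data and exist only as types.
struct CountryName;
struct Gdp2015;

impl FieldLabel for CountryName {
    type DType = String;
    const NAME: &'static str = "CountryName";
}

impl FieldLabel for Gdp2015 {
    type DType = f64;
    const NAME: &'static str = "Gdp2015";
}

/// The field is chosen at compile time through the type parameter `L`.
fn field_name<L: FieldLabel>() -> &'static str {
    L::NAME
}

/// The return type depends on the label, so asking for the wrong
/// type is a compile error rather than a runtime failure.
fn default_value<L: FieldLabel>() -> L::DType
where
    L::DType: Default,
{
    Default::default()
}
```

With these definitions, field_name::<Gdp2015>() yields the label's name, and binding default_value::<Gdp2015>() to a String variable would fail to compile because the label is tied to f64.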

We define these field labels and their associated data types with a macro called tablespace, which generally looks like this:

tablespace![
    table my_first_table {
        Table1Col1: u64,
        Table1Col2: String,
    }
    table my_second_table {
        Table2Col1: f64,
        Table2Col2: f64,
    }
];

In this example code we are declaring two tables: my_first_table and my_second_table, each with two fields (in the first table, an unsigned integer and a string, and in the second table, two floats). The tablespace macro will generate a module for each table, as well as the label structs for each field. For example, the call to tablespace above will create the module my_first_table, which will include within it the types Table1Col1 and Table1Col2. Note that you can preface the word table in these declarations with visibility modifiers (e.g. pub).

In most cases we will want to define all tables we’re loading in an application within a single tablespace macro call. Theoretically we could call tablespace multiple times, but we would not be able to combine fields from tables in different tablespaces.

Getting back to our GDP table example, we first have to load the agnes crate in our main.rs file:

#[macro_use]
extern crate agnes;

Next, the appropriate tablespace call would be:

tablespace![
    table gdp {
        CountryName: String,
        CountryCode: String,
        Gdp2015: f64,
    }
];

This generates the gdp module and the label structs CountryName, CountryCode, and Gdp2015, along with some code to make these label structs work within the agnes framework. Since this code generates a module, it is best placed outside the main or any other functions.

Our data source contains GDP data for the years from 1960-2017. We’ve chosen (arbitrarily) to use the 2015 GDP data for our purposes.

Later on, we’ll add to this tablespace, but for now let’s continue with only the gdp table defined.

Loading a Data Set

Now that we’ve defined the appropriate table and fields, we can move on to loading this table from a data source. To do this, we provide a source schema – a description of how to hook up data from the source file to the fields we described in our call to tablespace above. This is done with another macro, schema.

Within our main function, we add the following code:

let gdp_schema = schema![
    fieldname gdp::CountryName = "Country Name";
    fieldname gdp::CountryCode = "Country Code";
    fieldname gdp::Gdp2015 = "2015";
];

This defines a new source schema using the schema macro, and stores the result in gdp_schema. The schema macro syntax is a list of fieldname or fieldindex declarations that connect the field labels to column headers or column indices (starting from 0), respectively. In this case, we are providing three column headers: the CountryName field label will take data from the column with the “Country Name” header, the CountryCode field label will take data from the column with the “Country Code” header, and the Gdp2015 field label will take data from the column with the “2015” header.
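
The difference between the two kinds of declaration can be sketched without agnes at all. The FieldSpec enum and resolve function below are hypothetical names chosen for this illustration; they only model how a header-based or index-based declaration might be turned into a column position, and say nothing about how the schema macro is implemented.

```rust
/// How a field locates its column (hypothetical type for illustration).
enum FieldSpec<'a> {
    /// Match a column by its header text, like a fieldname declaration.
    Name(&'a str),
    /// Use a zero-based column position, like a fieldindex declaration.
    Index(usize),
}

/// Resolve a spec against a header row.
/// Returns None when the header is absent or the index is out of range.
fn resolve(spec: &FieldSpec, headers: &[&str]) -> Option<usize> {
    match spec {
        FieldSpec::Name(name) => headers.iter().position(|h| h == name),
        FieldSpec::Index(i) if *i < headers.len() => Some(*i),
        FieldSpec::Index(_) => None,
    }
}
```

Given an abbreviated header row of "Country Name", "Country Code", and "2015", the spec FieldSpec::Name("2015") resolves to column 2, and FieldSpec::Index(0) resolves to column 0.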

Next, we want to load the CSV file. We could download the CSV file locally and load that, but for convenience, we’re going to have the program load the CSV file directly from the web. Fortunately, agnes has a utility function for loading a CSV file from a URI string. At the top of our main.rs file, after the extern crate line, let’s add the following import:

use agnes::source::csv::load_csv_from_uri;

The load_csv_from_uri function takes two arguments: the URI you wish to load the data from, and the source schema for the file. It returns a Result containing the DataView object containing our data. While agnes strives to limit the number of runtime errors, the types of errors that can occur while loading a file (incorrect URI, network error, etc.) are not predictable, so this particular functionality does require some runtime error-checking. Our call will look like this:

let gdp_view = load_csv_from_uri(
    "https://wee.codes/data/gdp.csv",
    gdp_schema
).expect("CSV loading failed.");

This code calls the load_csv_from_uri function with the URI for our data as the first argument and our source schema as the second, unwraps the Result (with a helpful error message), and stores the resulting DataView in gdp_view.
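
Calling expect is convenient in a guide, but it panics on failure. Where a program should recover instead, the Result can be inspected with match or propagated with the ? operator. The sketch below shows that pattern on a deliberately tiny, hypothetical loader: parse_record and LoadError are invented for this example, use only the standard library, and handle a simplified two-column "name,gdp" line with a naive comma split (no quoted fields).

```rust
/// Errors this toy loader can report (hypothetical type).
#[derive(Debug, PartialEq)]
enum LoadError {
    /// The column at this zero-based position was missing or empty.
    MissingColumn(usize),
    /// The GDP column could not be parsed as a number.
    BadNumber(String),
}

/// Parse one simplified "name,gdp" line into a (name, gdp) pair.
fn parse_record(line: &str) -> Result<(String, f64), LoadError> {
    let mut cols = line.split(',');
    let name = cols
        .next()
        .filter(|s| !s.is_empty())
        .ok_or(LoadError::MissingColumn(0))?;
    let raw = cols.next().ok_or(LoadError::MissingColumn(1))?;
    let gdp = raw
        .trim()
        .parse::<f64>()
        .map_err(|_| LoadError::BadNumber(raw.to_string()))?;
    Ok((name.to_string(), gdp))
}

/// Turn a result into a message instead of panicking.
fn describe(result: &Result<(String, f64), LoadError>) -> String {
    match result {
        Ok((name, gdp)) => format!("{} has GDP {}", name, gdp),
        Err(LoadError::MissingColumn(i)) => format!("column {} is missing", i),
        Err(LoadError::BadNumber(text)) => format!("'{}' is not a number", text),
    }
}
```

The same match-based handling could be applied to the Result returned by load_csv_from_uri, with arms for whatever error type that function reports.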

The DataView object is the primary struct for working with data in the agnes library. Its functionality includes, but is not limited to: viewing data, extracting data, merging or joining with other data sources, and computing simple view statistics.

Now that we’ve loaded this data, it is simple to display it using typical Rust display semantics:

println!("{}", gdp_view);

The full code for our initial example is:

#[macro_use]
extern crate agnes;

use agnes::source::csv::load_csv_from_uri;

tablespace![
    table gdp {
        CountryName: String,
        CountryCode: String,
        Gdp2015: f64,
    }
];

fn main() {
    let gdp_schema = schema![
        fieldname gdp::CountryName = "Country Name";
        fieldname gdp::CountryCode = "Country Code";
        fieldname gdp::Gdp2015 = "2015";
    ];

    // load the GDP CSV file from a URI
    let gdp_view = load_csv_from_uri(
        "https://wee.codes/data/gdp.csv",
        gdp_schema
    ).expect("CSV loading failed.");

    println!("{}", gdp_view);
}

If your file looks like this, you should be able to type cargo run at your command prompt, and after compilation you’ll see your loaded data!

This example is available here. Additionally, you can find a version which loads the CSV file from a local path here.

Preprocessing

So far, we’ve loaded a data file and displayed it, but we haven’t really done any preprocessing work. Looking through the data we displayed in the last section, you may have noticed that this data set includes several aggregate categories for regions and income groups, such as “East Asia & Pacific”, “Upper middle income”, and “World”. While these categories may be useful in some contexts, we really only wanted to examine countries.

Looking at the full source data set, it doesn’t seem like there are any ways to easily filter out the categories from the countries. However, the World Bank website also provides metadata information for the GDP data, which has been rehosted here. This file contains the Country Codes referenced in the primary GDP data file, along with their associated region and income group. Upon examination, you may notice that some of the regions and income groups are missing – specifically, for all the “country codes” that are used to denote regions or income groups themselves.

Thus, we can come up with a preprocessing step to filter out the non-countries in our data set: filter the metadata so that only actual countries exist, and then perform a join of that metadata with the original data set (effectively filtering the original GDP data set to only include actual countries).

With this plan in mind, let’s translate this into programming tasks:

  • Define the GDP metadata table in our tablespace.
  • Write the source schema for loading the data from the file.
  • Load the file from URI.
  • Filter the loaded data set to ignore records without ‘Region’ data (and therefore include only actual countries, not aggregates).
  • Join the filtered metadata with the GDP DataView we loaded in the last section.

Metadata Table Definition

For defining the metadata table, let’s augment our existing tablespace macro call. We’re only concerned with the CountryCode field (to be able to join with the GDP DataView), and the Region field (to filter upon).

tablespace![
    /* ... */
    table gdp_metadata {
        CountryCode: String,
        Region: String,
    }
];
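
Both tables now declare a CountryCode label. Because each table gets its own module, gdp::CountryCode and gdp_metadata::CountryCode are two distinct types that merely share a name. The following sketch demonstrates this with hand-written stand-ins for the generated structs (the real ones come from the tablespace macro and carry additional trait implementations); same_label is a helper invented for this example.

```rust
use std::any::TypeId;

// Hand-written stand-ins for the label structs, for illustration only.
mod gdp {
    pub struct CountryCode;
}

mod gdp_metadata {
    pub struct CountryCode;
}

/// Returns true when the two type parameters name the same type.
fn same_label<A: 'static, B: 'static>() -> bool {
    TypeId::of::<A>() == TypeId::of::<B>()
}
```

Here same_label::<gdp::CountryCode, gdp_metadata::CountryCode>() is false, which is why operations that involve both tables always spell out the module path of each label.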

Metadata Source Schema

Next, let’s write the source schema for the metadata file. We can add these lines to the main function:

let gdp_metadata_schema = schema![
    fieldindex gdp_metadata::CountryCode = 0usize;
    fieldname gdp_metadata::Region = "Region";
];

For demonstration purposes, we’re using the fieldindex specifier to indicate that the CountryCode data is found in the 0th (first) column of the spreadsheet. We could have equivalently used fieldname gdp_metadata::CountryCode = "Country Code";.

Load Metadata File

Loading the file works the same way as it did for loading the GDP data:

let mut gdp_metadata_view = load_csv_from_uri(
    "https://wee.codes/data/gdp_metadata.csv",
    gdp_metadata_schema
).expect("CSV loading failed.");

The only difference here is that we specified that this DataView is mutable – this will allow us to filter the view.

Data Filtering

Now, we’ll do something new: manipulating a DataView. The DataView type has a method called filter, which takes a simple true-false predicate for a particular field and removes every record for which that predicate returns false.

You may recall that in agnes, field labels are themselves types. Therefore, to specify a field (or fields) in agnes, we often have to explicitly supply type arguments when calling methods on a DataView (or other agnes data structure) using what is often referred to as the “turbofish” syntax: object.method::<...>(...). The filter method on DataView is one such method. To call this method, we write:

let gdp_metadata_view =
    gdp_metadata_view.filter::<gdp_metadata::Region, _>(|val: Value<&String>| val.exists());

The filter method takes two type parameters: the label we’re filtering (gdp_metadata::Region) and the type of the predicate we’re supplying. The compiler is smart enough to figure out the predicate type based on what we pass as an argument, so we can tell the compiler to infer it by using the symbol _ as the second type parameter.
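
The turbofish and the _ placeholder are ordinary Rust features, not something specific to agnes. As a quick refresher, this sketch uses them with two standard library methods; parse_year and double_all are names made up for the example.

```rust
/// Parse text into a number; the turbofish names the target type.
fn parse_year(text: &str) -> Option<u32> {
    text.parse::<u32>().ok()
}

/// Collect into a Vec, using `_` so the compiler infers the element type.
fn double_all(values: &[f64]) -> Vec<f64> {
    values.iter().map(|v| v * 2.0).collect::<Vec<_>>()
}
```

In parse_year the type argument is spelled out in full, while in double_all only the container is named and the element type is left to inference, much like the second type parameter of filter.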

This code also introduces another agnes data type: the Value enum. The Value enum is quite similar to the standard Option enum – it has two variants: Value::Exists(...), which specifies that a value exists and provides the value, and Value::Na, which specifies that the value is missing. In the code above, the predicate is expected to take one argument: a Value type holding a reference to the data that exists in the gdp_metadata::Region field (in this case, String). It is typical that we will be dealing with reference-holding Value objects, since most operations we will perform will not require taking ownership of the data itself. To use the Value type in our code, though, we need to remember to include it in our imports at the top of our file:

use agnes::field::Value;

The predicate we provide, |val: Value<&String>| val.exists(), is fairly simple: only return true for non-missing values.
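
To see how such a predicate behaves, here is a standalone sketch with a simplified stand-in for the Value enum. It mirrors the two variants and the exists method described above, but it is written from scratch for illustration: the as_ref helper and the filter_by_region function are assumptions of this example, and the rows are plain (country code, region) pairs rather than a DataView.

```rust
/// Simplified stand-in for the agnes Value enum (illustration only).
#[derive(Debug, Clone, PartialEq)]
enum Value<T> {
    /// A value is present.
    Exists(T),
    /// The value is missing.
    Na,
}

impl<T> Value<T> {
    /// True when a value is present.
    fn exists(&self) -> bool {
        matches!(self, Value::Exists(_))
    }

    /// Borrow the inner value, analogous to Option::as_ref
    /// (hypothetical helper for this sketch).
    fn as_ref(&self) -> Value<&T> {
        match self {
            Value::Exists(v) => Value::Exists(v),
            Value::Na => Value::Na,
        }
    }
}

/// Consume the rows and keep those whose region satisfies the predicate.
fn filter_by_region<F>(
    rows: Vec<(String, Value<String>)>,
    pred: F,
) -> Vec<(String, Value<String>)>
where
    F: Fn(Value<&String>) -> bool,
{
    rows.into_iter()
        .filter(|(_, region)| pred(region.as_ref()))
        .collect()
}
```

Calling filter_by_region(rows, |val: Value<&String>| val.exists()) keeps only the rows that have a region, which is the same shape of predicate used with the DataView filter method.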

The filter method consumes the DataView and returns the filtered data, so we finish by simply assigning the result into a new variable named gdp_metadata_view, replacing the previous one. After applying this filter, feel free to try printing out the filtered metadata and check to see if the non-country records have been removed!

Joining GDP and Metadata Data

Now that we have a properly filtered metadata DataView, we can perform a join between the metadata and our original GDP data to effectively filter the non-country aggregates out of our GDP data. This takes a single line of code:

let gdp_country_view = gdp_view
    .join::<Join<gdp::CountryCode, gdp_metadata::CountryCode, Equal>, _, _>(&gdp_metadata_view);

Here, we’re again using the ‘turbofish’ syntax to provide type arguments to the join method. In this case, we need to provide the type of the Join struct which specifies how the join should operate. Like label structs, the Join struct is never intended to be instantiated: it’s a marker struct that exists just to tell the compiler the type of join and the fields the join is operating upon. In our case, we’re specifying that we wish to perform an equality join (equijoin) on the gdp::CountryCode and the gdp_metadata::CountryCode fields: Join<gdp::CountryCode, gdp_metadata::CountryCode, Equal>.
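
A marker struct of this kind can be written in plain Rust with PhantomData. The sketch below is a rough illustration of the pattern only; its Join, Equal, LeftCode, RightCode, and JoinKind definitions are stand-ins invented here and should not be read as the definitions agnes uses.

```rust
use std::marker::PhantomData;

// Hypothetical stand-ins for two field labels and a join predicate.
struct LeftCode;
struct RightCode;
struct Equal;

/// Marker type: it stores no data and only records three types.
struct Join<L, R, P> {
    _marker: PhantomData<(L, R, P)>,
}

/// Information that can be read off a join type at compile time.
trait JoinKind {
    const DESCRIPTION: &'static str;
}

impl<L, R> JoinKind for Join<L, R, Equal> {
    const DESCRIPTION: &'static str = "equijoin";
}

/// The join type is passed as a type argument; no value is ever built.
fn describe<J: JoinKind>() -> &'static str {
    J::DESCRIPTION
}
```

Here describe::<Join<LeftCode, RightCode, Equal>>() selects the behaviour purely from the type, and the marker itself occupies zero bytes.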

To use Equal and Join, we make sure to import them at the top of the file:

use agnes::join::{Equal, Join};

The remaining two type parameters of join provide information about the type of the DataView we’re joining (gdp_metadata_view) onto this first DataView (gdp_view). Since we’re providing gdp_metadata_view as a method argument, the compiler knows what type it is and we can again tell the compiler to figure out the relevant types using the placeholder syntax _.

The join method takes both gdp_view and gdp_metadata_view by reference, and creates a new DataView object, not consuming or mutating either of the original DataView objects. It should be noted, however, that the join method does not copy any data; it only provides a new window into the data that was originally loaded from CSV files.
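
One way to picture a join that copies no data is a function that returns only pairs of row indices. The sketch below implements a simple hash-based equijoin over two borrowed key columns. It is an illustration of the general technique; this guide does not describe the algorithm agnes uses internally, and equijoin_indices is a name invented for the example.

```rust
use std::collections::HashMap;

/// Equijoin on string keys: returns (left_row, right_row) index pairs.
/// Only indices are produced; the key strings are borrowed, never cloned.
fn equijoin_indices(left: &[String], right: &[String]) -> Vec<(usize, usize)> {
    // Map each right-hand key to the rows in which it appears.
    let mut lookup: HashMap<&str, Vec<usize>> = HashMap::new();
    for (j, key) in right.iter().enumerate() {
        lookup.entry(key.as_str()).or_default().push(j);
    }

    // For each left-hand row, emit a pair for every matching right-hand row.
    let mut pairs = Vec::new();
    for (i, key) in left.iter().enumerate() {
        if let Some(rows) = lookup.get(key.as_str()) {
            for &j in rows {
                pairs.push((i, j));
            }
        }
    }
    pairs
}
```

Left-hand rows with no match on the right simply produce no pair, which is how joining against the filtered metadata drops the aggregate records.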

Now that we’ve joined these two tables, we can print out the results and see what we’ve done!

println!("{}", gdp_country_view);

The aggregate-based records are indeed gone. You may notice, however, that we have some unnecessary columns in our DataView now – we don’t need the CountryCode and Region columns that were added from the gdp_metadata table after our join. Let’s not worry about that for now, and come back to it in the next section.

Our code so far should look like (also viewable here):

#[macro_use]
extern crate agnes;

use agnes::field::Value;
use agnes::join::{Equal, Join};
use agnes::source::csv::load_csv_from_uri;

tablespace![
    table gdp {
        CountryName: String,
        CountryCode: String,
        Gdp2015: f64,
    }
    table gdp_metadata {
        CountryCode: String,
        Region: String,
    }
];

fn main() {
    let gdp_schema = schema![
        fieldname gdp::CountryName = "Country Name";
        fieldname gdp::CountryCode = "Country Code";
        fieldname gdp::Gdp2015 = "2015";
    ];

    // load the CSV file from a URI
    let gdp_view =
        load_csv_from_uri("https://wee.codes/data/gdp.csv", gdp_schema).expect("CSV loading failed.");

    let gdp_metadata_schema = schema![
        fieldindex gdp_metadata::CountryCode = 0usize;
        fieldname gdp_metadata::Region = "Region";
    ];

    let mut gdp_metadata_view = load_csv_from_uri(
        "https://wee.codes/data/gdp_metadata.csv",
        gdp_metadata_schema
    ).expect("CSV loading failed.");

    let gdp_metadata_view =
        gdp_metadata_view.filter::<gdp_metadata::Region, _>(|val: Value<&String>| val.exists());

    let gdp_country_view = gdp_view
        .join::<Join<gdp::CountryCode, gdp_metadata::CountryCode, Equal>, _, _>(&gdp_metadata_view);

    println!("{}", gdp_country_view);
}

Adding Life Expectancy

Thus far, we’ve loaded up GDP data and successfully filtered out a bunch of unnecessary records. Next, let’s combine it with life expectancy data.

We should have a good idea of how to proceed at this point – we have a new data file to load, so we will need to define the table in our tablespace, write another source schema, and call load_csv_from_uri to load the data. After that, it should be as simple as just joining the life expectancy and GDP views together to get us a combined data view!

So, let’s define our table (in the same tablespace macro we’ve been using):

tablespace![
    /* ... */
    pub table life {
        CountryCode: String,
        Life2015: f64,
    }
];

And write our source schema:

let life_schema = schema![
    fieldname life::CountryCode = "Country Code";
    fieldname life::Life2015 = "2015";
];

And load the file from a URI:

let life_view = load_csv_from_uri(
    "https://wee.codes/data/life.csv",
    life_schema
).expect("CSV loading failed.");

These should all be fairly recognizable at this point; they’re nearly identical to the source loading we did for the GDP and GDP metadata files.

Joining the GDP and life expectancy data views should also be familiar:

let gdp_life_view = gdp_country_view
    .join::<Join<gdp::CountryCode, life::CountryCode, Equal>, _, _>(&life_view);

We can now print it out and take a look!

println!("{}", gdp_life_view);

It seems to have worked! But now, we’ve exacerbated our problem with the extra columns. We really only care about the country name, 2015 GDP, and 2015 life expectancy fields. We have three country code fields and a region field we don’t need anymore!

To fix this, we introduce another DataView method: v (which is shorthand for subview). This method will take a DataView and construct another DataView which only contains a subset of original fields, and we call it like this:

let gdp_life_view = gdp_life_view
    .v::<Labels![gdp::CountryName, gdp::Gdp2015, life::Life2015]>();

We’re again using the ‘turbofish’ syntax to specify type arguments to the v method. In this case, the method takes a list of labels instead of a single label (like we specified in filter or join). agnes provides the Labels macro to construct this label list, which we use to specify that we only want the CountryName and Gdp2015 fields originally from the gdp table, and the Life2015 field from the life table. The v method doesn’t consume the original DataView, but since we no longer need it, we go ahead and store the resultant DataView with the same name, shadowing the original view.
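
As a rough runtime analogy for what a subview does, the sketch below selects a subset of named columns and returns references to them rather than copies. The Column struct and subview function are hypothetical and use string names where agnes uses label types, so a misspelled name is silently skipped here instead of being caught by the compiler.

```rust
/// A named column of text values (illustrative type, not from agnes).
struct Column {
    name: String,
    data: Vec<String>,
}

/// Borrow the columns whose names are listed, in the requested order.
/// The result holds references, so no column data is duplicated.
fn subview<'a>(columns: &'a [Column], keep: &[&str]) -> Vec<&'a Column> {
    keep.iter()
        .filter_map(|name| columns.iter().find(|c| c.name == *name))
        .collect()
}
```

Because the function only borrows columns, the original collection stays usable afterwards, in the same way that v leaves the original DataView intact.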

Now, when we print gdp_life_view, we get a much less cluttered data table.

That’s it for this step! We’ve now added life expectancy data to our DataView and removed extraneous columns. Our code should generally look like this (also viewable here):

#[macro_use]
extern crate agnes;

use agnes::field::Value;
use agnes::join::{Equal, Join};
use agnes::source::csv::load_csv_from_uri;

tablespace![
    table gdp {
        CountryName: String,
        CountryCode: String,
        Gdp2015: f64,
    }
    table gdp_metadata {
        CountryCode: String,
        Region: String,
    }
    pub table life {
        CountryCode: String,
        Life2015: f64,
    }
];

fn main() {
    let gdp_schema = schema![
        fieldname gdp::CountryName = "Country Name";
        fieldname gdp::CountryCode = "Country Code";
        fieldname gdp::Gdp2015 = "2015";
    ];

    // load the GDP CSV file from a URI
    let gdp_view =
        load_csv_from_uri("https://wee.codes/data/gdp.csv", gdp_schema).expect("CSV loading failed.");

    let gdp_metadata_schema = schema![
        fieldindex gdp_metadata::CountryCode = 0usize;
        fieldname gdp_metadata::Region = "Region";
    ];

    // load the metadata CSV file from a URI
    let mut gdp_metadata_view =
        load_csv_from_uri("https://wee.codes/data/gdp_metadata.csv", gdp_metadata_schema)
            .expect("CSV loading failed.");

    let gdp_metadata_view =
        gdp_metadata_view.filter::<gdp_metadata::Region, _>(|val: Value<&String>| val.exists());

    let gdp_country_view = gdp_view
        .join::<Join<gdp::CountryCode, gdp_metadata::CountryCode, Equal>, _, _>(&gdp_metadata_view);

    let life_schema = schema![
        fieldname life::CountryCode = "Country Code";
        fieldname life::Life2015 = "2015";
    ];

    // load the life expectancy CSV file from a URI
    let life_view = load_csv_from_uri(
        "https://wee.codes/data/life.csv",
        life_schema
    ).expect("CSV loading failed.");

    let gdp_life_view = gdp_country_view
        .join::<Join<gdp::CountryCode, life::CountryCode, Equal>, _, _>(&life_view);

    let gdp_life_view = gdp_life_view
        .v::<Labels![gdp::CountryName, gdp::Gdp2015, life::Life2015]>();

    println!("{}", gdp_life_view);
}

Arithmetic Transformation

Our final preprocessing task will be an arithmetic transformation of one of the fields. Let’s say that our downstream code expects the GDP to be measured in euros, not US dollars. Thus, as part of our preprocessing tasks we need to do a quick conversion.

This step introduces a few additional agnes features. We’ll start by adding the appropriate trait imports:

use agnes::select::FieldSelect;
use agnes::access::DataIndex;
use agnes::label::IntoLabeled;
use agnes::store::IntoView;

FieldSelect is a trait for selecting a single field from a DataView, DataIndex is a trait for accessing individual data in a field, IntoLabeled is a trait for adding a label to an unlabeled data field, and IntoView is a trait for turning a field (or other data structure) into a DataView.

Just from this list, it might start to become apparent what our plan is going to be: we’ll select the current GDP field from our DataView, access the data, generate a new transformed field, label this new field, convert it into a new DataView, and then merge the two DataViews using the DataView merge method.

One concern is that we need to be able to label this new field, but don’t have any labels to apply to it. Fortunately, we can use the same method for declaring new labels as we did when declaring the labels for tables we load: the tablespace macro! So let’s add the following to our tablespace call:

tablespace![
    /* ... */
    pub table transformed {
        GdpEUR2015: f64,
    }
];

The table name transformed is arbitrary, and just provides a place to define our newly transformed data field’s label.

Let’s also define a quick conversion function for converting from USD to EUR. For the sake of simplicity, we’re just going to hard-code the conversion factor, but we could load this from a file, or read it from an API, or request it from the user, or any other method we can come up with.

At the time of the writing of this guide, 1 USD = 0.88395 EUR. Thus, our simple hard-coded conversion function is:

fn usd_to_eur(usd: &f64) -> f64 {
    usd * 0.88395
}

Now we can dive into creating the transformed data field:

let transformed_gdp = gdp_life_view
    .field::<gdp::Gdp2015>()
    .iter()
    .map_existing(usd_to_eur)
    .label::<transformed::GdpEUR2015>()
    .into_view();

This statement starts with our combined GDP-life expectancy DataView, and selects a single field Gdp2015 using the field method provided by the FieldSelect trait. Then we create an iterator over the data in this field using the iter method (provided by the DataIndex trait). This iterator provides the method map_existing which applies a function or closure to every existing element in the field (leaving missing data as missing), which we call with our conversion function.
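
The behaviour described for map_existing can be mimicked with the standard Option type. The sketch below is a free function written for this illustration, not the agnes method: it treats None as a missing value and applies the conversion only to values that are present.

```rust
/// Convert US dollars to euros using the guide's hard-coded rate.
fn usd_to_eur(usd: &f64) -> f64 {
    usd * 0.88395
}

/// Apply `f` to every present value, leaving missing values missing.
/// Option stands in for the agnes Value type in this sketch.
fn map_existing<F>(field: &[Option<f64>], f: F) -> Vec<Option<f64>>
where
    F: Fn(&f64) -> f64,
{
    field.iter().map(|v| v.as_ref().map(&f)).collect()
}
```

For example, applying map_existing with usd_to_eur to a field holding Some(100.0) and None gives a field holding roughly Some(88.395) and None.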

Next, we label this transformed data with our new field label, and convert the field (with into_view) into a new DataView object, which is needed to be able to merge it with our existing DataView:

let final_view = gdp_life_view
    .merge(&transformed_gdp)
    .expect("Merge failed.")
    .v::<Labels![gdp::CountryName, transformed::GdpEUR2015, life::Life2015]>();

Here, we merge our new single-field DataView containing our transformed data back on to our DataView with GDP and life expectancy data. The merge method basically adds all the fields in transformed_gdp onto gdp_life_view and returns a new combined DataView. Merging can fail if you try to merge two DataViews with different numbers of rows, but that shouldn’t be a problem here since we just defined this new field and know it has the correct number of rows.
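
The row-count requirement is easy to see in a small model. In the sketch below a table is just a slice of columns, and merge returns an error when the two sides disagree on the number of rows. The function and the RowCountMismatch type are invented for this example, and unlike a DataView merge this version clones the column data to keep the code short.

```rust
/// Error returned when the two sides have different numbers of rows.
#[derive(Debug, PartialEq)]
struct RowCountMismatch {
    left: usize,
    right: usize,
}

/// Append the columns of `right` to those of `left`.
/// Each inner Vec is one column; all columns in a table share a length.
fn merge(
    left: &[Vec<f64>],
    right: &[Vec<f64>],
) -> Result<Vec<Vec<f64>>, RowCountMismatch> {
    // Use the first column of each side to determine its row count.
    let left_rows = left.first().map_or(0, |c| c.len());
    let right_rows = right.first().map_or(0, |c| c.len());
    if left_rows != right_rows {
        return Err(RowCountMismatch {
            left: left_rows,
            right: right_rows,
        });
    }
    Ok(left.iter().chain(right.iter()).cloned().collect())
}
```

A column derived element by element from an existing column, as transformed_gdp is, always has the same length as its source, so the error branch is not expected to trigger in this guide.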

Finally, since we don’t need the original Gdp2015 field anymore, we perform another subview operation, only choosing the CountryName, GdpEUR2015, and Life2015 columns.

Final Preprocessor

We’ve done it! We now have a preprocessor application which takes original source GDP and life expectancy data, removes unnecessary records and columns, combines our data sources into a single view of the data, and performs a minor arithmetic transformation on one of the fields. Now we can serialize this data into whatever format we need for downstream activities: visualization, regression, storage, whatever!

Our final application code should look like this (also viewable here):

#[macro_use]
extern crate agnes;

use agnes::access::DataIndex;
use agnes::field::Value;
use agnes::join::{Equal, Join};
use agnes::label::IntoLabeled;
use agnes::select::FieldSelect;
use agnes::source::csv::load_csv_from_uri;
use agnes::store::IntoView;

tablespace![
    table gdp {
        CountryName: String,
        CountryCode: String,
        Gdp2015: f64,
    }
    table gdp_metadata {
        CountryCode: String,
        Region: String,
    }
    pub table life {
        CountryCode: String,
        Life2015: f64,
    }
    pub table transformed {
        GdpEUR2015: f64,
    }
];

fn usd_to_eur(usd: &f64) -> f64 {
    usd * 0.88395
}

fn main() {
    let gdp_schema = schema![
        fieldname gdp::CountryName = "Country Name";
        fieldname gdp::CountryCode = "Country Code";
        fieldname gdp::Gdp2015 = "2015";
    ];

    // load the GDP CSV file from a URI
    let gdp_view =
        load_csv_from_uri("https://wee.codes/data/gdp.csv", gdp_schema).expect("CSV loading failed.");

    let gdp_metadata_schema = schema![
        fieldindex gdp_metadata::CountryCode = 0usize;
        fieldname gdp_metadata::Region = "Region";
    ];

    // load the metadata CSV file from a URI
    let mut gdp_metadata_view =
        load_csv_from_uri("https://wee.codes/data/gdp_metadata.csv", gdp_metadata_schema)
            .expect("CSV loading failed.");

    let gdp_metadata_view =
        gdp_metadata_view.filter::<gdp_metadata::Region, _>(|val: Value<&String>| val.exists());

    let gdp_country_view = gdp_view
        .join::<Join<gdp::CountryCode, gdp_metadata::CountryCode, Equal>, _, _>(&gdp_metadata_view);

    let life_schema = schema![
        fieldname life::CountryCode = "Country Code";
        fieldname life::Life2015 = "2015";
    ];

    // load the life expectancy file from a URI
    let life_view = load_csv_from_uri("https://wee.codes/data/life.csv", life_schema)
        .expect("CSV loading failed.");

    let gdp_life_view =
        gdp_country_view.join::<Join<gdp::CountryCode, life::CountryCode, Equal>, _, _>(&life_view);

    let gdp_life_view =
        gdp_life_view.v::<Labels![gdp::CountryName, gdp::Gdp2015, life::Life2015]>();

    let transformed_gdp = gdp_life_view
        .field::<gdp::Gdp2015>()
        .iter()
        .map_existing(usd_to_eur)
        .label::<transformed::GdpEUR2015>()
        .into_view();

    let final_view = gdp_life_view
        .merge(&transformed_gdp)
        .expect("Merge failed.")
        .v::<Labels![gdp::CountryName, transformed::GdpEUR2015, life::Life2015]>();

    println!("{}", final_view);
}