Mats Eikeland Mollestad

How I Accidentally Created the “JDSL” of Data Pipelines - And It's Awesome

14 Jan 2024

I ended up creating a data pipeline tool where everything is programmed in JSON, becoming the “JDSL” of data processing. And I promise, it’s actually quite awesome!

But Why Did I Create This?

The short answer: I just felt like it. Because why not? It was a crazy challenge, after all.

That said, the motivation for such a system came from my work at Otovo, where I was the ML engineer responsible for creating ML models that operated on both batch and streaming data.

However, we hadn’t implemented any robust tools for creating data pipelines, and I was the only one tasked with addressing this issue. This might have been the point where I should have opted for a well-known data processing tool like Spark or Flink, but then came the challenge of training-serving skew in ML.

This skew occurs when an ML model is trained on one machine but runs on another, potentially using a different programming language. This means the data transformations need to be replicated in each environment. For example, a training job might process data in SQL and Python, while production might use Scala and Python. Replicating complex mathematical functions by hand can quickly lead to slightly different results, rendering the ML model effectively useless.
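
To make the drift concrete, here is a small, contrived example (not from the actual pipelines, just an illustration): two seemingly equivalent implementations of a "standardize the age column" transformation disagree, because pandas defaults to the sample standard deviation while NumPy defaults to the population standard deviation.

import numpy as np
import pandas as pd

ages = pd.Series([22.0, 38.0, 26.0, 35.0])

# "Training" implementation: pandas .std() defaults to the sample std (ddof=1)
trained_input = (ages - ages.mean()) / ages.std()

# "Serving" re-implementation: np.std() defaults to the population std (ddof=0)
served_input = (ages.to_numpy() - ages.to_numpy().mean()) / np.std(ages.to_numpy())

# The two results disagree slightly - enough to silently degrade a model
print(trained_input.to_numpy() - served_input)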

Therefore, I needed a way to ensure the transformations were aligned on every machine, regardless of the programming language or version they were running on.

The Crazy Idea

This is where I had my crazy idea: each data transformation can be represented with very little information. For instance, if we want to multiply the “amount” column by 10, we can represent it as a “multiply column with constant value” transformation and provide the column name (“amount”) and the value (10). The same concept applies to adding two columns - we just need to know which two columns to use. Or, for something more advanced, like generating an embedding, we need to know which embedding model to use, and for which column.

In this way, each transformation would be technology-agnostic, and therefore shareable across machines.
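
As a minimal sketch of the idea (with hypothetical names, not the actual format I ended up with): because the transformation is just data, any engine can interpret it and produce the same result.

import pandas as pd
import polars as pl

# A "multiply column with constant value" transformation described as plain data
transformation = {"name": "multiply_constant", "key": "amount", "value": 10}

def apply_with_pandas(df: pd.DataFrame, t: dict) -> pd.DataFrame:
    if t["name"] == "multiply_constant":
        out = df.copy()
        out[t["key"]] = out[t["key"]] * t["value"]
        return out
    raise ValueError(f"Unknown transformation: {t['name']}")

def apply_with_polars(df: pl.DataFrame, t: dict) -> pl.DataFrame:
    if t["name"] == "multiply_constant":
        return df.with_columns((pl.col(t["key"]) * t["value"]).alias(t["key"]))
    raise ValueError(f"Unknown transformation: {t['name']}")

data = {"amount": [1.0, 2.5, 4.0]}
assert (apply_with_pandas(pd.DataFrame(data), transformation)["amount"].tolist()
        == apply_with_polars(pl.DataFrame(data), transformation)["amount"].to_list())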

With this in mind, I needed a way to describe these transformations. This led to the use of JSON. Meaning, we actually program ETL pipelines by defining JSON! And suddenly, we ended up as the JDSL of data pipelines.

{
    "name": "is_female",
    "transformation": {
        "key": "sex", 
        "name": "equals", 
        "value": {"name": "string", "value": "female"}
    },
    "depending_on": [
        {"name": "sex", ...}
    ],
    ...
}

From here I could load the file into Python and run my transformations with my processing engine of choice! Not bad!

from aligned import FileSource

transformations = await FileSource.json_at(
    "my/transformations/v1.json"
).feature_store()

input_data = {
    "passenger_id": [1, 2, ...]
    "age": [10, 20, ...],
    "sex": ["male", "female", ...]
}
transformed_data = (await transformations.feature_view("titanic")
    .process_input(input_data)
    .to_polars()
)

JSON Was a Bit Too Crazy

However, programming directly in JSON is not practical. It would lead to many frustrating formatting errors. There was no linting to check that the file was semantically correct, and no code completion to indicate which parameters should be set.

Therefore, I also set up a Python API that provides type-safe transformations, code completion, and then compiles everything to the JSON file.

from aligned import Int32, Float, String, feature_view

@feature_view(...)
class TitanicPassenger:    
    passenger_id = Int32()
        
    sex = String().accepted_values(["male", "female"])
    is_male, is_female = sex.one_hot_encode(['male', 'female'])
    
    age = Float().description("Contains some 1.2 values, so needs to be a float")
    
    sibsp = Int32().description("Number of siblings on the Titanic")
    has_siblings = sibsp > 0

Notice that we are not referencing columns using strings, but rather using the fields themselves! This leads to both code completion and “type-safety”, as linters can catch errors.
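
Under the hood, the trick that makes this possible is fairly simple. Here is a hypothetical sketch of the general pattern (not aligned's actual internals): field objects overload operators like >, so an expression such as sibsp > 0 returns a description of a transformation instead of computing a value - and that description is what gets compiled to JSON.

from dataclasses import dataclass

@dataclass
class Field:
    name: str

    def __gt__(self, value):
        # Build a serializable description instead of evaluating anything
        return {"name": "greater_than", "key": self.name, "value": value}

sibsp = Field("sibsp")
print(sibsp > 0)
# {'name': 'greater_than', 'key': 'sibsp', 'value': 0}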

It Was Way Better Than What I Thought

Although the JSON file contains a lot of information, it is still just a file. So, what real use could it have? After some time, I realized that its value depended on the context in which it was used. Now, the file has become my most valuable asset, as it powers all of the features below.

  • Data Validation (both data types and semantic checks)
  • Data Lineage Graphs
  • Data Catalogs
  • Data Freshness Checks
  • Debugging Specific Feature Transformations
  • Continuing Pipelines from Cached States
  • Incremental Data Materialization
  • Warning about Data Migration Conflicts

And those are only the data engineering features, without even taking the MLOps features into account.
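
As an illustration of the first point in the list above, here is a heavily simplified, hypothetical sketch (not aligned's actual file format or API) of how semantic data validation falls out of having such a schema file: the accepted values for “sex” are already described, so checking a new batch of data is just a matter of interpreting the JSON.

import json

# Hypothetical, simplified schema in the spirit of the JSON shown earlier
schema = json.loads("""
{
    "features": [
        {"name": "sex", "dtype": "string", "accepted_values": ["male", "female"]},
        {"name": "age", "dtype": "float"}
    ]
}
""")

def validate(batch: dict) -> list:
    """Return human-readable violations for a batch of column data."""
    errors = []
    for feature in schema["features"]:
        name = feature["name"]
        if name not in batch:
            errors.append(f"missing column '{name}'")
            continue
        allowed = feature.get("accepted_values")
        if allowed is not None:
            unexpected = [value for value in batch[name] if value not in allowed]
            if unexpected:
                errors.append(f"'{name}' contains unexpected values: {unexpected}")
    return errors

print(validate({"sex": ["male", "dog"], "age": [10.0, 20.0]}))
# ["'sex' contains unexpected values: ['dog']"]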

But there’s so much more! Having a file that contains all these transformations, input features, etc., in a technology-agnostic way has led to an incredibly flexible system. It has enabled me to move much faster than I previously thought possible.

So even though this project started as a crazy challenge for fun, it has also turned out to be one of the most fun and exciting projects I have worked on!

So why not give aligned a try and see what you think? Or check out the data catalog!