
Introduction to data pipelines

Data pipelines consist of consecutive steps (more precisely, directed acyclic graphs, or DAGs) that each serve a specific purpose. For instance, an energy forecasting pipeline could consist of a data loading step, a data preprocessing step, a prediction step and a data saving step (sketched below). There are several reasons why it makes sense to split your code into a pipeline consisting of several steps:
  • The code becomes more structured and readable
  • Get a better code overview through pipeline visualization
  • It becomes easier to localize errors and debug
  • The code becomes more modular and reusable
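As a concrete illustration of the energy forecasting pipeline mentioned above, here is a minimal sketch written as plain Python functions chained together. The function names, the hard-coded sample data and the naive "repeat the last value" forecast are hypothetical placeholders, not part of Rebase Pipelines:
# energy_forecast_sketch.py
# Hypothetical illustration of a four-step pipeline: load -> preprocess -> predict -> save.

def load_data() -> list[float]:
    # In practice this would read from a file or database.
    return [12.0, 14.5, 13.8, 15.2]

def preprocess(values: list[float]) -> list[float]:
    # Simple normalization as a stand-in for real preprocessing.
    peak = max(values)
    return [v / peak for v in values]

def predict(values: list[float]) -> float:
    # Naive forecast: repeat the most recent (normalized) value.
    return values[-1]

def save(result: float) -> None:
    # In practice this would write to storage; here we just print it.
    print(f"Forecast: {result:.2f}")

if __name__ == "__main__":
    # Each step feeds the next, forming a simple directed acyclic graph.
    raw = load_data()
    clean = preprocess(raw)
    forecast = predict(clean)
    save(forecast)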

Defining data pipelines

Rebase Pipelines enables users to define executable pipelines using only a couple of decorators. A pipeline combines several steps, each representing an individual task, to create workflows. Here is an example of a simple pipeline that consists of two steps chained together:
# pipeline.py
from rebase.pipeline import pipeline, step

@step
def step1(a: float, b: float) -> float:
    # Sum of the squares of the two legs
    return a**2 + b**2

@step
def step2(c2: float) -> float:
    # Square root of the sum of squares
    return c2**(1/2)

@pipeline
def calculate_hypotenuse(a: float, b: float) -> float:
    # Chain the two steps: the output of step1 feeds into step2
    c2 = step1(a, b)
    c = step2(c2)
    return c

if __name__ == "__main__":
    c = calculate_hypotenuse(a=3, b=4)
    print(f"The hypotenuse is: {c}")
The @step and @pipeline decorators turn regular Python functions into a step and a pipeline, respectively.

Running data pipelines locally

A pipeline can be executed locally from the command line with:
python pipeline.py
To run the pipeline programmatically (e.g. in a notebook), simply call the Python function that defines the pipeline:
c = calculate_hypotenuse(a=3, b=4)
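For example, assuming the module above is saved as pipeline.py, a notebook cell could import and run it like this (the printed value follows from 3² + 4² = 25):
from pipeline import calculate_hypotenuse

c = calculate_hypotenuse(a=3, b=4)
print(c)  # 5.0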

Running data pipelines remotely

To run a pipeline remotely, simply replace python with rb run on the command line:
rb run pipeline.py
or call it programmatically:
rb.run(pipeline=calculate_hypotenuse, inputs={"a": 3, "b": 4})
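As a minimal sketch, a remote run from a script or notebook might look as follows. Note that the import path for the rb client is an assumption here and may differ, so consult the Rebase documentation for the exact module name:
# remote_run_sketch.py -- a sketch; the `import rebase as rb` line is an assumption
import rebase as rb  # hypothetical import path for the rb client

from pipeline import calculate_hypotenuse

# Submit the pipeline for remote execution with keyword inputs
rb.run(pipeline=calculate_hypotenuse, inputs={"a": 3, "b": 4})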

Creating reproducible pipeline runs

One of the main benefits of using Rebase Pipelines is that reproducibility comes built in. In essence, reproducibility means coherently versioning your code, your data (inputs and outputs) and your model artifacts in one place, so that you (or someone else) can reproduce the same results at a later time.