
Introduction to data validation

Data validation is essentially unit testing for data: like unit tests for code, it tells you when the data does not match your expectations. The more granular the tests, the more precisely you can pinpoint a data quality issue when something does not match. With Rebase Validators, users can specify validators (sometimes called constraints or expectations) on their data, which can be used to trigger alerts and monitor data quality over time. This enables users, for instance, to:
  • Detect data quality issues such as missing values, flatliners and wrong formats
  • Detect data (distribution) drift
  • Detect concept drift and/or model performance degradation
  • Detect anomalies such as under-performing energy assets or malfunctioning sensors
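
To make the first item concrete, here is a minimal sketch of what checks for missing values, flatliners and wrong formats could look like in plain pandas. It is independent of Rebase and does not show how the library implements its validators. The function names and the `max_run` threshold are hypothetical and chosen only for illustration.

```python
import pandas as pd


def has_missing_values(series: pd.Series) -> bool:
    """Return True if any entry is NaN or None."""
    return bool(series.isna().any())


def longest_constant_run(series: pd.Series) -> int:
    """Return the length of the longest run of identical consecutive values."""
    if series.empty:
        return 0
    # A new run starts wherever a value differs from the previous one.
    run_id = (series != series.shift()).cumsum()
    return int(run_id.value_counts().max())


def is_flatlined(series: pd.Series, max_run: int = 3) -> bool:
    """Return True if a value repeats for more than max_run consecutive rows."""
    # max_run is an assumed threshold; a sensible value depends on the sensor.
    return longest_constant_run(series) > max_run


def has_wrong_format(series: pd.Series) -> bool:
    """Return True if any non-missing entry cannot be parsed as a number."""
    parsed = pd.to_numeric(series, errors="coerce")
    return bool((parsed.isna() & series.notna()).any())


power = pd.Series([10.0, 12.5, 12.5, 12.5, 12.5, None, 9.0])
print(has_missing_values(power))
print(is_flatlined(power))
print(has_wrong_format(pd.Series(["1.5", "abc", None])))
```

Each function returns a single boolean per column, which is the granularity at which an alert could be raised.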

Defining and running data validations

Validations and validation suites are created through the rb.Validator object. You can either use one of the pre-defined validations or define your own custom validation. Here is a simple example that checks whether the values of a specific column lie within pre-defined bounds:
import pandas as pd
import rebase as rb

# Load the data to validate
df = pd.read_csv("your_csv_file.csv")

# Register a single validation: power_production must lie between 0 and 100
validator = rb.Validator()
validator.column_between_values(column="power_production", low=0, high=100)

# Run the validation against the DataFrame
result = validator.run(df)
Several validations can be packaged into a validation suite by adding them one after another:
import pandas as pd
import rebase as rb

# Load the data to validate
df = pd.read_csv("your_csv_file.csv")

# Each call adds one validation to the suite
validator = rb.Validator()
validator.column_between_values(column="power_production", low=0, high=100)
validator.column_not_nan(column="power_production")
validator.column_not_flatline(column="power_production")

# Run all registered validations against the DataFrame
result = validator.run(df)
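
The suite pattern above can be sketched without the library. The `SimpleSuite` class below is a hypothetical stand-in, not the Rebase implementation: it only illustrates how checks might be registered one after another and then evaluated together. The format of the result returned by `rb.Validator.run` is not described here, so the dictionary of booleans is an assumption.

```python
from typing import Callable, Dict, List, Tuple

import pandas as pd


class SimpleSuite:
    """Hypothetical stand-in that illustrates the suite pattern."""

    def __init__(self) -> None:
        # Each entry pairs a readable name with a function of the DataFrame.
        self._checks: List[Tuple[str, Callable[[pd.DataFrame], bool]]] = []

    def column_between_values(self, column: str, low: float, high: float) -> None:
        # Missing values are ignored here; column_not_nan covers them.
        self._checks.append((
            f"{column} between {low} and {high}",
            lambda df: bool(df[column].dropna().between(low, high).all()),
        ))

    def column_not_nan(self, column: str) -> None:
        self._checks.append((
            f"{column} not NaN",
            lambda df: bool(df[column].notna().all()),
        ))

    def run(self, df: pd.DataFrame) -> Dict[str, bool]:
        # Evaluate every registered check and report pass/fail per check.
        return {name: check(df) for name, check in self._checks}


df = pd.DataFrame({"power_production": [0.0, 42.0, 120.0, None]})

suite = SimpleSuite()
suite.column_between_values(column="power_production", low=0, high=100)
suite.column_not_nan(column="power_production")

result = suite.run(df)
print(result)
```

Reporting one outcome per check, rather than a single pass/fail for the whole suite, keeps the granularity that makes a failing validation easy to trace.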

Creating custom data validations

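The example below subclasses rb.Validator and adds a method that checks whether a float column contains only whole numbers:
import pandas as pd
import rebase as rb

df = pd.read_csv("your_csv_file.csv")

class Validator(rb.Validator):
    def check_if_float_column_contains_only_integers(self, column):
        # True only if every value in the float column is a whole number.
        # df is the module-level DataFrame loaded above.
        check = df[column].apply(float.is_integer).all()

        return check

validator = Validator()
# The custom check takes only a column name, so no low/high bounds are passed.
validator.check_if_float_column_contains_only_integers(column="power_production")

result = validator.run(df)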
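
The logic of the custom check can be tried on its own, outside the validator class. The sketch below uses only pandas; the function name and the sample frames are hypothetical. It assumes the column has a float dtype, since `float.is_integer` is a method of Python floats.

```python
import pandas as pd


def float_column_contains_only_integers(df: pd.DataFrame, column: str) -> bool:
    """Return True if every value in a float column is a whole number."""
    # Assumes a float dtype column; NaN is not a whole number, so it fails the check.
    return bool(df[column].apply(float.is_integer).all())


whole = pd.DataFrame({"power_production": [1.0, 2.0, 3.0]})
fractional = pd.DataFrame({"power_production": [1.0, 2.5, 3.0]})
with_gap = pd.DataFrame({"power_production": [1.0, float("nan")]})

print(float_column_contains_only_integers(whole, "power_production"))
print(float_column_contains_only_integers(fractional, "power_production"))
print(float_column_contains_only_integers(with_gap, "power_production"))
```

Because a missing value makes the check fail, it may be worth pairing it with a separate not-NaN validation so the two causes can be told apart.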