Introduction to data validation
Data validation is essentially unit testing for data (analogous to unit tests for code): it tells you when the data does not match your expectations. The more granular the tests, the more precisely you can pinpoint a data quality issue when something doesn’t match up. With Rebase Validators, users can specify validators (sometimes called constraints or expectations) on their data, which can be used to trigger alerts and monitor data quality over time. For instance, validators enable users to:
- Detect data quality issues such as missing values, flatliners and wrong formats
- Detect data (distribution) drift
- Detect concept drift and/or model performance degradation
- Detect anomalies such as under-performing energy assets or malfunctioning sensors
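To make the "unit tests for data" idea concrete, here is a minimal sketch in plain Python. It does not use the Rebase API: the names `CheckResult`, `check_no_missing`, `check_within_bounds`, `check_no_flatline` and `run_checks`, as well as the thresholds, are hypothetical and chosen only for illustration. Each check tests one narrow expectation, so a failure points directly at the kind of issue involved.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence

# Hypothetical illustration only; none of these names belong to the Rebase API.
Value = Optional[float]  # None marks a missing reading


@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str


def check_no_missing(values: Sequence[Value]) -> CheckResult:
    """Pass only if no reading is missing."""
    missing = sum(1 for v in values if v is None)
    return CheckResult("no_missing", missing == 0, f"{missing} missing value(s)")


def check_within_bounds(values: Sequence[Value], lower: float, upper: float) -> CheckResult:
    """Pass only if every non-missing reading lies in [lower, upper]."""
    outside = [v for v in values if v is not None and not (lower <= v <= upper)]
    return CheckResult("within_bounds", not outside, f"{len(outside)} value(s) out of bounds")


def check_no_flatline(values: Sequence[Value], max_run: int) -> CheckResult:
    """Pass only if no value repeats more than max_run times in a row."""
    longest = run = 0
    previous: object = object()  # sentinel that equals no reading
    for v in values:
        if v is not None and v == previous:
            run += 1
        else:
            run = 0 if v is None else 1
        previous = v
        longest = max(longest, run)
    return CheckResult("no_flatline", longest <= max_run, f"longest constant run: {longest}")


def run_checks(values: Sequence[Value]) -> List[CheckResult]:
    """Run a small suite of checks; bounds and run length are assumed values."""
    return [
        check_no_missing(values),
        check_within_bounds(values, lower=0.0, upper=100.0),
        check_no_flatline(values, max_run=3),
    ]


if __name__ == "__main__":
    readings = [10.0, 12.5, None, 50.0, 50.0, 50.0, 50.0, 180.0]
    for result in run_checks(readings):
        status = "PASS" if result.passed else "FAIL"
        print(f"{status} {result.name}: {result.detail}")
```

Keeping each check small is the design choice that matters here: a single combined "is the data OK?" test would fail on these readings without saying whether the cause is a gap, an out-of-range value or a stuck sensor.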
Defining and running data validations
Validations and validation suites are created through the rb.Validator object. You can
either use one of the pre-defined validations or define your own custom validation.
Here is a simple example that checks whether a specific column is within pre-defined bounds: