Till Döhmen (Fraunhofer FIT), Mark Raasveldt (CWI), Hannes Mühleisen (CWI), Sebastian Schelter (UvA)

Abstract

Data quality validation plays an important role in ensuring the proper behaviour of production machine learning (ML) applications and services. Observing a lack of existing solutions for quality control in medium-sized production systems, we developed DuckDQ, a lightweight and efficient Python library for data quality validation. It integrates seamlessly with scikit-learn ML pipelines and does not require a distributed computing environment or ML platform infrastructure, while outperforming existing solutions by a factor of 3 to 40 in terms of runtime. We introduce the notion of data quality assertions, which can stop a pipeline when quality constraints on the input data or the model’s output are not met. Furthermore, we employ stateful metric computations, which greatly enhance the possibilities for post hoc failure analysis and drift detection, even when the serving data is no longer available.
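The two ideas named in the abstract, assertions that halt a pipeline and stateful metrics that outlive the data, can be illustrated with a minimal sketch. This is not DuckDQ's actual API: `ColumnState`, `assert_quality`, and `DataQualityError` are hypothetical names chosen for illustration, and the sketch uses only the Python standard library.

```python
from dataclasses import dataclass


class DataQualityError(Exception):
    """Raised when a data quality constraint is violated (hypothetical name)."""


@dataclass
class ColumnState:
    """Mergeable summary of one numeric column (hypothetical, not DuckDQ's API)."""
    count: int = 0       # rows seen, including missing values
    non_null: int = 0    # rows where a value is present
    total: float = 0.0   # sum of the present values

    def update(self, values):
        """Fold a batch of values into the state; None marks a missing value."""
        for v in values:
            self.count += 1
            if v is not None:
                self.non_null += 1
                self.total += v
        return self

    def merge(self, other):
        """Combine two states without access to the original rows."""
        return ColumnState(self.count + other.count,
                           self.non_null + other.non_null,
                           self.total + other.total)

    @property
    def completeness(self):
        return self.non_null / self.count if self.count else 1.0

    @property
    def mean(self):
        return self.total / self.non_null if self.non_null else float("nan")


def assert_quality(state, min_completeness=1.0, mean_range=None):
    """Raise DataQualityError if the summarized column violates a constraint."""
    if state.completeness < min_completeness:
        raise DataQualityError(
            f"completeness {state.completeness:.2f} < {min_completeness:.2f}")
    if mean_range is not None:
        low, high = mean_range
        if not (low <= state.mean <= high):
            raise DataQualityError(
                f"mean {state.mean} outside [{low}, {high}]")
    return state


# Two serving batches are summarized; only the small states need to be stored.
monday = ColumnState().update([1.0, 2.0, None, 3.0])
tuesday = ColumnState().update([4.0, 6.0])

assert_quality(tuesday, min_completeness=0.9)  # passes, pipeline continues

try:
    assert_quality(monday, min_completeness=0.9)  # one of four values missing
except DataQualityError as err:
    print("pipeline stopped:", err)

# Later analysis works on the stored states, not on the original rows.
week = monday.merge(tuesday)
print(week.count, week.completeness, week.mean)
```

Because each state holds only counts and sums, it can be persisted cheaply and merged or compared across batches after the serving data itself has been discarded, which is the property that makes post hoc failure analysis and drift detection possible in this sketch.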
