Machine learning pipelines are commonly evaluated on model accuracy, yet much of the uncertainty that shapes real-world outcomes originates not in the model itself but in the data engineering decisions that precede it. This talk examines two manifestations of that upstream uncertainty and their downstream consequences for trustworthy AI.

The first concerns missing value imputation. Real-world missingness is rarely as clean as Rubin’s classic MCAR/MAR/MNAR taxonomy assumes: datasets exhibit multi-mechanism missingness and missingness shift between training and deployment. Drawing on the Shades-of-Null evaluation suite and the VirnyFlow framework, I will show how imputation strategies induce variation in predictive accuracy, model stability, and fairness across demographic groups, and that these dimensions do not move together. No single best imputer exists; the right choice depends on the missingness regime, the model class, and the stakeholder context.
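As a rough illustration of the kind of effect discussed here, the sketch below compares two imputation strategies on downstream accuracy and a simple group fairness gap. This is a minimal example, not the Shades-of-Null or VirnyFlow code: the synthetic data, the choice of mean vs. iterative imputation, and the demographic-parity gap as a fairness proxy are assumptions made for the example.

```python
# Minimal sketch: how the choice of imputer can move accuracy and a fairness gap.
# Data, features, and the protected "group" attribute are all synthetic/hypothetical.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, IterativeImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n = 2000
X = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
group = rng.integers(0, 2, size=n)                       # hypothetical protected attribute
y = (X["x1"] + 0.5 * X["x2"] + 0.3 * group
     + rng.normal(scale=0.5, size=n) > 0).astype(int)
X.loc[rng.random(n) < 0.3, "x2"] = np.nan                # inject missingness in one feature

X_tr, X_te, y_tr, y_te, g_tr, g_te = train_test_split(X, y, group, random_state=0)

for name, imputer in [("mean", SimpleImputer(strategy="mean")),
                      ("iterative", IterativeImputer(random_state=0))]:
    Xi_tr = imputer.fit_transform(X_tr)                  # imputer is fit on training data only
    Xi_te = imputer.transform(X_te)
    pred = LogisticRegression().fit(Xi_tr, y_tr).predict(Xi_te)
    acc = accuracy_score(y_te, pred)
    # crude fairness proxy: difference in positive prediction rates between groups
    gap = abs(pred[g_te == 0].mean() - pred[g_te == 1].mean())
    print(f"{name:9s}  accuracy={acc:.3f}  demographic-parity gap={gap:.3f}")
```

Even in this toy setting, accuracy and the fairness gap need not improve together when the imputer changes, which is the point the talk makes at much larger scale.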

The second manifestation concerns post-hoc explanations. SHAP-based feature attributions are increasingly used to justify decisions and satisfy regulatory requirements, yet they are surprisingly fragile. Routine transformations such as bucketizing a continuous feature or encoding a categorical one can drastically shift which features are deemed most important, opening the door to inadvertent or adversarial manipulation. Compounding this is explanation multiplicity: substantial disagreement in feature attributions across repeated runs of the same pipeline, with the model and input held fixed. This multiplicity is widespread even for high-confidence predictions, and it is systematically masked by commonly used metrics.
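The sketch below gives a flavor of explanation multiplicity for a fixed model and a fixed input. It is not the talk's setup: the synthetic data is assumed, and the run-to-run variation here comes from re-sampling the background data used by shap.KernelExplainer on each run.

```python
# Minimal sketch: repeated explanation of the SAME model and SAME input can disagree.
# Variation comes from re-sampling the KernelExplainer background set each run.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
import shap

rng = np.random.default_rng(0)
n, d = 1000, 5
X = rng.normal(size=(n, d))
y = (X[:, 0] + 0.5 * X[:, 1] - 0.5 * X[:, 2] > 0).astype(int)

model = RandomForestClassifier(random_state=0).fit(X, y)  # model trained once, held fixed
x = X[:1]                                                 # one fixed input to explain

rankings = []
for seed in range(5):
    background = X[np.random.default_rng(seed).choice(n, 50, replace=False)]
    explainer = shap.KernelExplainer(lambda z: model.predict_proba(z)[:, 1], background)
    phi = explainer.shap_values(x, nsamples=200)           # attributions for the fixed input
    rankings.append(np.argsort(-np.abs(phi[0])))            # features ranked by |attribution|

# If explanations were stable, the top-ranked feature would agree across runs.
print("top feature per run:", [int(r[0]) for r in rankings])
```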

Together, these results argue for treating data engineering as a first-class source of uncertainty in responsible ML, and for evaluation frameworks that make this uncertainty visible, measurable, and governable.

Bio

Dr. Julia Stoyanovich is Institute Associate Professor of Computer Science and Engineering, Associate Professor of Data Science, and Director of the Center for Responsible AI (r-ai.co) at New York University. Her mission is to make responsible AI synonymous with AI. She pursues this goal through academic research, education, technology policy, and public engagement. Her research spans data management and AI systems, as well as the ethics and governance of AI. In addition to academic publications, Julia has written for press outlets including the New York Times, the Wall Street Journal, and Le Monde. She holds an M.S. and Ph.D. in Computer Science from Columbia University, and a B.S. in Computer Science and in Mathematics and Statistics from the University of Massachusetts Amherst. She is a recipient of the Presidential Early Career Award for Scientists and Engineers (PECASE) and a Senior Member of the Association for Computing Machinery (ACM).
