Mariam Barry
Adaptive Scalable Online Learning for Handling Heterogeneous Streaming Data in Large-Scale Banking Infrastructure
In this thesis, we have addressed different algorithmic and infrastructure challenges faced when dealing with online machine learning capabilities over high-volume data streams from heterogeneous sources. The research encompasses big data summarization, the construction of industrial knowledge graphs dynamically updated, online change detection, and the operationalization of streaming models in production. Initially, we introduced StreamFlow, an incremental algorithm and a system for big data summarization, generating feature vectors suitable for both batch and online machine learning tasks. These enriched features significantly enhance the performance of both time and accuracy for training batch and online machine-learning models. Subsequently, we proposed Stream2Graph, a stream-based solution facilitating the dynamic and incremental construction and updating of enterprise knowledge graphs. Experimental results indicated that leveraging graph features in conjunction with online learning notably enhances machine learning outcomes. Thirdly, we presented StreamChange, an explainable online change detection model designed for big data streaming, featuring constant space and time complexity. Real-world experiments demonstrated superior performance compared to state-of-the-art models, particularly in detecting both gradual and abrupt changes. Lastly, we demonstrated the operationalization of online machine learning in production, enabling horizontal scaling and incremental learning from streaming data in real-time. Experiments utilizing feature-evolving datasets with millions of dimensions validated the effectiveness of our MLOps pipelines. Our design ensures model versioning, monitoring, audibility, and reproducibility, affirming the efficiency of employing online learning models over batch methods in terms of both time and space complexity.