Data Integration: Remaining Challenges and Research Paths

Data integration (DI) has been a cornerstone of computer science research for decades, resulting in several established reference architectures. They generally fall into three categories: virtual (federated and mediated), physical (data warehouse), and hybrid (data lake, data lakehouse, and data mesh). Regardless of the paradigm, these architectures depend on an integration layer, implemented by sophisticated software designed to orchestrate and execute DI processes. The integration layer is responsible for ingesting data from various sources (typically heterogeneous and distributed) and for homogenizing the data into formats suitable for further processing and analysis. On the one hand, large volumes of highly heterogeneous data are produced in all business domains, e.g., in medical systems, smart cities, and smart agriculture, which requires further advances in data integration technologies. On the other hand, the widespread adoption of artificial intelligence (AI) solutions is now extending towards DI, offering alternative solutions, opening new research paths, and generating new open problems. Emerging paradigms, such as Data Spaces and the Model Context Protocol, further advance DI.

In this talk, I will: (1) give an overview of the research field of DI, (2) highlight remaining challenges, and (3) outline ML/AI solutions for DI. The findings presented in the talk are based on my experience in running DI research and development projects for various business entities.

Bio

Robert Wrembel (PhD, Dr. Habil.) is a professor in the Faculty of Computing and Telecommunications at Poznan University of Technology (PUT), Poland. He received his habilitation in 2008, specializing in database systems and data warehouses. His primary research interests include data integration, data quality, databases, data warehouses, and data lakes. He has held several administrative roles at PUT, including two terms as deputy dean of the Faculty of Computing and Management (2008–2012) and the Faculty of Computing (2012–2016). Since January 2023, he has chaired the Data Processing Technologies research group at PUT.

From 2023 to 2024, he led the Interdisciplinary Centre for Artificial Intelligence and Cybersecurity in Poznań. His business career also includes serving as a lecturer for Oracle Poland (1998–2005) and a consultant for Rodan Systems (2002–2003). He currently provides expert IT consultancy for a private hospital. Over the past decade, he has led four R&D projects for industry, including Samsung Electronics, PKO BP (the largest Polish bank), and Kogeneracja Zachód (a company in the energy sector).

At PUT, he currently leads two EU-funded projects focused on the integration of IT and AI in smart agriculture: the Chist-Era Project (2023 Call) and the HORIZON-MSCA-2024 Doctoral Network. His work involves close collaboration with the IBM Software Lab in Kraków. Additionally, from 2013 to 2020, he served as the university lead for the Erasmus Mundus Joint Doctorate Program in Information Technologies for Business Intelligence.

Robert’s academic outreach includes visits to several research institutions: Free University of Bolzano-Bozen (Italy), INRAE (France), Università degli Studi di Milano (Italy), Universitat Politècnica de Catalunya - BarcelonaTech (Spain), Université Lyon 2 (France), Universidad de Costa Rica (Costa Rica), Klagenfurt University (Austria), Loyola University (USA), INRIA Paris-Rocquencourt (France), and Université Paris Dauphine (France).

He is a graduate of the “TOP 500 Innovators” program (2012), run at Stanford University, USA, and a former intern at the BI company TARGIT, USA (2013). He is a senior member of the ACM, a member of the Committee of Informatics of the Polish Academy of Sciences (PAS), and a chair of the Data Engineering research group at the PAS. He is actively involved in several committees at the International Federation of Information Processing (IFIP): (1) the representative of the PAS in the IFIP General Assembly, (2) a country representative in the IFIP Technical Committee TC 2 - Software: Theory and Practice, and (3) the chair of the IFIP Working Group 2.6 - Database.

Webpage: http://www.cs.put.poznan.pl/rwrembel/

DBLP publication record: https://dblp.org/pid/41/3391.html
