Nils Holzenberger – Page 2 – Data, Intelligence & Graphs Team

Tuesday, May 27, 2025, 11:45, 4A301

Maximilian Egger

Robust Knowledge Graph Cleaning

Data quality is needed to properly and reliably use the information represented in the dataset. The increasing volume of data renders data preparation and cleaning increasingly difficult. Additionally, more diverse types of data structures for databases, like graphs, get used and need to be handled differently. This leads to the necessity of robust methods to increase data integrity, scalable approaches for finding and fixing errors, and local-oriented algorithms that can be used to pinpoint attention where needed.

This talk provides an overview of my past, present, and future projects on knowledge graphs, exploring their potential for improving data cleanliness and robustness.

Tuesday, May 13, 2025, 11:45, 4A301

Nikola Simidjievski

Synthesis & Augmentation of Tabular Data In the Age of Foundation Models

Foundation models – large pre-trained performant models – have shown remarkable success in applications that predominately focus on vision, language, and sound data. On the other hand, tabular data – one of the most prevalent data modalities in many critical domains of business, science, and healthcare – has seen limited benefits from these advances. Tabular data poses unique challenges that relate to heterogeneity, dimensionality, and scarcity as well as lack of explicit symmetries, implicit structures and incomplete prior knowledge — all of which have limiting effects on how we construct, train and apply/transfer large models for tabular data.

Data synthesis is one of the remedies for overcoming some of these challenges: It can help improve model performance in data-scarce but critical applications, but it can also be utilized as a data augmentation mechanism for training more robust models. Although previous research has sought to adapt the successes of generative modeling of homogeneous modalities to tabular modalities, defining an effective generator for tabular data remains an open problem. In this talk, I will present several novel data-centric approaches for data synthesis that focus on tabular data. Our key innovation is transforming recent pre-trained tabular classifiers into data generators and leveraging their learned information in the input and manifold space. These methods are fast, require no additional training, and can be applied to any downstream predictive model. They consistently improve performance, especially on small datasets where training well-performing models is hard. Consequently, we also uncover several properties and benefits that can help in the way how we design robust and performant general-purpose tabular foundation models.

Tuesday, April 29, 2025, 11:45, 4A301

Simon Razniewski (TU Dresden)

GPTKB: Comprehensively Materializing Factual LLM Knowledge

LLMs have majorly advanced NLP and AI, and next to their ability to perform a wide range of procedural tasks, a major success factor is their internalized factual knowledge. Since (Petroni et al., 2019), analyzing this knowledge has gained attention. However, most approaches investigate one question at a time via modest-sized pre-defined samples, introducing an “availability bias” (Tversky and Kahneman, 1973) that prevents the discovery of knowledge (or beliefs) of LLMs beyond the experimenter’s predisposition. To address this challenge, we propose a novel methodology to comprehensively materialize an LLM’s factual knowledge through recursive querying and result consolidation. As a prototype, we employ GPT-4o-mini to construct GPTKB, a large-scale knowledge base (KB) comprising 101 million triples for over 2.9 million entities. This work marks a milestone in two areas: For LLM research, for the first time, it provides constructive insights into the scope and structure of LLMs’ knowledge (or beliefs), and its strengths and weaknesses. For KB construction, it pioneers new pathways for the long-standing challenge of general-domain KB construction. GPTKB is accessible at https://gptkb.org.

Tuesday, April 8, 2025, 11:45, 4A125

Pratik Karmakar

ProvSQL: Provenance and Probabilistic Querying in Uncertain Databases

Probabilistic databases provide a powerful framework for managing and querying uncertain data, enabling principled reasoning under uncertainty. ProvSQL extends PostgreSQL to support provenance tracking and probability computation in probabilistic databases, leveraging provenance circuits to efficiently compute probabilities and Shapley-based data valuations. In this talk, we introduce ProvSQL, demonstrate its capabilities, and explore a key use case—content based image retrieval from the COCO dataset. We show how probabilistic query evaluation and data valuation techniques enhance explainability and trust in AI-driven decision-making.

Tuesday, March 25, 2025, 11:45, 4A301

Gaël Varoquaux (INRIA)

Tabular foundation models: priors for numbers and strings

Deep-learning typically does not outperform tree-based models on tabular data. Often this may be explained by the small size of such datasets. For images, sound, text, the solution has be pretrained models, leading to foundation models, adapted and reused for many tasks. I will discuss the challenges to bring these ideas to tabular learning, and the progress that we have made, building priors for tables, ie columns of different natures, with numbers and strings.

Tuesday, March 18, 2025, 11:45, 4A301

Pierre Monnin (INRIA)

Neuro-symbolic approaches for the knowledge graph lifecycle

In the Web of Data, an increasing number of knowledge graphs (KGs) are concurrently published, edited, and accessed by human and software agents. Their wide adoption makes essential the tasks of their lifecycle: construction, refinement (e.g., matching, link prediction), mining, and usage to support applications (e.g., explainable AI, recommender systems). However, all these tasks require facing the inherent heterogeneity of KGs, e.g., in terms of granularities, vocabularies, and completeness. Besides, scalability issues arise due to their increasing size and combinatorial nature. In my talk, I will present my research on neuro-symbolic approaches for the KG lifecycle, intertwining domain knowledge from ontologies, deductive reasoning, analogical reasoning, and machine learning models. Throughout my presentation, I will show that such approaches enhance models by improving their semantic awareness, frugality, and the semantic interpretability of their latent representation space.

Tuesday, March 4, 2025, 11:45, 4A301

Ken Satoh (National Institute of Informatics, Japan)

Translating German traffic cases into logical rules

This is a joint work with May Myo Zin at my center and Georg Borgess at University of Saarland. In this talk, I will report the work on extracting normative sentences from German traffic cases and translating them into logical rules. The development of autonomous vehicles (AVs) requires a comprehensive understanding of both explicit and implicit traffic rules to ensure legal compliance and safety. While explicit traffic laws are well-defined in statutes and regulations, implicit rules derived from judicial interpretations and case law are more nuanced and challenging to extract. This research firstly investigates the potential of Large Language Models (LLMs), particularly GPT-4o, in automating the extraction of implicit traffic normative sentences from judicial decisions. Then we investigate how to translate these normative sentences into a logical form. We explore to use large language models (LLMs) to automate the translation of traffic rules into PROLOG, a declarative programming language ideal for encoding logical rules and relationships. The proposed methodology consists of three key phases: extracting traffic rules from diverse textual sources, structuring them into Logical English (LE) for clarity and consistency, and translating them into PROLOG representations using advanced natural language processing (NLP) techniques, including in-context learning and fine-tuning. The experimental results demonstrate the effectiveness of LLMs in automating this process, achieving high accuracy in translation.

Tuesday, February 4, 2025, 11:45, 4A125

Fabian Suchanek

YAGO

In this talk I will present the newest version of YAGO, the knowledge base that we are building with several members of the DIG team. I will show why we build it, how we build it, and how it can be used. This will also be an occasion for me to get your feedback on our work.

Tuesday, January 21, 2025, 11:45, 4A301

Simon Delarue

Learning on graphs: from algorithms to socio-technical analyses on AI

This thesis addresses the dual challenge of advancing Artificial Intelligence (AI) methods while critically assessing their societal impact. With AI technologies now embedded in high-stake decision sectors like healthcare and justice, their growing influence demands thorough examination, reflected in emerging international regulations such as the AI Act in Europe. To address these challenges, this work leverages attributed-graph based methods and advocates for a shift from performance-focused AI models to approaches that also prioritise scalability, simplicity, and explainability.

The first part of this thesis develops a toolkit of attributed graph-based methods and algorithms aimed at enhancing AI learning techniques. It includes a software contribution that leverages the sparsity of complex networks to reduce computational costs. Additionally, it introduces non-neural graph models for node classification and link predictions tasks, showing how these methods can outperform advanced neural networks while being more computationally efficient. Lastly, it presents a novel pattern mining algorithm that generates concise, human-readable summaries of large networks. Together, these contributions highlight the potential of these approaches to provide efficient and interpretable solutions to AI’s technical challenges.

The second part adopts an interdisciplinary approach to study AI as a socio-technical system. By framing AI as an ecosystem influenced by various stakeholders and societal concerns, it uses graph-based models to analyse interactions and tensions related to explainability, ethics, and environmental impact. A user study explores the influence of graph-based explanations on user perceptions of AI recommendations, while the building and analysis of a corpus of AI ethics charters and manifestos quantifies the roles of key actors in AI governance. A final study reveals that environmental concerns in AI are primarily framed technically, highlighting the need for a broader approach to the ecological implications of digitalisation.

Tuesday, December 10, 2024, 11:45, 4A125

Lanfang Kong

Explainable algorithms for anomaly detection and time series forecasting

Artificial intelligence has shown dominant performance across diverse domains, including critical ones such as medicine, finance, justice and so on. As a result, the explainability of black-box models is becoming more and more important. We focus on two specific applications: anomaly detection and time series forecasting, and present XTREK and ADAPATCH, respectively.

XTREK is an unsupervised tree-based approach for explainable anomaly detection, which maximizes Kendall’s tau between the anomaly scores of the source anomaly detector and those of XTREK. The tree produced by our algorithm is relatively small in size, thereby boasting the renowned off-the-shelf transparency and explainability of tree-based approaches. Moreover, its explanations are sample-based. In particular, the anomaly scores are computed to be the inverse of the size of the corresponding leaf, thereby providing meaningful explanations when comparing examples with different anomaly scores. XTREK can also be used as an in-model approach, which is capable of providing concise explanations for its own decisions. Moreover, we propose efficient computation of Kendall’s tau coefficients when determining the best split at each node of the regression tree. We show how this can be computed incrementally, thereby making the running time of our algorithm almost linear (up to a logarithmic factor) in the size of the input.

ADAPATCH is an adaptive patch-based saliency map method for explainable time series forecasting, which provides local, post-hoc visualization explanations. The approach highlights those patches which would result in worse predictions when hidden to the black-box algorithm. With a differential encoding module in the mask of input, the optimization can be done by gradient-based perturbation. ADAPATCH does not need the patch parameters upfront, such as the length or the stride, as all patch-based approaches need. In fact, it learns those parameters from the data, thereby effectively adapting to different settings and application scenarios. By enforcing an upper bound on the maximum number of patches, we make sure that the patch-level explanations provided by our algorithm can be easily interpreted by humans, as opposed to explanations consisting of a large number of single time points. Moreover, ADAPATCH requires a much smaller number of parameters, typically linear in the number of patches as opposed to linear in the number of time steps. This makes our approach more efficient and easy to train.

Both methods are model-agnostic, which means the architecture of the black-box model can be hidden from the users. They provide accurate and simple explanations, as validated by extensive experiments.