Tuesday, April 29, 2025, 11:45, 4A301

Simon Razniewski (TU Dresden)

GPTKB: Comprehensively Materializing Factual LLM Knowledge

LLMs have significantly advanced NLP and AI, and alongside their ability to perform a wide range of procedural tasks, a major success factor is their internalized factual knowledge. Since (Petroni et al., 2019), analyzing this knowledge has gained attention. However, most approaches investigate one question at a time via modest-sized pre-defined samples, introducing an “availability bias” (Tversky and Kahneman, 1973) that prevents the discovery of knowledge (or beliefs) of LLMs beyond the experimenter’s predisposition. To address this challenge, we propose a novel methodology to comprehensively materialize an LLM’s factual knowledge through recursive querying and result consolidation. As a prototype, we employ GPT-4o-mini to construct GPTKB, a large-scale knowledge base (KB) comprising 101 million triples for over 2.9 million entities. This work marks a milestone in two areas: for LLM research, it provides, for the first time, constructive insights into the scope and structure of LLMs’ knowledge (or beliefs), and its strengths and weaknesses; for KB construction, it pioneers new pathways for the long-standing challenge of general-domain KB construction. GPTKB is accessible at https://gptkb.org.
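
To make the idea of "recursive querying and result consolidation" concrete, here is a minimal sketch of eliciting triples from an LLM and expanding to newly discovered entities. It assumes the OpenAI Python SDK and GPT-4o-mini (the model named in the abstract); the prompt, the line-based parsing, and the seed entity are illustrative choices, not the authors' actual pipeline, and the consolidation step (entity canonicalization) is omitted.

```python
# Minimal sketch of recursive LLM knowledge elicitation, in the spirit of GPTKB.
# Assumes the OpenAI Python SDK; prompt, parsing, and seed entity are illustrative.
from collections import deque
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def elicit_triples(entity: str) -> list[tuple[str, str, str]]:
    """Ask the model for facts about one entity, one 'subject; predicate; object' per line."""
    prompt = (
        f"List facts you know about '{entity}' as lines of the form "
        "subject; predicate; object. Output nothing else."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    triples = []
    for line in response.choices[0].message.content.splitlines():
        parts = [p.strip() for p in line.split(";")]
        if len(parts) == 3:
            triples.append(tuple(parts))
    return triples

def crawl(seed: str, max_entities: int = 100) -> set[tuple[str, str, str]]:
    """Breadth-first expansion: objects of extracted triples become new subjects to query."""
    seen, queue, kb = {seed}, deque([seed]), set()
    while queue and len(seen) < max_entities:
        entity = queue.popleft()
        for s, p, o in elicit_triples(entity):
            kb.add((s, p, o))
            if o not in seen:   # result consolidation (deduplication, canonicalization)
                seen.add(o)     # is omitted here for brevity
                queue.append(o)
    return kb

if __name__ == "__main__":
    for triple in sorted(crawl("Vannevar Bush", max_entities=5)):
        print(triple)
```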

Tuesday, April 8, 2025, 11:45, 4A125

Pratik Karmakar

ProvSQL: Provenance and Probabilistic Querying in Uncertain Databases

Probabilistic databases provide a powerful framework for managing and querying uncertain data, enabling principled reasoning under uncertainty. ProvSQL extends PostgreSQL to support provenance tracking and probability computation in probabilistic databases, leveraging provenance circuits to efficiently compute probabilities and Shapley-based data valuations. In this talk, we introduce ProvSQL, demonstrate its capabilities, and explore a key use case: content-based image retrieval from the COCO dataset. We show how probabilistic query evaluation and data valuation techniques enhance explainability and trust in AI-driven decision-making.
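
As background for the talk, the sketch below illustrates the kind of computation that provenance enables: the probability of a query answer is derived from the Boolean provenance expression over independent input tuples. It deliberately does not use ProvSQL's SQL interface; the tuple names, probabilities, and brute-force possible-worlds evaluation are toy assumptions (real systems evaluate the provenance circuit far more efficiently).

```python
# Toy illustration of probabilistic query evaluation from a provenance expression.
# Not ProvSQL's actual API; tuple names and probabilities are invented.
from itertools import product

# Independent probability that each input tuple is present in the database.
tuple_prob = {"img_cat_1": 0.9, "img_cat_2": 0.4, "label_cat": 0.8}

# Provenance of a Boolean query answer: (img_cat_1 OR img_cat_2) AND label_cat,
# written as a small expression tree.
provenance = ("and", ("or", ("var", "img_cat_1"), ("var", "img_cat_2")),
                     ("var", "label_cat"))

def evaluate(expr, world):
    """Evaluate the provenance expression in one possible world (dict: tuple -> bool)."""
    op = expr[0]
    if op == "var":
        return world[expr[1]]
    values = [evaluate(child, world) for child in expr[1:]]
    return all(values) if op == "and" else any(values)

def probability(expr, probs):
    """Sum the probabilities of all possible worlds in which the answer holds.
    Exponential in the number of tuples; circuits make this tractable in practice."""
    names = list(probs)
    total = 0.0
    for bits in product([False, True], repeat=len(names)):
        world = dict(zip(names, bits))
        weight = 1.0
        for name, present in world.items():
            weight *= probs[name] if present else 1.0 - probs[name]
        if evaluate(expr, world):
            total += weight
    return total

print(f"P(answer) = {probability(provenance, tuple_prob):.3f}")  # 0.752
```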

Tuesday, March 25, 2025, 11:45, 4A301

Gaël Varoquaux (INRIA)

Tabular foundation models: priors for numbers and strings

Deep learning typically does not outperform tree-based models on tabular data. This is often explained by the small size of such datasets. For images, sound, and text, the solution has been pretrained models, leading to foundation models that are adapted and reused for many tasks. I will discuss the challenges of bringing these ideas to tabular learning, and the progress that we have made in building priors for tables, i.e., columns of different natures, mixing numbers and strings.
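
For context, the snippet below shows the conventional per-column treatment of a mixed numeric/string table with scikit-learn; the talk is about replacing such hand-built preprocessing with learned priors and pretrained models. The column names and data are invented for the example, and this is not the approach presented in the talk.

```python
# A conventional baseline for mixed numeric/string tables (not the talk's method).
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "age": [34, 51, 29],                          # numeric column
    "salary": [38000, 52000, 41000],              # numeric column
    "job": ["nurse", "data scientist", "nurse"],  # string column
})

encoder = ColumnTransformer([
    ("numbers", StandardScaler(), ["age", "salary"]),
    ("strings", OneHotEncoder(handle_unknown="ignore"), ["job"]),
])

X = encoder.fit_transform(df)
print(X.shape)  # (3, 4): two scaled numbers + two one-hot job categories
```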

Tuesday, March 18, 2025, 11:45, 4A301

Pierre Monnin (INRIA)

Neuro-symbolic approaches for the knowledge graph lifecycle

In the Web of Data, an increasing number of knowledge graphs (KGs) are concurrently published, edited, and accessed by human and software agents. Their wide adoption makes the tasks of their lifecycle essential: construction, refinement (e.g., matching, link prediction), mining, and usage to support applications (e.g., explainable AI, recommender systems). However, all these tasks require coping with the inherent heterogeneity of KGs, e.g., in terms of granularities, vocabularies, and completeness. In addition, scalability issues arise from their increasing size and combinatorial nature. In my talk, I will present my research on neuro-symbolic approaches for the KG lifecycle, intertwining domain knowledge from ontologies, deductive reasoning, analogical reasoning, and machine learning models. Throughout my presentation, I will show that such approaches enhance models by improving their semantic awareness, frugality, and the semantic interpretability of their latent representation space.

Tuesday, March 4, 2025, 11:45, 4A301

Ken Satoh (National Institute of Informatics, Japan)

Translating German traffic cases into logical rules

This is joint work with May Myo Zin at my center and Georg Borges at Saarland University. In this talk, I will report on our work on extracting normative sentences from German traffic cases and translating them into logical rules. The development of autonomous vehicles (AVs) requires a comprehensive understanding of both explicit and implicit traffic rules to ensure legal compliance and safety. While explicit traffic laws are well-defined in statutes and regulations, implicit rules derived from judicial interpretations and case law are more nuanced and challenging to extract. This research first investigates the potential of Large Language Models (LLMs), particularly GPT-4o, in automating the extraction of implicit traffic normative sentences from judicial decisions. We then investigate how to translate these normative sentences into a logical form. We explore the use of LLMs to automate the translation of traffic rules into PROLOG, a declarative programming language well suited to encoding logical rules and relationships. The proposed methodology consists of three key phases: extracting traffic rules from diverse textual sources, structuring them into Logical English (LE) for clarity and consistency, and translating them into PROLOG representations using advanced natural language processing (NLP) techniques, including in-context learning and fine-tuning. The experimental results demonstrate the effectiveness of LLMs in automating this process, achieving high accuracy in translation.
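
The sketch below illustrates the in-context-learning translation step of such a pipeline: a normative sentence is paraphrased into Logical English and then into a Prolog rule by prompting an LLM with a worked example. It assumes the OpenAI Python SDK and GPT-4o (the model named in the abstract); the example sentence, the few-shot demonstration, and the target Prolog clause are invented for illustration and are not the authors' actual prompts or rule base.

```python
# Sketch of the in-context-learning step: normative sentence -> Logical English -> Prolog.
# The few-shot example and target clause are invented; the real prompts are the authors'.
from openai import OpenAI

client = OpenAI()

FEW_SHOT = """\
Normative sentence: A driver must reduce speed when approaching a pedestrian crossing.
Logical English: a driver must reduce the speed if the driver approaches a pedestrian crossing.
Prolog: must_reduce_speed(Driver) :- approaches(Driver, pedestrian_crossing).
"""

def to_prolog(normative_sentence: str) -> str:
    """Translate one normative sentence via a few-shot prompt."""
    prompt = (
        "Translate the normative sentence into Logical English and then into a "
        "Prolog rule, following the example.\n\n"
        f"{FEW_SHOT}\nNormative sentence: {normative_sentence}\n"
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(to_prolog("A driver turning right must yield to cyclists going straight."))
```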

Tuesday, February 4, 2025, 11:45, 4A125

Fabian Suchanek

YAGO

In this talk I will present the newest version of YAGO, the knowledge base that we are building with several members of the DIG team. I will show why we build it, how we build it, and how it can be used. This will also be an occasion for me to get your feedback on our work.