The Data, Intelligence and Graphs (DIG) team is a group of researchers at Télécom Paris working on the fundamental issues raised in databases, knowledge management, graph mining and artificial intelligence. Research interests cover theoretical foundations of data intelligence and graph systems, practical solutions and applications, as well as cognitive aspects.

The team has an open position for an Assistant / Associate Professor in AI for society, starting December 2024. More details coming soon!

Check the newest version of the YAGO knowledge base!

The DIG team has strong industrial collaborations.


The DIG team is a proud signatory of the TCS4F (Theoretical Computer Scientists for Future) pledge for sustainable research in theoretical computer science. A large majority of DIG members have also signed the No free view? No review! pledge in favor of open access.


Knowledge Bases

A knowledge base is a computer-processable collection of knowledge about the world. We construct and mine such knowledge bases.

Graph Mining

Graphs are a near-universal way to represent data. We are concerned with mining graphs for patterns and properties. Our particular focus is on the scalability of such approaches.

  • scikit-network: a Python package for the analysis of large graphs (clustering, embedding, classification, ranking).

Social Web

The Web has evolved more and more into a social Web: content is produced and shared by users. In the DIG team, we follow and anticipate developments in this area.

  • Community detection: We are investigating means to detect and distinguish social communities on the Web.
  • Social Relations: We investigate the optimal investment in social relations from a theoretical point of view.

Language and Relevance

Computer science is not just about computers. In this area of research, we investigate how humans reason, and what this implies for machines.

  • Simplicity Theory: Simplicity theory seeks to explain the relevance of situations or events to human minds. See http://www.simplicitytheory.science
  • Relevance in natural language: The aim is to reverse-engineer methods for producing meaningful and relevant speech from our understanding of human performance.
  • Communication as social signalling: We apply game theory and social simulation to explore the conditions under which providing valuable (i.e. relevant) information is a profitable strategy.

Machine Learning for Data Streams

We investigate how to do machine learning in real time, contributing to new open source tools:

  • River: a Python library for online Machine Learning
  • MOA: Massive Online Analytics, a framework for mining data streams (in Java)
  • Apache SAMOA: Scalable Advanced Massive Online Analytics, an open source framework for data stream mining on the Hadoop Ecosystem
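To illustrate what learning in real time means in practice, here is a minimal sketch of an online learner that sees each example once, predicts first, and then updates. It is written in plain Python and does not use the API of River, MOA or SAMOA; the class `OnlineLogisticRegression`, the helpers `synthetic_stream` and `prequential_accuracy`, and the toy data are all hypothetical names and assumptions made for this example.

```python
import math
import random


class OnlineLogisticRegression:
    """Hypothetical minimal learner: one gradient step per incoming example."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba_one(self, x):
        # Probability that the label of a single example x is 1.
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
        return 1.0 / (1.0 + math.exp(-z))

    def learn_one(self, x, y):
        # Stochastic gradient step on the log loss for one example.
        error = self.predict_proba_one(x) - y
        self.weights = [
            w - self.learning_rate * error * xi
            for w, xi in zip(self.weights, x)
        ]
        self.bias -= self.learning_rate * error


def synthetic_stream(n, seed=0):
    """Toy stream (an assumption for this sketch): label is 1 when x0 + x1 > 0."""
    rng = random.Random(seed)
    for _ in range(n):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        yield x, (1 if x[0] + x[1] > 0 else 0)


def prequential_accuracy(model, stream):
    """Test-then-train evaluation: predict on each example before learning from it."""
    correct = 0
    total = 0
    for x, y in stream:
        y_pred = 1 if model.predict_proba_one(x) >= 0.5 else 0
        correct += int(y_pred == y)
        total += 1
        model.learn_one(x, y)
    return correct / total if total else 0.0


model = OnlineLogisticRegression(n_features=2)
accuracy = prequential_accuracy(model, synthetic_stream(2000))
print(f"prequential accuracy: {accuracy:.3f}")
```

The model never stores the stream: memory use is constant, and every example serves first as a test case and then as training data, which is the usual way to evaluate learners on data streams.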


Talel Abdessalem, Mehwish Alam, Antoine Amarilli, Albert Bifet, Thomas Bonald,
Jean-Louis Dessalles, Nils Holzenberger, Louis Jachiet, Mauro Sozio, Fabian Suchanek



PhD candidates


  • Bérénice Jaulmes. Advisors: Mehwish Alam, Fabian Suchanek
  • Sri Appakutti. Advisors: Nils Holzenberger, Fabian Suchanek
  • Nicoline Nymand-Andersen. Advisors: Thomas Bonald, Marc Jeanmougin
  • Roman Plaud. Advisors: Thomas Bonald, Mathieu Labeau and Antoine Saillenfest

Former members


An open position of Assistant / Associate Professor is available in the team!

Tuesday, September 24, 2024, 11:45, 4A125

Ambroise Odonnat: Leveraging Ensemble Diversity for Robust Self-Training in the Presence of Sample Selection Bias
Self-training is a well-known approach for semi-supervised learning. It consists of iteratively assigning pseudo-labels to unlabeled data for which the model is confident and treating them as labeled examples. For neural networks, softmax prediction probabilities are often used as a …

Tuesday, July 9, 2024, 11:45, 4A125

Peter Fratrič: Mining behavior from a legal simulation environment: where we are and what lies ahead
This talk presents a methodological framework for the use of simulation-based methods to investigate questions of non-compliance in a legal context. Its aim is to generate observed or previously unobserved instances of non-compliance and use them to improve compliance …

Tuesday, July 2, 2024, 12:15, 4A301

Chadi Helwe: PhD defense practice talk
This thesis focuses on evaluating and improving the reasoning abilities of Smaller Language Models (SLMs) and Large Language Models (LLMs). It explores SLMs’ performance on complex tasks and their limitations with simpler ones. This thesis introduces LogiTorch, a Python library that facilitates the training of models on various reasoning …