Paper accepted to EDBT 2024

Pythagoras: Semantic Type Detection of Numerical Data in Enterprise Data Lakes

Authors: Sven Langenecker, Christoph Sturm, Christian Schalles, Carsten Binnig

Detecting the semantic types of table columns is a crucial task for enabling dataset discovery in data lakes. However, prior semantic type detection approaches have focused primarily on non-numerical data, even though numerical data play an essential role in many real-world enterprise data lakes. As a result, existing models often perform inadequately when applied to data lakes that contain a high proportion of numerical data. In this paper, we introduce Pythagoras, a new learned semantic type detection approach specifically designed to support numerical as well as non-numerical data. Pythagoras uses a graph neural network (GNN) in combination with a novel graph representation of tables to predict the semantic types of numerical data with high accuracy. In our experiments, we compare Pythagoras against five state-of-the-art approaches on two different datasets and show that our model significantly outperforms these baselines on numerical data. Compared to the best existing approach, we achieve F1-score increases of around 22%, which sets a new benchmark.

https://dx.doi.org/10.48786/edbt.2024.62