Paper accepted to LWDA 2023

Pythagoras: Semantic Type Detection of Numerical Data Using Graph Neural Networks

Authors: Sven Langenecker, Christoph Sturm, Christian Schalles, Carsten Binnig

Detecting the semantic types of table columns is a crucial task for enabling dataset discovery in data lakes. However, prior semantic type detection approaches have focused primarily on non-numerical data, even though numerical data play an essential role in many enterprise data lakes. As a result, existing models are typically inadequate when applied to data lakes that contain a high proportion of numerical data. In this paper, we introduce Pythagoras, our new learned semantic type detection approach designed specifically to support numerical data alongside non-numerical data. Pythagoras uses a graph neural network based on a new graph representation of tables to predict the semantic types of numerical data with high accuracy. In our initial experiments, we achieve F1-scores of 0.829 (support-weighted) and 0.790 (macro), significantly exceeding the state-of-the-art performance.
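
The two F1 figures reported above are averaged differently, and a small worked example may help in reading them. The sketch below is not taken from the paper's code; it uses the standard definitions of per-class F1, macro averaging (unweighted mean over classes) and support-weighted averaging (mean weighted by the number of true instances per class). The semantic type labels and predictions are hypothetical toy data, chosen only to show why the two averages can differ when classes are imbalanced.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: (f1, support)} for every label occurring in y_true."""
    support = Counter(y_true)
    scores = {}
    for label in sorted(support):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
        f1 = 2 * tp / denom if denom else 0.0
        scores[label] = (f1, support[label])
    return scores


def macro_f1(scores):
    """Unweighted mean of per-class F1: every class counts equally."""
    return sum(f1 for f1, _ in scores.values()) / len(scores)


def weighted_f1(scores):
    """Mean of per-class F1 weighted by class support (true instance count)."""
    total = sum(s for _, s in scores.values())
    return sum(f1 * s for f1, s in scores.values()) / total


# Hypothetical toy data: three semantic types with imbalanced support.
y_true = ["price"] * 6 + ["age"] * 3 + ["year"] * 1
y_pred = ["price"] * 6 + ["age", "age", "year"] + ["age"]

scores = per_class_f1(y_true, y_pred)
macro = macro_f1(scores)
weighted = weighted_f1(scores)

print(f"macro F1:    {macro:.3f}")     # 0.556
print(f"weighted F1: {weighted:.3f}")  # 0.800
```

In this toy case the rare type is always misclassified, which lowers the macro average much more than the support-weighted one. A macro score below the support-weighted score, as in the figures above, is consistent with weaker performance on less frequent types, although the abstract itself does not state the cause.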

https://ceur-ws.org/Vol-3630/LWDA2023-paper13.pdf