Poster accepted at the DHBW AI Transfer Congress 2023

Steered Training Data Generation for Learned Semantic Type Detection

Authors: Sven Langenecker, Christoph Sturm, Christian Schalles, Carsten Binnig

The poster introduces STEER, which adapts learned semantic type extraction approaches to a new, unseen data lake. STEER provides a data programming framework for semantic labeling that generates new labeled training data with minimal overhead. At its core is a novel training data generation procedure called Steered-Labeling, which can generate high-quality training data for both non-numeric and numeric columns. With this generated training data, STEER can fine-tune existing learned semantic type extraction models. We evaluate our approach on four different data lakes and show that it significantly improves the performance of two different types of learned models across all data lakes.
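
To make the data programming idea more concrete, here is a minimal Python sketch of generic labeling functions that vote on the semantic type of a column. It is not STEER's Steered-Labeling procedure, which the abstract does not detail. The function names, the type labels (email, year, age), the thresholds, and the majority-vote rule are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional, Tuple

# A labeling function inspects a column's values and returns a semantic
# type label, or None to abstain. All rules below are illustrative
# assumptions, not part of STEER.
LabelingFunction = Callable[[list], Optional[str]]


def _all_numeric(values: list) -> bool:
    """True if the column is non-empty and holds only ints/floats (no bools)."""
    return bool(values) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )


def lf_email(values: list) -> Optional[str]:
    """Non-numeric rule: most values contain an '@' character."""
    if not values:
        return None
    hits = sum(1 for v in values if isinstance(v, str) and "@" in v)
    return "email" if hits / len(values) >= 0.8 else None


def lf_year(values: list) -> Optional[str]:
    """Numeric rule: all values are whole numbers in a plausible year range."""
    if not _all_numeric(values):
        return None
    ok = all(float(v).is_integer() and 1900 <= v <= 2100 for v in values)
    return "year" if ok else None


def lf_age(values: list) -> Optional[str]:
    """Numeric rule: all values are whole numbers in a plausible age range."""
    if not _all_numeric(values):
        return None
    ok = all(float(v).is_integer() and 0 <= v <= 120 for v in values)
    return "age" if ok else None


def label_column(values: list, lfs: List[LabelingFunction]) -> Optional[str]:
    """Combine labeling-function votes by simple majority; None on no vote or a tie."""
    votes = [label for label in (lf(values) for lf in lfs) if label is not None]
    if not votes:
        return None
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # ambiguous: leave the column unlabeled
    return ranked[0][0]


def generate_training_data(
    columns: Dict[str, list], lfs: List[LabelingFunction]
) -> List[Tuple[str, str]]:
    """Return (column name, semantic type) pairs for every column that got a label."""
    labeled = []
    for name, values in columns.items():
        label = label_column(values, lfs)
        if label is not None:
            labeled.append((name, label))
    return labeled


LFS = [lf_email, lf_year, lf_age]
```

In a setting like the one the abstract describes, the labeled pairs produced this way would serve as additional training data for fine-tuning an existing semantic type detection model. Columns on which every rule abstains simply stay unlabeled.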

https://www.dhbw.de/fileadmin/user_upload/Dokumente/Forschung/AI_Transfer_Congress/Proceedings_DHBW_AITC_2023.pdf