
Regional IoT Data Integration in a Big-Data Framework

A. Odorizzi; G. Mazzini
2025

Abstract

The Virtualizing Emilia Romagna Air Quality digital twin project integrates heterogeneous regional Internet of Things data, including measurements of various kinds from different sources, into a unified Big Data platform. This requires a robust ingestion and normalization pipeline for the MARGHERITA infrastructure. Key challenges stem from inherent data heterogeneity across sources and the lack of native data integrity enforcement in the data lake environment. This paper presents a scalable ingestion pipeline utilizing Apache Kafka for data collection and Apache Spark for transformation, checking, and normalization into a data lake based on Apache Iceberg. Data integrity, crucial given the not fully trusted raw sources, is ensured by implementing referential integrity checks directly within the Spark ingestion jobs. This application-level approach was preferred over an alternative solution based on an external database due to its scalability, lower cost, lower latency, and avoidance of a single point of failure. The implemented pipeline provides a robust data foundation for the project's analytics and visualization layers, supporting informed decision-making for the regional Public Administration.
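The abstract describes referential integrity checks performed at the application level, inside the Spark ingestion jobs, rather than delegated to an external database. A minimal, framework-free sketch of that validation rule is shown below; the field names (station_id, value) and the reference set of known stations are illustrative assumptions, not taken from the paper, and a real Spark job would apply the same rule via a join against the reference table before writing to Iceberg.

```python
# Sketch of an application-level referential integrity check, assuming
# each raw IoT reading carries a station_id that must exist in a
# reference set of known stations before the record is accepted.
# Records failing the check are rejected instead of entering the
# data lake (field names are hypothetical).

def check_referential_integrity(readings, known_station_ids):
    """Split a batch of readings into (valid, rejected) lists.

    A reading is valid only if its station_id appears in the
    reference set -- the equivalent of joining the raw batch
    against the stations table inside a Spark ingestion job.
    """
    valid, rejected = [], []
    for reading in readings:
        if reading.get("station_id") in known_station_ids:
            valid.append(reading)
        else:
            rejected.append(reading)
    return valid, rejected

# Example batch: the second reading references an unknown station.
batch = [
    {"station_id": "FE-01", "value": 42.0},
    {"station_id": "XX-99", "value": 7.5},
]
valid, rejected = check_referential_integrity(batch, {"FE-01", "FE-02"})
```

Keeping this check inside the ingestion job, as the paper argues, avoids a round trip to an external database on every batch and removes that database as a single point of failure.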
Keywords: IoT, Big Data, Ingestion Pipeline, Data Integrity

Documents in SFERA are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11392/2624514
Warning! The displayed data have not been validated by the university.
