Regional IoT Data Integration in a Big-Data Framework
A. Odorizzi, G. Mazzini
2025
Abstract
The Virtualizing Emilia Romagna Air Quality digital twin project integrates heterogeneous regional Internet of Things data, comprising measurements of various kinds from multiple data sources, into a unified Big Data platform. This requires a robust ingestion and normalization pipeline for the MARGHERITA infrastructure. Key challenges stem from the inherent heterogeneity of the data across sources and from the lack of native data integrity enforcement in the data lake environment. This paper presents a scalable ingestion pipeline that uses Apache Kafka for data collection and Apache Spark for transformation, checking, and normalization into a data lake based on Apache Iceberg. Data integrity, crucial given the non-fully-trusted raw sources, is ensured by implementing referential integrity checks directly within the Spark ingestion jobs. This application-level approach was preferred over an alternative solution based on an external database because of its scalability, lower cost, lower latency, and avoidance of a single point of failure. The implemented pipeline provides a robust data foundation for the project's analytics and visualization layers, supporting informed decision-making for the regional Public Administration.
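The application-level referential integrity check mentioned above can be illustrated with a minimal sketch. In the actual Spark jobs this would typically be expressed as a semi-/anti-join of the raw records against a reference (dimension) table; the plain-Python version below shows the same split of incoming records into accepted and rejected sets. All names here (`station_id`, `known_stations`, the field names) are illustrative assumptions, not the project's real schema.

```python
# Sketch of an application-level referential-integrity check, as performed
# inside the ingestion job rather than by an external database.
# Records whose foreign key (here: station_id) has no match in the
# reference table are quarantined instead of being written to the lake.

def split_by_referential_integrity(records, known_stations):
    """Partition raw records into (valid, rejected) by a foreign-key check."""
    valid, rejected = [], []
    for rec in records:
        if rec.get("station_id") in known_stations:
            valid.append(rec)
        else:
            rejected.append(rec)
    return valid, rejected

# Hypothetical reference table of registered monitoring stations.
known_stations = {"ST01", "ST02"}

raw = [
    {"station_id": "ST01", "pm25": 12.3},   # known station -> accepted
    {"station_id": "ST99", "pm25": 40.1},   # unknown station -> rejected
]

valid, rejected = split_by_referential_integrity(raw, known_stations)
```

In Spark the same logic is a `left_semi` join (valid rows) plus a `left_anti` join (rejected rows) against the reference table, which scales with the cluster and avoids the round-trip latency and single point of failure of an external constraint-checking database.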


