
Benchmarking a DNN for aortic valve calcium lesions segmentation on FPGA-based DPU using the Vitis AI toolchain

Sisini, Valentina; Miola, Andrea; Minghini, Giada; Calore, Enrico; Cavallo, Armando Ugo; Schifano, Sebastiano Fabio; Zambelli, Cristian
2026

Abstract

Semantic segmentation assigns a class to every pixel of an image to automatically locate objects in computer vision applications for autonomous vehicles, robotics, agriculture, gaming, and medical imaging. Deep Neural Network models, such as Convolutional Neural Networks (CNNs), are widely used for this purpose. Among the plethora of models, the U-Net is a standard in biomedical imaging. Nowadays, GPUs efficiently perform segmentation and are the reference architectures for running CNNs, while FPGAs compete as alternative platforms for inference, promising higher energy efficiency and lower latency. In this contribution, we evaluate the performance of FPGA-based Deep Processing Units (DPUs) implemented on the AMD Alveo U55C for the inference task, using calcium segmentation in cardiac aortic valve computed tomography scans as a benchmark. We design and implement a U-Net-based application, optimize the hyperparameters to maximize the prediction accuracy, perform pruning to simplify the model, and use different numerical quantizations to exploit the low-precision operations supported by the DPUs and GPUs to reduce the computation time. We describe how to port and deploy the U-Net model on DPUs, and we compare the accuracy, throughput, and energy efficiency achieved with four generations of GPUs and a recent dual 32-core high-end CPU platform. Our results show that a complex DNN like the U-Net can run effectively on DPUs using 8-bit integer computation, achieving a prediction accuracy of approximately 95% in Dice and 91% in IoU scores. These results are comparable to those measured when running the floating-point models on GPUs and CPUs. In terms of computing performance, the DPU achieves an inference latency of approximately 3.5 ms and a throughput of approximately 4.2 kFPS, improving on a 64-core CPU system by approximately 10% in latency and by a factor of 2X in throughput, but it still does not overcome the performance of GPUs at the same numerical precision. In terms of energy efficiency, the improvement is approximately a factor of 6.7X compared to the CPU and 1.6X compared to the P100 GPU, which is manufactured with the same technological process (16 nm).
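
As a companion illustration, the sketch below shows the general shape of post-training INT8 quantization with the Vitis AI PyTorch quantizer (pytorch_nndct), the toolchain named in the title. The stand-in network, the 512x512 single-channel input shape, and the calibration loop are assumptions for illustration, not the paper's actual code.

    # Hedged sketch of the Vitis AI post-training INT8 quantization flow.
    # pytorch_nndct ships inside the Vitis AI Docker image; the tiny model
    # below is a stand-in for the paper's trained U-Net.
    import torch
    import torch.nn as nn
    from pytorch_nndct.apis import torch_quantizer

    model = nn.Sequential(                      # stand-in for the trained U-Net
        nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
        nn.Conv2d(8, 1, 3, padding=1),
    ).eval()
    dummy = torch.randn(1, 1, 512, 512)         # assumed CT-slice input shape

    # Pass 1 -- "calib": run representative inputs to collect activation ranges
    quantizer = torch_quantizer("calib", model, (dummy,))
    qmodel = quantizer.quant_model
    with torch.no_grad():
        for _ in range(8):                      # stand-in calibration batches
            qmodel(torch.randn(1, 1, 512, 512))
    quantizer.export_quant_config()

    # Pass 2 -- "test": evaluate the quantized model, then export an .xmodel
    # that the Vitis AI compiler translates into DPU instructions
    quantizer = torch_quantizer("test", model, (dummy,))
    with torch.no_grad():
        quantizer.quant_model(dummy)
    quantizer.export_xmodel(deploy_check=True)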
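
For concreteness, the Dice and IoU scores reported above are overlap ratios between the predicted and ground-truth masks. Below is a minimal sketch assuming binary NumPy masks; thresholding of the network output and any per-slice averaging are left out.

    import numpy as np

    def dice_score(pred, target, eps=1e-7):
        # Dice coefficient: 2*|A intersect B| / (|A| + |B|)
        pred, target = pred.astype(bool), target.astype(bool)
        inter = np.logical_and(pred, target).sum()
        return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

    def iou_score(pred, target, eps=1e-7):
        # IoU (Jaccard index): |A intersect B| / |A union B|
        pred, target = pred.astype(bool), target.astype(bool)
        inter = np.logical_and(pred, target).sum()
        union = np.logical_or(pred, target).sum()
        return (inter + eps) / (union + eps)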
Files in this record:

Benchmarking_a_DNN_for_aortic_valve_calcium_lesions_segmentation-job_670.pdf
Description: Published full text (open access)
Type: Full text (published version)
License: Creative Commons
Size: 2.31 MB
Format: Adobe PDF
Documents in SFERA are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11392/2598950
Citations
  • PMC: ND
  • Scopus: 0
  • Web of Science: ND