
Benchmarking a DNN for aortic valve calcium lesions segmentation on FPGA-based DPU using the Vitis AI toolchain

Sisini, Valentina; Miola, Andrea; Minghini, Giada; Calore, Enrico; Cavallo, Armando Ugo; Schifano, Sebastiano Fabio; Zambelli, Cristian
2026

Abstract

Semantic segmentation assigns a class to every pixel of an image to automatically locate objects in computer vision applications for autonomous vehicles, robotics, agriculture, gaming, and medical imaging. Deep Neural Network models, such as Convolutional Neural Networks (CNNs), are widely used for this purpose. Among the plethora of models, the U-Net is a standard in biomedical imaging. Nowadays, GPUs efficiently perform segmentation and are the reference architectures for running CNNs, while FPGAs compete as alternative platforms for inference, promising higher energy efficiency and lower latency. In this contribution, we evaluate the performance of FPGA-based Deep Processing Units (DPUs) implemented on the AMD Alveo U55C for the inference task, using calcium segmentation in cardiac aortic valve computed tomography scans as a benchmark. We design and implement a U-Net-based application, optimize the hyperparameters to maximize the prediction accuracy, perform pruning to simplify the model, and use different numerical quantizations to exploit the low-precision operations supported by the DPUs and GPUs to reduce the computation time. We describe how to port and deploy the U-Net model on DPUs, and we compare the accuracy, throughput, and energy efficiency achieved with four generations of GPUs and a recent dual 32-core high-end CPU platform. Our results show that a complex DNN like the U-Net can run effectively on DPUs using 8-bit integer computation, achieving a prediction accuracy of approximately 95% in Dice and 91% in IoU scores. These results are comparable to those measured when running the floating-point models on GPUs and CPUs. In terms of computing performance, the DPU achieves an inference latency of approximately 3.5 ms and a throughput of approximately 4.2 kFPS, improving on a 64-core CPU system by approximately 10% in latency and by a factor of 2X in throughput, but it still does not overcome the performance of GPUs at the same numerical precision. In terms of energy efficiency, the improvement is approximately a factor of 6.7X compared to the CPU and 1.6X compared to the P100 GPU, which is manufactured with the same technological process (16 nm).
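
As a companion illustration, the sketch below shows the general shape of post-training INT8 quantization with the Vitis AI PyTorch quantizer (pytorch_nndct), the toolchain named in the title. The stand-in network, the 512x512 single-channel input shape, and the calibration loop are assumptions for illustration, not the paper's actual code.

    # Hedged sketch of the Vitis AI post-training INT8 quantization flow.
    # pytorch_nndct ships inside the Vitis AI Docker image; the tiny model
    # below is a stand-in for the paper's trained U-Net.
    import torch
    import torch.nn as nn
    from pytorch_nndct.apis import torch_quantizer

    model = nn.Sequential(                      # stand-in for the trained U-Net
        nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
        nn.Conv2d(8, 1, 3, padding=1),
    ).eval()
    dummy = torch.randn(1, 1, 512, 512)         # assumed CT-slice input shape

    # Pass 1 -- "calib": run representative inputs to collect activation ranges
    quantizer = torch_quantizer("calib", model, (dummy,))
    qmodel = quantizer.quant_model
    with torch.no_grad():
        for _ in range(8):                      # stand-in calibration batches
            qmodel(torch.randn(1, 1, 512, 512))
    quantizer.export_quant_config()

    # Pass 2 -- "test": evaluate the quantized model, then export an .xmodel
    # that the Vitis AI compiler translates into DPU instructions
    quantizer = torch_quantizer("test", model, (dummy,))
    with torch.no_grad():
        quantizer.quant_model(dummy)
    quantizer.export_xmodel(deploy_check=True)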
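
For concreteness, the Dice and IoU scores reported above are overlap ratios between the predicted and ground-truth masks. Below is a minimal sketch assuming binary NumPy masks; thresholding of the network output and any per-slice averaging are left out.

    import numpy as np

    def dice_score(pred, target, eps=1e-7):
        # Dice coefficient: 2*|A intersect B| / (|A| + |B|)
        pred, target = pred.astype(bool), target.astype(bool)
        inter = np.logical_and(pred, target).sum()
        return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

    def iou_score(pred, target, eps=1e-7):
        # IoU (Jaccard index): |A intersect B| / |A union B|
        pred, target = pred.astype(bool), target.astype(bool)
        inter = np.logical_and(pred, target).sum()
        union = np.logical_or(pred, target).sum()
        return (inter + eps) / (union + eps)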
Files in this record:

Benchmarking_a_DNN_for_aortic_valve_calcium_lesions_segmentation-job_670.pdf
Description: Published full text (open access)
Type: Full text (published version)
License: Creative Commons
Size: 2.31 MB
Format: Adobe PDF
Documents in SFERA are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11392/2598950
Citations
  • PMC: ND
  • Scopus: 0
  • Web of Science: ND