SFERA: Archive of Research Products of the University of Ferrara

Chaos Engineering Based Kubernetes Pod Rescheduling Through Deep Sets and Reinforcement Learning

Zaccarini, Mattia; Poltronieri, Filippo; Borsatti, Davide; Cerroni, Walter; Foschini, Luca; Grabarnik, Genady Ya.; Scotece, Domenico; Shwartz, Larisa; Stefanelli, Cesare; Tortonesi, Mauro

2025

Abstract

Kubernetes (K8S) is a widely used orchestration solution that helps manage complex IT applications by providing mechanisms for autoscaling, health checking, cluster formation, and replication, which are essential for deploying and managing large numbers of interconnected microservices. However, these applications may suffer from unexpected faults, which can severely alter the underlying computing infrastructure and lead to service outages, highlighting the need for resilient solutions capable of mitigating the adverse effects of faults. To address this, the TELKA scheduler integrates Chaos Engineering (CE), Reinforcement Learning (RL), and Digital Twin (DT) to reallocate K8S pods evicted due to unexpected faults. Although TELKA showed promising results in reallocating evicted pods, its preliminary implementations suffered from scalability issues, as the RL agent could operate effectively only on scenarios with the same number of nodes seen during training. To overcome this limitation, this paper improves TELKA by incorporating a neural network architecture called Deep Sets (DS), which allows TELKA to generalize to clusters with different numbers of nodes. Experimental results not only demonstrate the validity of the improved TELKA but also show how it can be used to identify good operating conditions.
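The role of Deep Sets in handling a variable number of nodes can be illustrated with a minimal sketch. It is not the TELKA architecture, which this record does not describe: the per-node features, layer sizes, mean pooling, and the name `score_nodes` are all assumptions chosen for illustration, and random weights stand in for trained parameters. The same small network scores each node from its own embedding plus an order-independent summary of the whole cluster, so it accepts any node count.

```python
import numpy as np

rng = np.random.default_rng(0)

FEATURES = 3  # hypothetical per-node features, e.g. free CPU, free memory, pod count
HIDDEN = 8    # hypothetical embedding width

# Randomly initialised weights stand in for trained parameters (assumption).
W_phi = rng.normal(size=(FEATURES, HIDDEN))
W_rho = rng.normal(size=(2 * HIDDEN, 1))


def score_nodes(nodes: np.ndarray) -> np.ndarray:
    """Return one score per node for an (n_nodes, FEATURES) array."""
    # phi: embed every node independently with shared weights.
    embedded = np.tanh(nodes @ W_phi)
    # Pool over the node axis: the summary does not depend on node order or count.
    pooled = embedded.mean(axis=0, keepdims=True)
    context = np.repeat(pooled, len(nodes), axis=0)
    # rho: score each node from its own embedding plus the cluster summary.
    return (np.concatenate([embedded, context], axis=1) @ W_rho).ravel()


small = rng.random((4, FEATURES))  # a 4-node cluster
large = rng.random((9, FEATURES))  # a 9-node cluster, same weights
small_scores = score_nodes(small)
large_scores = score_nodes(large)
```

Because the weight shapes depend only on the feature and embedding sizes, the same parameters apply to both clusters, and reordering the nodes reorders the scores in the same way.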

Year of publication

2025

Keywords

Chaos Engineering
Digital Twin
Kubernetes
Optimization
Reinforcement Learning

Documents in SFERA are protected by copyright, and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11392/2611208

Warning: the data shown have not been validated by the university.
