Chaos Engineering Based Kubernetes Pod Rescheduling Through Deep Sets and Reinforcement Learning

Zaccarini, Mattia;Poltronieri, Filippo;Stefanelli, Cesare;Tortonesi, Mauro
2025

Abstract

Kubernetes (K8S) is a widely used orchestration solution that helps manage complex IT applications by providing mechanisms for autoscaling, health checking, cluster formation, and replication, which are essential for deploying and managing large numbers of interconnected microservices. However, these applications may suffer when unexpected faults severely alter the underlying computing infrastructure and lead to service outages, which highlights the need for resilient solutions capable of mitigating the adverse effects of faults. To address this, the TELKA scheduler integrates Chaos Engineering (CE), Reinforcement Learning (RL), and Digital Twin (DT) to reallocate K8S pods evicted due to unexpected faults. While TELKA showed promising results in reallocating evicted pods, its preliminary implementations suffered from scalability issues, as the RL agent could operate effectively only on scenarios with the same number of nodes seen during training. To overcome this limitation, this paper improves TELKA by incorporating a neural network architecture called Deep Sets (DS), which allows TELKA to generalize to different numbers of nodes. Experimental results not only demonstrate the validity of the improved TELKA but also show how it can be used to identify good operating conditions.
Chaos Engineering
Digital Twin
Kubernetes
Optimization
Reinforcement Learning
Use this identifier to cite or link to this document: https://hdl.handle.net/11392/2611208