Chaos Engineering Based Kubernetes Pod Rescheduling Through Deep Sets and Reinforcement Learning
Zaccarini, Mattia;Poltronieri, Filippo;Stefanelli, Cesare;Tortonesi, Mauro
2025
Abstract
Kubernetes (K8S) is a widely used orchestration solution that helps manage complex IT applications by providing mechanisms for autoscaling, health checking, cluster formation, and replication, which are essential to deploy and manage a multitude of connected microservices. However, these applications may suffer from unexpected faults, which can severely alter the underlying computing infrastructure and lead to service outages, highlighting the need for resilient solutions capable of mitigating the adverse effects of faults. To address this, the TELKA scheduler integrates Chaos Engineering (CE), Reinforcement Learning (RL), and Digital Twin (DT) to reallocate K8S pods evicted due to unexpected faults. While TELKA showed promising results in reallocating evicted pods, its preliminary implementations suffered from scalability issues, as the RL agent could operate effectively only on scenarios with the same number of nodes seen during training. To overcome this limitation, this paper improves TELKA by incorporating a neural network architecture called Deep Sets (DS), which can generalize the operation of TELKA to different numbers of nodes. Experimental results not only demonstrate the validity of the improved TELKA but also show how it can be used to identify good operating conditions.
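As an illustration of why a Deep Sets style network is not tied to a fixed number of nodes, the following minimal sketch applies one shared encoder to each node, sums the results, and passes the sum to a second network. It is not the TELKA implementation: the two-value node features (hypothetical CPU and memory usage), the layer sizes, the random untrained weights, and all function names are assumptions made for this example only.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Build a random, untrained dense layer as (weights, biases)."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [rng.uniform(-1, 1) for _ in range(n_out)]
    return weights, biases


def dense(layer, x):
    """Apply a dense layer followed by a tanh activation."""
    weights, biases = layer
    return [
        math.tanh(sum(w * v for w, v in zip(row, x)) + b)
        for row, b in zip(weights, biases)
    ]


def deep_set_value(nodes, phi, rho):
    """Compute rho(sum_i phi(node_i)) for a set of node feature vectors."""
    embedded = [dense(phi, node) for node in nodes]   # same phi for every node
    pooled = [sum(column) for column in zip(*embedded)]  # order-independent sum
    return dense(rho, pooled)[0]


rng = random.Random(0)
phi = make_layer(2, 4, rng)  # per-node encoder: 2 features -> 4 hidden units
rho = make_layer(4, 1, rng)  # set-level decoder: 4 pooled units -> 1 value

# Hypothetical (cpu_usage, memory_usage) pairs for two cluster sizes.
cluster_3 = [[0.2, 0.5], [0.7, 0.1], [0.4, 0.9]]
cluster_5 = cluster_3 + [[0.3, 0.3], [0.8, 0.6]]

value_3 = deep_set_value(cluster_3, phi, rho)
value_3_shuffled = deep_set_value(list(reversed(cluster_3)), phi, rho)
value_5 = deep_set_value(cluster_5, phi, rho)
```

Because the sum pools over however many nodes are present, the same `phi` and `rho` weights accept both the three-node and the five-node cluster, and reordering the nodes leaves the output unchanged up to floating-point rounding.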


