A reinforcement learning framework for self-healing fault recovery in intent-based SDNs

Makiyah, Estabraq H., Rasool, M. N. and Rawdhan, F. A. (2026) A reinforcement learning framework for self-healing fault recovery in intent-based SDNs. Discover Networks, 2 (1). p. 14. ISSN 3004-9792

Text
44354_2026_Article_28.pdf - Published Version
Available under License Creative Commons Attribution.

Access this via: https://link.springer.com/article/10.1007/s44354-0...

Abstract

This paper presents a reinforcement learning-based self-healing framework for Software-Defined Networking (SDN) that autonomously manages diverse network faults in a realistically emulated environment. A Mininet testbed controlled by the Faucet SDN controller is instrumented with Prometheus to collect multi-source telemetry, while an automated fault injector and congestion generator produce link, port, flow and controller events alongside UDP-induced bottlenecks to create rich training data. Network features, including controller CPU and memory usage, OpenFlow statistics, port status and explicit fault labels, are periodically scraped and aggregated into a structured dataset that forms the state space of a custom Gym-compatible environment. A Proximal Policy Optimisation (PPO) agent with a multilayer perceptron policy learns discrete self-healing actions such as no-op, port resets, switch restarts and bespoke recovery procedures, guided by a reward function that penalises persistent faults and unnecessary interventions while rewarding timely and appropriate recovery. Experimental evaluation over multiple PPO training runs shows stable optimisation behaviour and high episodic rewards with long episode lengths. Policy output analysis stratified by fault state confirms that the agent has learned state-conditional recovery behaviour, selecting fault-type-appropriate actions in over 85% of fault-state timesteps, thereby providing direct evidence that the agent successfully distinguishes healthy from faulty conditions and among different fault types at the level of individual recovery decisions. Compared with existing RL-based approaches that focus primarily on link failure recovery or service function chain reconfiguration, the proposed framework handles a broader spectrum of SDN fault types and integrates control-plane, data-plane and congestion indicators, thereby offering a more general and robust self-healing capability for operational SDN environments.
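The reward shaping described in the abstract (penalising persistent faults and unnecessary interventions, rewarding timely and appropriate recovery) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the fault and action names, the fault-to-action mapping in `APPROPRIATE`, and all numeric reward values are hypothetical choices made only for illustration.

```python
from enum import IntEnum


class Fault(IntEnum):
    """Fault labels (hypothetical encoding of the fault types in the abstract)."""
    NONE = 0
    LINK = 1
    PORT = 2
    FLOW = 3
    CONTROLLER = 4


class Action(IntEnum):
    """Discrete self-healing actions named in the abstract (encoding assumed)."""
    NO_OP = 0
    PORT_RESET = 1
    SWITCH_RESTART = 2
    CUSTOM_RECOVERY = 3


# Hypothetical mapping from fault type to the action treated as appropriate.
# The paper's actual mapping is not given in the abstract.
APPROPRIATE = {
    Fault.NONE: Action.NO_OP,
    Fault.LINK: Action.CUSTOM_RECOVERY,
    Fault.PORT: Action.PORT_RESET,
    Fault.FLOW: Action.CUSTOM_RECOVERY,
    Fault.CONTROLLER: Action.SWITCH_RESTART,
}


def reward(fault: Fault, action: Action, steps_in_fault: int) -> float:
    """Return an illustrative reward for one timestep (all constants assumed)."""
    if fault == Fault.NONE:
        # Healthy network: any intervention is unnecessary and is penalised.
        return 1.0 if action == Action.NO_OP else -1.0
    if action == APPROPRIATE[fault]:
        # Appropriate recovery: the bonus shrinks the longer the fault lasted.
        return 10.0 / (1 + steps_in_fault)
    # Wrong or missing action: the fault persists and the penalty grows.
    return -1.0 - 0.5 * steps_in_fault
```

In a Gym-compatible environment, a function of this shape would typically be called inside `step()` with the fault label taken from the current row of the telemetry dataset, and the returned value passed to the PPO agent as the scalar reward.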

Item Type:	Article
Additional Information:	From Springer Nature via Jisc Publications Router. History: received 27-12-2025; rev-recd 17-04-2026; registration 20-04-2026; accepted 20-04-2026; epub 17-05-2026; online 17-05-2026; collection 01-12-2026. Licence for this article: http://creativecommons.org/licenses/by-nc-nd/4.0/
Keywords:	Faucet controller, Self-healing, PPO, SDN, Networks, Reinforcement learning
Divisions:	College of Creative Arts, Technology and Engineering > Computing
SWORD Depositor:	JISC Router
Depositing User:	JISC Router
Date Deposited:	21 May 2026 12:01
Last Modified:	08 Jul 2026 13:55
URI:	https://bnu.repository.guildhe.ac.uk/id/eprint/20994
