TY - GEN
T1 - GraphQL-Aware Healing in Service-Oriented Architectures via Multi-Signal Learning
AU - Mani, Nariman
AU - Attaranasl, Salma
AU - He, Sen
N1 - Publisher Copyright: © 2025 IEEE.
PY - 2025
Y1 - 2025
N2 - This paper introduces an adaptive test and runtime healing approach that delivers resolver-level resilience for GraphQL service-oriented architectures by unifying three telemetry streams: semantic log embeddings obtained from large language models, structural dependencies encoded via graph neural networks, and statistically grounded operational metrics. These signals are fused into a single reinforcement learning state vector, enabling a deep Q-network to learn context-aware recovery actions including selective retry, safe skip, dependency reordering, and escalation without obscuring root causes. The approach is evaluated in a production-grade case study involving a real-world lifestyle coaching platform used by thousands of active users. The application's asynchronous, cloud-native architecture with complex resolver interactions and AI-powered personalization provides a realistic and challenging environment for assessing the system's robustness. Across more than one thousand simulated failure episodes that inject realistic cloud uncertainty, the approach improves test and runtime success rates from 68.7% to 92%, reduces mean-time-to-recovery from 687 ms to 203 ms, and trims CI compute time by 61% using a KL-stability early-stop rule. It also preserves tail-latency accuracy within a 5% error bound while incurring only 11.8 ms median inference overhead per healed request. These results demonstrate that statistically principled, reinforcement-learning-driven healing offers a practical, fine-grained self-recovery solution for service-oriented systems deployed in modern, real-world cloud applications.
AB - This paper introduces an adaptive test and runtime healing approach that delivers resolver-level resilience for GraphQL service-oriented architectures by unifying three telemetry streams: semantic log embeddings obtained from large language models, structural dependencies encoded via graph neural networks, and statistically grounded operational metrics. These signals are fused into a single reinforcement learning state vector, enabling a deep Q-network to learn context-aware recovery actions including selective retry, safe skip, dependency reordering, and escalation without obscuring root causes. The approach is evaluated in a production-grade case study involving a real-world lifestyle coaching platform used by thousands of active users. The application's asynchronous, cloud-native architecture with complex resolver interactions and AI-powered personalization provides a realistic and challenging environment for assessing the system's robustness. Across more than one thousand simulated failure episodes that inject realistic cloud uncertainty, the approach improves test and runtime success rates from 68.7% to 92%, reduces mean-time-to-recovery from 687 ms to 203 ms, and trims CI compute time by 61% using a KL-stability early-stop rule. It also preserves tail-latency accuracy within a 5% error bound while incurring only 11.8 ms median inference overhead per healed request. These results demonstrate that statistically principled, reinforcement-learning-driven healing offers a practical, fine-grained self-recovery solution for service-oriented systems deployed in modern, real-world cloud applications.
KW - Adaptive Test Healing
KW - Flaky Tests
KW - Graph Neural Networks (GNN)
KW - Large Language Models (LLMs)
KW - Reinforcement Learning (RL)
UR - https://www.scopus.com/pages/publications/105016165220
UR - https://www.scopus.com/pages/publications/105016165220#tab=citedBy
U2 - 10.1109/SOSE67019.2025.00021
DO - 10.1109/SOSE67019.2025.00021
M3 - Conference contribution
AN - SCOPUS:105016165220
T3 - Proceedings - 19th IEEE International Conference on Service-Oriented System Engineering, SOSE 2025
SP - 140
EP - 150
BT - Proceedings - 19th IEEE International Conference on Service-Oriented System Engineering, SOSE 2025
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 19th IEEE International Conference on Service-Oriented System Engineering, SOSE 2025
Y2 - 21 July 2025 through 24 July 2025
ER -