TY - GEN
T1 - High-performance, Energy-efficient, Fault-tolerant Network-on-Chip Design Using Reinforcement Learnin
AU - Wang, Ke
AU - Louri, Ahmed
AU - Karanth, Avinash
AU - Bunescu, Razvan
N1 - Publisher Copyright:
© 2019 EDAA.
PY - 2019/5/14
Y1 - 2019/5/14
N2 - Network-on-Chips (NoCs) are becoming the standard communication fabric for multi-core and system on a chip (SoC) architectures. As technology continues to scale, transistors and wires on the chip are becoming increasingly vulnerable to various fault mechanisms, especially timing errors, resulting in exacerbation of energy efficiency and performance for NoCs. Typical techniques for handling timing errors are reactive in nature, responding to the faults after their occurrence. They rely on error detection/correction techniques which have resulted in excessive power consumption and degraded performance, since the error detection/correction hardware is constantly enabled. On the other hand, indiscriminately disabling error handling hardware can induce more errors and intrusive retransmission traffic. Therefore, the challenge is to balance the trade-offs among error rate, packet retransmission, performance, and energy. In this paper, we propose a proactive fault-tolerant mechanism to optimize energy efficiency and performance with reinforcement learning (RL). First, we propose a new proactive error handling technique comprised of a dynamic scheme for enabling per-router error detection/correction hardware and an effective retransmission mechanism. Second, we propose the use of RL to train the dynamic control policy with the goals of providing increased fault-tolerance, reduced power consumption and improved performance as compared to conventional techniques. Our evaluation indicates that, on average, end-to-end packet latency is lowered by 55%, energy efficiency is improved by 64%, and retransmission caused by faults is reduced by 48% over the reactive error correction techniques.
AB - Network-on-Chips (NoCs) are becoming the standard communication fabric for multi-core and system on a chip (SoC) architectures. As technology continues to scale, transistors and wires on the chip are becoming increasingly vulnerable to various fault mechanisms, especially timing errors, resulting in exacerbation of energy efficiency and performance for NoCs. Typical techniques for handling timing errors are reactive in nature, responding to the faults after their occurrence. They rely on error detection/correction techniques which have resulted in excessive power consumption and degraded performance, since the error detection/correction hardware is constantly enabled. On the other hand, indiscriminately disabling error handling hardware can induce more errors and intrusive retransmission traffic. Therefore, the challenge is to balance the trade-offs among error rate, packet retransmission, performance, and energy. In this paper, we propose a proactive fault-tolerant mechanism to optimize energy efficiency and performance with reinforcement learning (RL). First, we propose a new proactive error handling technique comprised of a dynamic scheme for enabling per-router error detection/correction hardware and an effective retransmission mechanism. Second, we propose the use of RL to train the dynamic control policy with the goals of providing increased fault-tolerance, reduced power consumption and improved performance as compared to conventional techniques. Our evaluation indicates that, on average, end-to-end packet latency is lowered by 55%, energy efficiency is improved by 64%, and retransmission caused by faults is reduced by 48% over the reactive error correction techniques.
UR - http://www.scopus.com/inward/record.url?scp=85066612816&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85066612816&partnerID=8YFLogxK
U2 - 10.23919/DATE.2019.8714869
DO - 10.23919/DATE.2019.8714869
M3 - Conference contribution
AN - SCOPUS:85066612816
T3 - Proceedings of the 2019 Design, Automation and Test in Europe Conference and Exhibition, DATE 2019
SP - 1166
EP - 1171
BT - Proceedings of the 2019 Design, Automation and Test in Europe Conference and Exhibition, DATE 2019
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 22nd Design, Automation and Test in Europe Conference and Exhibition, DATE 2019
Y2 - 25 March 2019 through 29 March 2019
ER -