TY - JOUR
T1 - Limit of hardware solutions for self-protecting fault-tolerant NoCs
AU - Louri, Ahmed
AU - Collet, Jacques
AU - Karanth, Avinash
N1 - Funding Information:
This research was partially supported by NSF grants CCF-1420718, CCF-1513606, CCF-1703013, CCF-1547034, CCF-1547035, CCF-1540736, and CCF-1702980. Authors’ addresses: A. Louri, Department of Electrical and Computer Engineering, The George Washington University, 800 22nd Street NW, Room 5580, Washington DC 20052; J. Collet, Laboratoire d’Analyse et d’Architecture des Systèmes, Université Paul Sabatier, 7 avenue du colonel Roche, 31077 Toulouse Cedex 13; A. Karanth, School of Electrical Engineering and Computer Science, Ohio University, 322D Stocker Center, Athens, OH 45701.
Publisher Copyright:
© 2019 Association for Computing Machinery.
PY - 2019/1
Y1 - 2019/1
AB - We study the ultimate limits of hardware solutions for self-protection strategies against permanent faults in networks on chips (NoCs). NoC reliability is improved by replacing each base router with an augmented router that includes extra protection circuitry. We compare the protection achieved by the self-test and self-protect (STAP) architectures to that of triple modular redundancy with voting (TMR). Two STAP architectures are considered. In the first, a defective router self-disconnects from the network, while in the second it self-heals. In practice, none of the considered architectures (STAP or TMR) can tolerate all permanent faults, especially faults in the extra circuitry for protection or voting. Consequently, there will always be some unidentified defective augmented routers that transmit errors in an unpredictable manner. This study tackles this fundamental problem. Specifically, we determine the average percentage of residual unidentified defective routers (UDRs) and their impact on the overall reliability of the NoC in light of self-protection strategies. Our study shows that TMR is the most efficient solution for limiting the average percentage of UDRs when there are typically less than 0.1% of defective base routers. However, TMR is also the most cost-prohibitive and the least power-efficient. Above 1% of defective base routers, the STAP approaches are more efficient, although the protection efficiency decreases inexorably in very defective technologies (e.g., when 10% or more of the base routers are defective). For instance, if the chip includes 10% of defective base routers, our study shows that on average 1% of UDRs will remain, which poses a major challenge for NoC reliability.
KW - Built-in-self-test
KW - Network-on-chips
KW - Reliability
KW - Self-healing
UR - http://www.scopus.com/inward/record.url?scp=85061101764&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85061101764&partnerID=8YFLogxK
U2 - 10.1145/3233986
DO - 10.1145/3233986
M3 - Article
AN - SCOPUS:85061101764
SN - 1550-4832
VL - 15
JO - ACM Journal on Emerging Technologies in Computing Systems
JF - ACM Journal on Emerging Technologies in Computing Systems
IS - 1
M1 - 4
ER -