TY - GEN
T1 - Dynamic error mitigation in NoCs using intelligent prediction techniques
AU - DiTomaso, Dominic
AU - Boraten, Travis
AU - Kodi, Avinash
AU - Louri, Ahmed
N1 - Funding Information:
This research was partially supported by NSF grants CCF-1054339
Publisher Copyright:
© 2016 IEEE.
PY - 2016/12/14
Y1 - 2016/12/14
N2 - Network-on-chips (NoCs) are quickly becoming the standard communication fabric for multi-core systems. As technology continues to scale down into the nanometer regime, device behavior will become increasingly unreliable due to a combination of aging, soft errors, aggressive transistor design, and process-voltage-temperature variations. Further, stringent timing constraints in NoCs are designed so that data can be pushed faster. The net result is an increase in errors which must be mitigated by the NoC. Typical techniques for handling faults are often reactive as they respond to faults after the error has occurred, making the recovery process inefficient in energy and time. In this paper, we take a different approach wherein we propose to use proactive, fault-tolerant schemes to be employed before the fault affects the system. We propose to utilize machine learning techniques to train a decision tree which can be used to predict faults efficiently in the network. Based on the prediction model, we dynamically mitigate these predicted faults through error correction codes (ECC) and relaxed timing transmission. Our results indicate that, on average, we can accurately predict timing errors 60.6% better than a static single error correction and double error detection (SECDED) technique, resulting in an average 26.8% reduction in retransmitted packets, an average net speedup of 3.31x, and an average energy savings of 60.0% over other designs for real traffic patterns.
AB - Network-on-chips (NoCs) are quickly becoming the standard communication fabric for multi-core systems. As technology continues to scale down into the nanometer regime, device behavior will become increasingly unreliable due to a combination of aging, soft errors, aggressive transistor design, and process-voltage-temperature variations. Further, stringent timing constraints in NoCs are designed so that data can be pushed faster. The net result is an increase in errors which must be mitigated by the NoC. Typical techniques for handling faults are often reactive as they respond to faults after the error has occurred, making the recovery process inefficient in energy and time. In this paper, we take a different approach wherein we propose to use proactive, fault-tolerant schemes to be employed before the fault affects the system. We propose to utilize machine learning techniques to train a decision tree which can be used to predict faults efficiently in the network. Based on the prediction model, we dynamically mitigate these predicted faults through error correction codes (ECC) and relaxed timing transmission. Our results indicate that, on average, we can accurately predict timing errors 60.6% better than a static single error correction and double error detection (SECDED) technique, resulting in an average 26.8% reduction in retransmitted packets, an average net speedup of 3.31x, and an average energy savings of 60.0% over other designs for real traffic patterns.
UR - http://www.scopus.com/inward/record.url?scp=85009391343&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85009391343&partnerID=8YFLogxK
U2 - 10.1109/MICRO.2016.7783734
DO - 10.1109/MICRO.2016.7783734
M3 - Conference contribution
AN - SCOPUS:85009391343
T3 - Proceedings of the Annual International Symposium on Microarchitecture, MICRO
BT - MICRO 2016 - 49th Annual IEEE/ACM International Symposium on Microarchitecture
PB - IEEE Computer Society
T2 - 49th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2016
Y2 - 15 October 2016 through 19 October 2016
ER -