TY - JOUR
T1 - Architecture-aware coding for distributed storage
T2 - Repairable block failure resilient codes
AU - Calis, Gokhan
AU - Koyluoglu, O. Ozan
N1 - Funding Information:
G. Calis is with Western Digital Corporation, San Jose, CA 95119. O. Ozan Koyluoglu is with University of California, Berkeley, CA 94720. The paper was submitted when both authors were with Department of Electrical and Computer Engineering, University of Arizona, Tucson, AZ 85712. This paper was in part presented at 2014 IEEE International Symposium on Information Theory (ISIT 2014), Honolulu, HI, June 2014. This work is supported in part by the National Science Foundation under Grants No CCF-1563622 and CNS-1617335. ∗ Corresponding author: Gokhan Calis.
Publisher Copyright:
© 2018 AIMS.
PY - 2018
Y1 - 2018
N2 - In large scale distributed storage systems (DSS) deployed in cloud computing, correlated failures resulting in simultaneous failure (or, unavailability) of blocks of nodes are common. In such scenarios, the stored data or a content of a failed node can only be reconstructed from the available live nodes belonging to the available blocks. To analyze the resilience of the system against such block failures, this work introduces the framework of Block Failure Resilient (BFR) codes, wherein the data (e.g., a file in DSS) can be decoded by reading out from a same number of codeword symbols (nodes) from a subset of available blocks of the underlying codeword. Further, repairable BFR codes are introduced, wherein any codeword symbol in a failed block can be repaired by contacting a subset of remaining blocks in the system. File size bounds for repairable BFR codes are derived, and the trade-off between per node storage and repair bandwidth is analyzed, and the corresponding minimum storage regenerating (BFR-MSR) and minimum bandwidth regenerating (BFR-MBR) points are derived. Explicit codes achieving the two operating points for a special case of parameters are constructed, wherein the underlying regenerating codewords are distributed to BFR codeword symbols according to combinatorial designs. Finally, BFR locally repairable codes (BFR-LRC) are introduced, an upper bound on the resilience is derived and optimal code construction are provided by a concatenation of Gabidulin and MDS codes. Repair efficiency of BFR-LRC is further studied via the use of BFR-MSR/MBR codes as local codes. Code constructions achieving optimal resilience for BFR-MSR/MBR-LRCs are provided for certain parameter regimes. Overall, this work introduces the framework of block failures along with optimal code constructions, and the study of architecture-aware coding for distributed storage systems.
AB - In large scale distributed storage systems (DSS) deployed in cloud computing, correlated failures resulting in simultaneous failure (or, unavailability) of blocks of nodes are common. In such scenarios, the stored data or a content of a failed node can only be reconstructed from the available live nodes belonging to the available blocks. To analyze the resilience of the system against such block failures, this work introduces the framework of Block Failure Resilient (BFR) codes, wherein the data (e.g., a file in DSS) can be decoded by reading out from a same number of codeword symbols (nodes) from a subset of available blocks of the underlying codeword. Further, repairable BFR codes are introduced, wherein any codeword symbol in a failed block can be repaired by contacting a subset of remaining blocks in the system. File size bounds for repairable BFR codes are derived, and the trade-off between per node storage and repair bandwidth is analyzed, and the corresponding minimum storage regenerating (BFR-MSR) and minimum bandwidth regenerating (BFR-MBR) points are derived. Explicit codes achieving the two operating points for a special case of parameters are constructed, wherein the underlying regenerating codewords are distributed to BFR codeword symbols according to combinatorial designs. Finally, BFR locally repairable codes (BFR-LRC) are introduced, an upper bound on the resilience is derived and optimal code construction are provided by a concatenation of Gabidulin and MDS codes. Repair efficiency of BFR-LRC is further studied via the use of BFR-MSR/MBR codes as local codes. Code constructions achieving optimal resilience for BFR-MSR/MBR-LRCs are provided for certain parameter regimes. Overall, this work introduces the framework of block failures along with optimal code constructions, and the study of architecture-aware coding for distributed storage systems.
KW - Distributed storage and regenerating codes
KW - Information theory
UR - http://www.scopus.com/inward/record.url?scp=85052139536&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85052139536&partnerID=8YFLogxK
U2 - 10.3934/amc.2018028
DO - 10.3934/amc.2018028
M3 - Article
AN - SCOPUS:85052139536
VL - 12
SP - 465
EP - 503
JO - Advances in Mathematics of Communications
JF - Advances in Mathematics of Communications
SN - 1930-5346
IS - 3
ER -