TY - GEN
T1 - On the power of in-network caching in the Hadoop distributed file system
AU - Newberry, Eric
AU - Zhang, Beichuan
N1 - Publisher Copyright: © 2019 Copyright held by the owner/author(s).
PY - 2019/9/24
Y1 - 2019/9/24
N2 - The Hadoop Distributed File System (HDFS) is a network file system used to support multiple widely used big data frameworks that can scale to run on large clusters. In this paper, we evaluate the effectiveness of using in-network caching on switches in HDFS-supported clusters to reduce per-link bandwidth usage in the network. We discovered that some applications featured large amounts of data requested by multiple clients and that, by caching read data in the network, the average per-link bandwidth usage of read operations in these applications could be reduced by more than half. We also found that the choice of cache replacement policy could have a significant impact on caching effectiveness in this environment, with LIRS and ARC generally performing best for larger and smaller cache sizes, respectively. Moreover, given the structure of HDFS write operations, we developed a mechanism to reduce the total per-link bandwidth usage of HDFS write operations by replacing write pipelining with multicast. To evaluate the potential of in-network caching, we developed a simulator to replay real traces through a fat-tree network simulating the caching architecture used in the Named Data Networking (NDN) information-centric networking (ICN) architecture. Our results suggest that ICN-style in-network caching can provide significant benefits to HDFS-supported big data clusters, justifying future work to apply ICN architectures to cluster environments.
KW - Big data
KW - Caching
KW - HDFS
KW - ICN
KW - Information-centric networking
KW - NDN
KW - Named data networking
KW - Spark
UR - http://www.scopus.com/inward/record.url?scp=85074084519&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85074084519&partnerID=8YFLogxK
U2 - 10.1145/3357150.3357392
DO - 10.1145/3357150.3357392
M3 - Conference contribution
AN - SCOPUS:85074084519
T3 - ICN 2019 - Proceedings of the 2019 Conference on Information-Centric Networking
SP - 89
EP - 99
BT - ICN 2019 - Proceedings of the 2019 Conference on Information-Centric Networking
PB - Association for Computing Machinery, Inc
T2 - 6th ACM Conference on Information-Centric Networking, ICN 2019
Y2 - 24 September 2019 through 26 September 2019
ER -