TY - GEN
T1 - Stargate
T2 - 36th Annual ACM Symposium on Applied Computing, SAC 2021
AU - Choi, Illyoung
AU - Hartman, John H.
N1 - Funding Information:
We thank Hurwitz Lab for support and comments on the system design. This research was funded in part by NSF grants OAR-1640775 and OAR-1541318.
Publisher Copyright:
© 2021 ACM.
PY - 2021/3/22
Y1 - 2021/3/22
N2 - The transfer of large-scale datasets between geographically separated systems is a challenge in scientific computing, made even more complicated when the systems are clusters of computers. In this paper we present Stargate, a file system that enables efficient on-demand remote data access for Hadoop-based scientific computations. Stargate uses a content-addressable protocol, on-demand access, and multi-tier caching to address the challenges of large data transfers over a WAN. Stargate also uses a novel approach that co-locates computations and transfers to achieve efficient data access in cluster computing. Unlike other approaches, Stargate is implemented as an independent file system service that works with any computation framework. In our experiments Stargate's performance on heavy I/O workloads was 7% faster than WebHDFS and only 8% slower than HDFS. In addition, Stargate's caches effectively trade high-cost WAN traffic for low-cost LAN traffic. Stargate's performance, on-demand data access, and reduction in WAN traffic make it a good platform for providing remote dataset access to scientific computations on clusters.
KW - WAN
KW - WAN file system
KW - cluster-to-cluster data transfer
KW - file system
KW - on-demand remote data access
KW - remote data access
KW - wide-area network
UR - http://www.scopus.com/inward/record.url?scp=85104985846&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85104985846&partnerID=8YFLogxK
U2 - 10.1145/3412841.3441635
DO - 10.1145/3412841.3441635
M3 - Conference contribution
AN - SCOPUS:85104985846
T3 - Proceedings of the ACM Symposium on Applied Computing
SP - 32
EP - 39
BT - Proceedings of the 36th Annual ACM Symposium on Applied Computing, SAC 2021
PB - Association for Computing Machinery
Y2 - 22 March 2021 through 26 March 2021
ER -