TY - GEN
T1 - SDM
T2 - 15th IEEE International Conference on eScience, eScience 2019
AU - Choi, Illyoung
AU - Nelson, Jude
AU - Peterson, Larry Lee
AU - Hartman, John
N1 - Funding Information:
We thank Zack Williams and Jack L Pogue III for their contributions to Syndicate. We thank Dr. Bonnie Hurwitz and the members of the Hurwitz Lab for feedback on the system design and usage scenarios. We thank Dr. Nirav Merchant and all members of CyVerse for providing compute and storage resources. This research was funded in part by NSF grants OAR-1640775 and OAR-1541318.
Publisher Copyright:
© 2019 IEEE.
PY - 2019/9
Y1 - 2019/9
N2 - Scientific computing is becoming more data-centric and more collaborative, requiring increasingly large datasets to be transferred across the Internet. Transferring these datasets efficiently and making them accessible to scientific workflows is an increasingly difficult task. In addition, the data transfer time can be a significant portion of the overall workflow running time. This paper presents SDM (Syndicate Dataset Manager), a scientific dataset delivery platform. Unlike general-purpose data transfer tools, SDM offers on-demand access to remote scientific datasets. On-demand access does not require staging datasets to local file systems before computing on them, and it also enables overlapping computation and I/O. In addition, SDM offers a simple interface for users to locate and access datasets. To validate the usefulness of SDM, we performed realistic metagenomic sequence analysis workflows on remote genomic datasets. In general, SDM configured with a CDN outperforms existing data access methods. With warm CDN caches, SDM completes the workflow 17-20% faster than staging methods. Its performance is even comparable to local storage: SDM is only 9% slower than local HDD storage and 18% slower than local SSD storage. Together, its performance and its ease of use make SDM an attractive platform for performing scientific workflows on remote datasets.
KW - Cloud storage
KW - Data delivery platform
KW - Data transfer
KW - Scientific computing
KW - Wide-area network
UR - http://www.scopus.com/inward/record.url?scp=85083188481&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85083188481&partnerID=8YFLogxK
U2 - 10.1109/eScience.2019.00049
DO - 10.1109/eScience.2019.00049
M3 - Conference contribution
AN - SCOPUS:85083188481
T3 - Proceedings - IEEE 15th International Conference on eScience, eScience 2019
SP - 378
EP - 387
BT - Proceedings - IEEE 15th International Conference on eScience, eScience 2019
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 24 September 2019 through 27 September 2019
ER -