San Fermín: Aggregating large data sets using a binomial swap forest

Justin Cappos, John H. Hartman

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

12 Scopus citations

Abstract

San Fermín is a system for aggregating large amounts of data from the nodes of large-scale distributed systems. Each San Fermín node individually computes the aggregated result by swapping data with other nodes to dynamically create its own binomial tree. Nodes that fall behind abort their trees, thereby reducing overhead. Having each node create its own binomial tree makes San Fermín highly resilient to failures and ensures that the internal nodes of the tree have high capacity, thereby reducing completion time. Compared to existing solutions, San Fermín handles large aggregations better, has higher completeness when nodes fail, computes the result faster, and has better scalability. We analyze the completion time, completeness, and overhead of San Fermín versus existing solutions using analytical models, simulation, and experimentation with a prototype built on a peer-to-peer system deployed on PlanetLab. Our evaluation shows that San Fermín is scalable both in the number of nodes and in the aggregated data size. San Fermín aggregates large amounts of data significantly faster than existing solutions: compared to SDIMS, an existing aggregation system, San Fermín computes a 1 MB result from 100 PlanetLab nodes in 61-76% of the time and from 2-6 times as many nodes. Even if 10% of the nodes fail during aggregation, San Fermín still includes the data from 97% of the nodes in the result and does so faster than the underlying peer-to-peer system recovers from failures.
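The swapping idea in the abstract can be illustrated with a minimal, failure-free sketch. This is not the authors' implementation: the function name `swap_aggregate`, the pairing of nodes by flipping one bit of a numeric ID per round, and the power-of-two node count are all assumptions made for illustration. The real system also chooses partners through a peer-to-peer overlay, tolerates failures, and lets slow nodes abort their trees, none of which is modeled here.

```python
from typing import Callable, List


def swap_aggregate(values: List[int],
                   combine: Callable[[int, int], int]) -> List[int]:
    """Simulate pairwise swapping among 2**k nodes with no failures.

    values[i] is the local data of the hypothetical node with ID i.
    combine must be commutative and associative (e.g. sum or max).
    Returns the aggregate held by each node after the final round.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("this sketch assumes a power-of-two number of nodes")

    state = list(values)
    bit = 1
    while bit < n:
        # In each round, node i swaps its partial aggregate with the node
        # whose ID differs from i in exactly this bit, then combines both.
        state = [combine(state[i], state[i ^ bit]) for i in range(n)]
        bit <<= 1
    return state


# Every node ends up holding the aggregate of all eight inputs.
results = swap_aggregate([3, 1, 4, 1, 5, 9, 2, 6], lambda a, b: a + b)
```

After log2(n) rounds each node has combined data from every other node, so every entry of `results` is the same value. The exchanges a single node takes part in trace out a binomial tree rooted at that node, which is the sense in which each node builds its own tree.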

Original language: English (US)
Title of host publication: 5th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2008
Publisher: USENIX Association
Pages: 147-160
Number of pages: 14
ISBN (Electronic): 9781931971584
State: Published - 2008
Event: 5th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2008 - San Francisco, United States
Duration: Apr 16 2008 to Apr 18 2008

Publication series

Name: 5th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2008

Conference

Conference: 5th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2008
Country/Territory: United States
City: San Francisco
Period: 4/16/08 to 4/18/08

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Control and Systems Engineering
