Lightweight, high-resolution monitoring for troubleshooting production systems

Sapan Bhatia, Abhishek Kumar, Marc E. Fiuczynski, Larry Peterson

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

10 Scopus citations

Abstract

Production systems are commonly plagued by intermittent problems that are difficult to diagnose. This paper describes a new diagnostic tool, called Chopstix, that continuously collects profiles of low-level OS events (e.g., scheduling, L2 cache misses, CPU utilization, I/O operations, page allocation, locking) at the granularity of executables, procedures and instructions. Chopstix then reconstructs these events offline for analysis. We have used Chopstix to diagnose several elusive problems in a large-scale production system, thereby reducing these intermittent problems to reproducible bugs that can be debugged using standard techniques. The key to Chopstix is an approximate data collection strategy that incurs very low overhead. An evaluation shows Chopstix requires under 1% of the CPU, under 256KB of RAM, and under 16MB of disk space per day to collect a rich set of system-wide data.

Original language: English (US)
Title of host publication: Proceedings of the 8th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2008
Publisher: USENIX Association
Pages: 103-116
Number of pages: 14
ISBN (Electronic): 9781931971652
State: Published - 2019
Event: 8th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2008 - San Diego, United States
Duration: Dec 8 2008 → Dec 10 2008

Publication series

Name: Proceedings of the 8th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2008

Conference

Conference: 8th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2008
Country/Territory: United States
City: San Diego
Period: 12/8/08 → 12/10/08

ASJC Scopus subject areas

  • Information Systems
  • Computer Networks and Communications
  • Hardware and Architecture
