Paper Reading List

Background Reading

  • J. von Neumann. Probabilistic logics and the synthesis of reliable organisms from unreliable components. Automata Studies, 1956.
  • J. Hennessy. The Future of Systems Research. IEEE Computer 1999.
  • A. Avizienis et al. Basic concepts and taxonomy of dependable and secure computing. IEEE Trans. Dependable Sec. Comput. 2004.


  • The Matter of Heartbleed link
  • Meltdown: Reading Kernel Memory from User Space link

Hardware Resiliency

  • T. Karnik and P. Hazucha. Characterization of soft errors caused by single event upsets in CMOS processes. IEEE Trans. on Dependable and Secure Computing 2004.
  • B. Schroeder, E. Pinheiro, and W. Weber. DRAM errors in the wild: a large-scale field study. SIGMETRICS 2009. link
  • J. Bartlett, J. Gray, and B. Horst. Fault tolerance in Tandem computer systems. The Evolution of Fault-Tolerant Computing. Springer, 1987.
  • W. Bartlett and L. Spainhower. Commercial fault tolerance: a tale of two systems. IEEE Trans. Dependable Sec. Comput. 2004. link
  • J. Oplinger and M. S. Lam. Enhancing Software Reliability with Speculative Threads. ISCA 2002.
  • K-S. Yim et al. Hauberk: Lightweight silent data corruption error detector for GPGPU. IPDPS 2011.
  • J. H. Patel and L. Y. Fung. Concurrent Error Detection in ALU’s by Recomputing with Shifted Operands. IEEE Trans. on Computers 1982.

Software Systems

Software Fault Tolerance

  • Knight et al., An experimental evaluation of the assumption of independence in multiversion programming. IEEE TSE 1986 link
  • Akidau et al., MillWheel: Fault-Tolerant Stream Processing at Internet Scale. VLDB 2013 (Oct 3rd) link
  • A. Avizienis. The N-version approach to fault-tolerant software. IEEE Trans. on Software Engineering 1985.
  • Y. C. (Bob) Yeh. Design Considerations in Boeing 777 Fly-By-Wire Computers. High-Assurance Systems Engineering Symposium, 1998.
  • G. Candea et al. Microreboot: A technique for cheap recovery. OSDI 2004.
  • A. S. Tanenbaum, J. N. Herder, H. Bos. Can we make operating systems reliable and secure? IEEE Computer 2006.
  • K-H. Huang and J. A. Abraham. Algorithm-Based Fault Tolerance for Matrix Operations. IEEE Trans. Computers 1984.
  • C. Pham et al. Building Reliable and Secure Virtual Machines Using Architectural Invariants. DSN 2014.
  • K. Pattabiraman, Z. Kalbarczyk and R. K. Iyer. Automated Derivation of Application-Aware Error Detectors Using Static Analysis: The Trusted Illiac Approach. IEEE Trans. Dependable Sec. Comput. 2011.
  • Michael D. Ernst, Jake Cockrell, William G. Griswold, and David Notkin. Dynamically discovering likely program invariants to support program evolution. IEEE Trans. on Software Engineering 2001.
  • Pattabiraman et al., SymPLFIED: Symbolic Program-Level Fault Injection and Error Detection Framework, IEEE Transactions on Computers 2013 (Oct 10) link
  • Huang et al., Algorithm-Based Fault Tolerance for Matrix Operations, IEEE Transactions on Computers 1984 (Oct 10) link

Distributed Systems

  • F. B. Schneider. Implementing fault-tolerant services using the state machine approach: a tutorial. ACM Comput. Surv. 1990.
  • Lidong Zhou, Fred B. Schneider and Robbert Van Renesse. COCA: A secure distributed online certification authority. ACM Trans. on Computer Systems 2002.
  • E. Brewer. CAP twelve years later: How the “rules” have changed. IEEE Computer 2012.
  • M. Zaharia et al. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. NSDI 2012.
  • Diego Ongaro, John Ousterhout. In Search of an Understandable Consensus Algorithm (Extended Version) (Oct 24) link
  • Chandra et al. Paxos Made Live - An Engineering Perspective (Oct 24) link
  • Basile et al., Group Communication Protocols under Errors, Proceedings of the 22nd International Symposium on Reliable Distributed Systems (SRDS’03) (Oct 17) link
  • DeCandia et al., Dynamo: Amazon’s Highly Available Key-value Store, SOSP 2007 (Oct 17) link

Hyperscale Reliability

  • C. Di Martino et al. Lessons learned from the analysis of system failures at petascale: The case of Blue Waters. DSN 2014.
  • H. S. Gunawi et al. Why Does the Cloud Stop Computing? Lessons from Hundreds of Service Outages. SoCC 2016.
  • B. Schroeder, G. A. Gibson. Disk failures in the real world: what does an MTTF of 1,000,000 hours mean to you? FAST 2007.
  • J. Meza, Q. Wu, S. Kumar, O. Mutlu. A Large-Scale Study of Flash Memory Failures in the Field. SIGMETRICS 2015.

Fault Tolerance and Data-Science

  • M. C. Hsueh, R. K. Iyer and K. S. Trivedi. Performability Modeling Based on Real Data: A Case Study. IEEE Trans. on Computers 1988.
  • S. Agarwal, B. Mozafari, A. Panda, H. Milner, S. Madden and I. Stoica. BlinkDB: queries with bounded errors and bounded response times on very large data. EuroSys 2013.
  • J. Bornholt, T. Mytkowicz, K. S. McKinley. Uncertain<T>: A First-Order Type for Uncertain Data. ASPLOS 2014.
  • Herodotos Herodotou et al. Scalable near real-time failure localization of data center networks. KDD 2014.
  • Oberheide et al., CloudAV: N-Version Antivirus in the Network Cloud, USENIX SEC 2008, link

Approximate Computing

  • H. Cho, L. Leem, S. Mitra. ERSA: Error Resilient System Architecture for Probabilistic Applications. IEEE TCAD
  • S. Liu, K. Pattabiraman, T. Moscibroda, B. Zorn. Flikker: Saving DRAM Refresh-power through Critical Data Partitioning. ASPLOS 2011.
  • H. Esmaeilzadeh, A. Sampson, L. Ceze, D. Burger. Neural Acceleration for General-Purpose Approximate Programs. MICRO 2012.

Modeling, Simulation and Experimental Validation

  • R. K. Iyer et al. Measurement and Modeling of Computer Reliability as Affected by System Activity. ACM Trans. Comput. Syst. 1986.
  • Gwan S. Choi, Zbigniew T. Kalbarczyk. Wear-Out Simulation Environment for VLSI Designs. FTCS 1993.
  • K. M. Greenan, J. S. Plank, J. J. Wylie. Mean time to meaningless: MTTDL, Markov models, and storage system reliability. HotStorage 2010.
  • W. Gu, Z. Kalbarczyk and R. K. Iyer. Error Sensitivity of the Linux Kernel Executing on PowerPC G4 and Pentium 4 Processors. DSN 2004.
  • R. K. Iyer, L. T. Young and P. V. K. Iyer. Automatic Recognition of Intermittent Failures: An Experimental Study of Field Data. IEEE Trans. Computers 1990.
  • A. Reibman, R. Smith, and K. S. Trivedi. Markov and Markov Reward Model Transient Analysis: An Overview of Numerical Approaches. European Journal of Operational Research 1989.
  • D. M. Nicol, W. H. Sanders and K. S. Trivedi. Model-based evaluation: from dependability to security. IEEE Trans. Dependable Sec. Comput. 2004.
  • Z. Kalbarczyk et al. Hierarchical Simulation Approach to Accurate Fault Modeling for System Dependability Evaluation. IEEE Trans. on Software Engineering 1999.
  • L. Wang et al. Modeling coordinated checkpointing for large-scale supercomputers. DSN 2005. (Oct 3rd) link
  • D. P. Siewiorek, R. Chillarege and Z. T. Kalbarczyk. Reflections on industry trends and experimental research in dependability. IEEE Trans. Dependable Sec. Comput. 2004.
  • E. N. Elnozahy and J. S. Plank. Checkpointing for peta-scale systems: a look into the future of practical rollback-recovery. IEEE Trans. Dependable Sec. Comput. 2004.

Blockchain and Smart Contracts

  • M. Herlihy. Blockchains from a Distributed Computing Perspective. CACM 2019.
  • E. Androulaki et al. Hyperledger Fabric: a distributed operating system for permissioned blockchains. EuroSys 2018.
  • S. Nakamoto. Bitcoin: A Peer-to-Peer Electronic Cash System. 2009.
  • Y. Wang et al. Formal Specification and Verification of Smart Contracts for Azure Blockchain. arXiv 2019.