Paper Reading List
Background Reading
- J. Von Neumann. Probabilistic logics and synthesis of reliable organisms from unreliable components. Automata Studies, 1956.
- J. Hennessy. The Future of Systems Research. IEEE Computer 1999.
- A. Avizienis et al. Basic concepts and taxonomy of dependable and secure computing. IEEE Trans. Dependable Sec. Comput. 2004.
Security
Hardware Resiliency
- T. Karnik and P. Hazucha. Characterization of soft errors caused by single event upsets in CMOS processes. IEEE Tran. on Dependable and Secure Computing 2004.
- B. Schroeder, E. Pinheiro, and W. Weber. DRAM errors in the wild: a large-scale field study. SIGMETRICS 2009.
- J. Bartlett, J. Gray, and B. Horst. Fault tolerance in Tandem computer systems. The Evolution of Fault-Tolerant Computing. Springer, 1987.
- W. Bartlett and L. Spainhower. Commercial fault tolerance: a tale of two systems. IEEE Trans. Dependable Sec. Comput. 2004.
- J. Oplinger and M. S. Lam. Enhancing Software Reliability with Speculative Threads. ASPLOS 2002.
- K-S. Yim et al. Hauberk: Lightweight silent data corruption error detector for GPGPU. IPDPS 2011.
- J. H. Patel and L. Y. Fung. Concurrent Error Detection in ALU’s by Recomputing with Shifted Operands. IEEE Tran. on Computers 1982.
Software Systems
Software Fault Tolerance
- Knight et al., An experimental evaluation of the assumption of independence in multiversion programming. IEEE TSE 1986.
- Akidau et al., MillWheel: Fault-Tolerant Stream Processing at Internet Scale. VLDB 2013 (Oct 3rd)
- A. Avizienis. The N-version approach to fault-tolerant software. IEEE Tran. on Software Engineering 1985.
- Y. C. (Bob) Yeh. Design Considerations in Boeing 777 Fly-By-Wire Computers. High-Assurance Systems Engineering Symposium, 1998.
- G. Candea et al. Microreboot: A technique for cheap recovery. OSDI 2004.
- A. S. Tanenbaum, J. N. Herder, H. Bos. Can we make operating systems reliable and secure? IEEE Computer 2006.
- K-H. Huang and J. A. Abraham. Algorithm-Based Fault Tolerance for Matrix Operations. IEEE Trans. Computers 1984.
- C. Pham et al. Building Reliable and Secure Virtual Machines Using Architectural Invariants. DSN 2014.
- K. Pattabiraman, Z. Kalbarczyk and R. K. Iyer. Automated Derivation of Application-Aware Error Detectors Using Static Analysis: The Trusted Illiac Approach. IEEE Trans. Dependable Sec. Comput. 2011.
- Michael D. Ernst, Jake Cockrell, William G. Griswold, and David Notkin. Dynamically discovering likely program invariants to support program evolution. IEEE Tran. on Software Engineering 2001.
- Pattabiraman et al., SymPLFIED: Symbolic Program-Level Fault Injection and Error Detection Framework, IEEE Transactions on Computers 2013 (Oct 10)
- Huang et al., Algorithm-Based Fault Tolerance for Matrix Operations, IEEE Transactions on Computers 1984 (Oct 10)
Distributed Systems
- F. B. Schneider. Implementing fault-tolerant services using the state machine approach: a tutorial. ACM Comput. Surv. 1990.
- Lidong Zhou, Fred B. Schneider and Robbert Van Renesse. COCA: A secure distributed online certification authority. ACM Tran. on Computer Systems 2002.
- E. Brewer. CAP twelve years later: How the “rules” have changed. IEEE Computer 2012.
- M. Zaharia et al. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. NSDI 2012.
- Diego Ongaro, John Ousterhout. In Search of an Understandable Consensus Algorithm (Extended Version) (Oct 24)
- Chandra et al. Paxos Made Live - An Engineering Perspective (Oct 24)
- Basile et al., Group Communication Protocols under Errors, Proceedings of the 22nd International Symposium on Reliable Distributed Systems (SRDS’03) (Oct 17)
- DeCandia et al., Dynamo: Amazon’s Highly Available Key-value Store, SOSP 2007 (Oct 17)
Hyperscale Reliability
- C. Di Martino et al. Lessons learned from the analysis of system failures at petascale: The case of Blue Waters. DSN 2014.
- H. S. Gunawi et al. Why Does the Cloud Stop Computing? Lessons from Hundreds of Service Outages. SoCC 2016.
- B. Schroeder, G. A. Gibson. Disk failures in the real world: what does an MTTF of 1,000,000 hours mean to you? FAST 2007.
- J. Meza, Q. Wu, S. Kumar, O. Mutlu. A Large-Scale Study of Flash Memory Failures in the Field. SIGMETRICS 2015.
Fault Tolerance and Data-Science
- M. C. Hsueh, R. K. Iyer and K. S. Trivedi. Performability Modeling Based on Real Data: a case study. IEEE Tran. on Computers 1988.
- S. Agarwal, B. Mozafari, A. Panda, H. Milner, S. Madden and I. Stoica. BlinkDB: queries with bounded errors and bounded response times on very large data. EuroSys 2013.
- J. Bornholt, T. Mytkowicz, K. S. McKinley. Uncertain<T>: A First-Order Type for Uncertain Data. ASPLOS 2014.
- Herodotos Herodotou et al. Scalable near real-time failure localization of data center networks. KDD 2014.
- Oberheide et al., CloudAV: N-Version Antivirus in the Network Cloud, USENIX SEC 2008.
Approximate Computing
- H. Cho, L. Leem, S. Mitra. ERSA: Error Resilient System Architecture for Probabilistic Applications. IEEE TCAD
- S. Liu, K. Pattabiraman, T. Moscibroda, B. Zorn. Flikker: Saving DRAM Refresh-power through Critical Data Partitioning. ASPLOS 2011.
- H. Esmaeilzadeh, A. Sampson, L. Ceze, D. Burger. Neural Acceleration for General-Purpose Approximate Programs. MICRO 2012.
Modeling, Simulation and Experimental Validation
- R. K. Iyer et al. Measurement and Modeling of Computer Reliability as Affected by System Activity. ACM Tran. Comput. Sys. 1986.
- Gwan S. Choi, Zbigniew T. Kalbarczyk. Wear-Out Simulation Environment for VLSI Designs. FTCS 1993.
- K. M. Greenan, J. S. Plank, J. J. Wylie. Mean time to meaningless: MTTDL, Markov models, and storage system reliability. HotStorage 2010.
- W. Gu, Z. Kalbarczyk and R. K. Iyer. Error Sensitivity of the Linux Kernel Executing on PowerPC G4 and Pentium 4 Processors. DSN 2004.
- R. K. Iyer, L. T. Young and P. V. K. Iyer. Automatic Recognition of Intermittent Failures: An Experimental Study of Field Data. IEEE Trans. Computers 1990.
- A. Reibman, R. Smith, and K. S. Trivedi. Markov and Markov Reward Model Transient Analysis: An Overview of Numerical Approaches. European Journal of Operational Research 1989.
- D. M. Nicol, W. H. Sanders and K. S. Trivedi. Model-based evaluation: from dependability to security. IEEE Trans. Dependable Sec. Comput. 2004
- Z. Kalbarczyk et al. Hierarchical Simulation Approach to Accurate Fault Modeling for System Dependability Evaluation. IEEE Tran. on Software Engineering 1999.
- L. Wang et al. Modeling coordinated checkpointing for large-scale supercomputers. DSN 2005. (Oct 3rd)
- D. P. Siewiorek, R. Chillarege and Z. T. Kalbarczyk. Reflections on industry trends and experimental research in dependability. IEEE Trans. Dependable Sec. Comput. 2004
- E. N. Elnozahy and J. S. Plank. Checkpointing for peta-scale systems: a look into the future of practical rollback-recovery. IEEE Trans. Dependable Sec. Comput. 2004
Blockchain and Smart Contracts
- M. Herlihy. Blockchains From a Distributed Computing Perspective. CACM 2019.
- E. Androulaki et al. Hyperledger Fabric: a distributed operating system for permissioned blockchains. EuroSys 2018.
- S. Nakamoto. Bitcoin: A Peer-to-Peer Electronic Cash System. Bitcoin.org 2009.
- Y. Wang et al. Formal Specification and Verification of Smart Contracts for Azure Blockchain. arXiv 2019.