Paper Reading List

Background Reading

J. von Neumann. Probabilistic logics and the synthesis of reliable organisms from unreliable components. Automata Studies, 1956.
J. Hennessy. The Future of Systems Research. IEEE Computer 1999.
A. Avizienis et al. Basic concepts and taxonomy of dependable and secure computing. IEEE Trans. Dependable Sec. Comput. 2004.

Security

The Matter of Heartbleed link
Meltdown: Reading Kernel Memory from User Space link

Hardware Resiliency

T. Karnik and P. Hazucha. Characterization of soft errors caused by single event upsets in CMOS processes. IEEE Trans. on Dependable and Secure Computing 2004.
B. Schroeder, E. Pinheiro, and W. Weber. DRAM errors in the wild: a large-scale field study. SIGMETRICS 2009. link
J. Bartlett, J. Gray, and B. Horst. Fault tolerance in Tandem computer systems. The Evolution of Fault-Tolerant Computing. Springer, 1987.
W. Bartlett and L. Spainhower. Commercial fault tolerance: a tale of two systems. IEEE Trans. Dependable Sec. Comput. 2004. link
J. Oplinger and M. S. Lam. Enhancing Software Reliability with Speculative Threads. ISCA 2002.
K-S. Yim et al. Hauberk: Lightweight silent data corruption error detector for GPGPU. IPDPS 2011.
J. H. Patel and L. Y. Fung. Concurrent Error Detection in ALU’s by Recomputing with Shifted Operands. IEEE Trans. on Computers 1982.

Software Systems

Software Fault Tolerance

Knight et al., An experimental evaluation of the assumption of independence in multiversion programming. IEEE TSE 1986 link
Akidau et al., MillWheel: Fault-Tolerant Stream Processing at Internet Scale. VLDB 2013 (Oct 3rd) link
A. Avizienis. The N-version approach to fault-tolerant software. IEEE Trans. on Software Engineering 1985.
Y. C. (Bob) Yeh. Design Considerations in Boeing 777 Fly-By-Wire Computers. High-Assurance Systems Engineering Symposium, 1998.
G. Candea et al. Microreboot: A technique for cheap recovery. OSDI 2004.
A. S. Tanenbaum, J. N. Herder, H. Bos. Can we make operating systems reliable and secure? IEEE Computer 2006.
K-H. Huang and J. A. Abraham. Algorithm-Based Fault Tolerance for Matrix Operations. IEEE Trans. Computers 1984.

C. Pham et al. Building Reliable and Secure Virtual Machines Using Architectural Invariants. DSN 2014.
K. Pattabiraman, Z. Kalbarczyk and R. K. Iyer. Automated Derivation of Application-Aware Error Detectors Using Static Analysis: The Trusted Illiac Approach. IEEE Trans. Dependable Sec. Comput. 2011.
Michael D. Ernst, Jake Cockrell, William G. Griswold, and David Notkin. Dynamically discovering likely program invariants to support program evolution. IEEE Trans. on Software Engineering 2001.
Pattabiraman et al., SymPLFIED: Symbolic Program-Level Fault Injection and Error Detection Framework, IEEE Transactions on Computers 2013 (Oct 10) link
Huang et al., Algorithm-Based Fault Tolerance for Matrix Operations, IEEE Transactions on Computers 1984 (Oct 10) link

Distributed Systems

F. B. Schneider. Implementing fault-tolerant services using the state machine approach: a tutorial. ACM Comput. Surv. 1990.
Lidong Zhou, Fred B. Schneider and Robbert Van Renesse. COCA: A secure distributed online certification authority. ACM Trans. on Computer Systems 2002.
E. Brewer. CAP twelve years later: How the “rules” have changed. IEEE Computer 2012.
M. Zaharia et al. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. NSDI 2012.
Diego Ongaro, John Ousterhout. In Search of an Understandable Consensus Algorithm (Extended Version) (Oct 24) link
Chandra et al. Paxos Made Live - An Engineering Perspective (Oct 24) link
Basile et al., Group Communication Protocols under Errors, Proceedings of the 22nd International Symposium on Reliable Distributed Systems (SRDS’03) (Oct 17) link
DeCandia et al., Dynamo: Amazon’s Highly Available Key-value Store, SOSP 2007 (Oct 17) link

Hyperscale Reliability

C. Di Martino et al. Lessons learned from the analysis of system failures at petascale: The case of Blue Waters. DSN 2014.
H. S. Gunawi et al. Why Does the Cloud Stop Computing? Lessons from Hundreds of Service Outages. SoCC 2016.
B. Schroeder, G. A. Gibson. Disk failures in the real world: what does an MTTF of 1,000,000 hours mean to you? FAST 2007.
J. Meza, Q. Wu, S. Kumar, O. Mutlu. A Large-Scale Study of Flash Memory Failures in the Field. SIGMETRICS 2015.

Fault Tolerance and Data-Science

M. C. Hsueh, R. K. Iyer and K. S. Trivedi. Performability Modeling Based on Real Data: a case study. IEEE Trans. on Computers 1988.
S. Agarwal, B. Mozafari, A. Panda, H. Milner, S. Madden and I. Stoica. BlinkDB: queries with bounded errors and bounded response times on very large data. EuroSys 2013.
J. Bornholt, T. Mytkowicz, K. S. McKinley. Uncertain<T>: A First-Order Type for Uncertain Data. ASPLOS 2014.
Herodotos Herodotou et al. Scalable near real-time failure localization of data center networks. KDD 2014.
Oberheide et al., CloudAV: N-Version Antivirus in the Network Cloud, USENIX SEC 2008, link

Approximate Computing

H. Cho, L. Leem, S. Mitra. ERSA: Error Resilient System Architecture for Probabilistic Applications. IEEE TCAD

S. Liu, K. Pattabiraman, T. Moscibroda, B. Zorn. Flikker: Saving DRAM Refresh-power through Critical Data Partitioning. ASPLOS 2011.
H. Esmaeilzadeh, A. Sampson, L. Ceze, D. Burger. Neural Acceleration for General-Purpose Approximate Programs. MICRO 2012.

Modeling, Simulation and Experimental Validation

R. K. Iyer et al. Measurement and Modeling of Computer Reliability as Affected by System Activity. ACM Trans. Comput. Syst. 1986.
Gwan S. Choi, Zbigniew T. Kalbarczyk. Wear-Out Simulation Environment for VLSI Designs. FTCS 1993.
K. M. Greenan, J. S. Plank, J. J. Wylie. Mean time to meaningless: MTTDL, Markov models, and storage system reliability. HotStorage 2010.
W. Gu, Z. Kalbarczyk and R. K. Iyer. Error Sensitivity of the Linux Kernel Executing on PowerPC G4 and Pentium 4 Processors. DSN 2004
R. K. Iyer, L. T. Young and P. V. K. Iyer. Automatic Recognition of Intermittent Failures: An Experimental Study of Field Data. IEEE Trans. Computers 1990.
A. Reibman, R. Smith, and K. S. Trivedi. Markov and Markov Reward Model Transient Analysis: An Overview of Numerical Approaches. European Journal of Operational Research 1989.
D. M. Nicol, W. H. Sanders and K. S. Trivedi. Model-based evaluation: from dependability to security. IEEE Trans. Dependable Sec. Comput. 2004
Z. Kalbarczyk et al. Hierarchical Simulation Approach to Accurate Fault Modeling for System Dependability Evaluation. IEEE Trans. on Software Engineering 1999.
L. Wang et al. Modeling coordinated checkpointing for large-scale supercomputers. DSN 2005. (Oct 3rd) link
D. P. Siewiorek, R. Chillarege and Z. T. Kalbarczyk. Reflections on industry trends and experimental research in dependability. IEEE Trans. Dependable Sec. Comput. 2004
E. N. Elnozahy and J. S. Plank. Checkpointing for peta-scale systems: a look into the future of practical rollback-recovery. IEEE Trans. Dependable Sec. Comput. 2004

Blockchain and smart contracts

M. Herlihy. Blockchains From a Distributed Computing Perspective. CACM 2019.
E. Androulaki et al. Hyperledger Fabric: a distributed operating system for permissioned blockchains. EuroSys 2018.
S. Nakamoto. Bitcoin: A Peer-to-Peer Electronic Cash System. Bitcoin.org 2009.
Y. Wang et al. Formal Specification and Verification of Smart Contracts for Azure Blockchain. arXiv 2019.