Recommended Text (HJS): Mark D. Hill, Norman P. Jouppi and Gurindar S. Sohi (editors), Readings in Computer Architecture, Morgan Kaufmann, 2000.
Ronen, R., Mendelson, A., Lai, K., Lu, S-L., Pollack, F., and Shen, J., "Coming Challenges in Microarchitecture and Architecture," Proceedings of the IEEE, 89(3), 2001.
G.E. Moore, "Cramming More Components onto Integrated Circuits," Electronics, Apr. 1965. (HJS:56)
S. Mazor, "The History of the Microcomputer - Invention and Evolution," Proceedings of the IEEE, 83(12):1601-1608, 1995. (HJS:60)
David W. Wall, "Limits of Instruction Level Parallelism," Digital Western Research Laboratory Research Report 93/6, 1993. (Extended version of a paper that appeared in ASPLOS 1991; the appendix describes trace-based simulator design.)
B. Ramakrishna Rau and Joseph A. Fisher, "Instruction-Level Parallel Processing: History, Overview, and Perspective," The Journal of Supercomputing, 7, 9-50, 1993. (HJS:288)
James E. Smith and Gurindar S. Sohi, "The Microarchitecture of Superscalar Processors," Proceedings of the IEEE, 83(12):1609-1624, 1995.
R. Tomasulo, "An Efficient Algorithm for Exploiting Multiple Arithmetic Units," IBM Journal of Research and Development, 11:25-33, January 1967.
Y.N. Patt, W.W. Hwu, and M.C. Shebanow, "HPS, A New Microarchitecture: Rationale and Introduction," Proceedings of the 18th International Microprogramming Workshop, Asilomar, CA Dec. 1985, pp. 103-108. (HJS:238)
The following paper discusses many of the problems involved in predicting/fetching multiple instructions per cycle:
T. M. Conte, K. N. Menezes, P. M. Mills and B. A. Patel, "Optimization of instruction fetch mechanisms for high issue rates," ISCA 22, 1995.
This is the paper that describes both the gshare predictor and tournament prediction:
Scott McFarling, "Combining Branch Predictors," Digital Western Research Laboratory Technical Note TN-36, June 1993.
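For orientation before reading, here is a minimal sketch of the gshare idea: the branch address is XORed with a global history register to index a table of 2-bit saturating counters. The class name, table size, initial counter value, and the `pc >> 2` alignment shift are assumptions made for illustration, not details taken from the technical note. Tournament prediction (choosing between two predictors) is not shown.

```python
class GsharePredictor:
    """Illustrative gshare-style predictor (hypothetical names and sizes)."""

    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        # One 2-bit saturating counter per entry; 1 = weakly not-taken (assumed start state).
        self.counters = [1] * (1 << index_bits)
        # Global history: one bit per recent branch outcome, newest in the low bit.
        self.history = 0

    def _index(self, pc):
        # XOR the (word-aligned) branch address with the global history.
        return ((pc >> 2) ^ self.history) & self.mask

    def predict(self, pc):
        # Counter values 2 and 3 mean "predict taken".
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        # Shift the actual outcome into the history register.
        self.history = ((self.history << 1) | int(taken)) & self.mask
```

A branch that strictly alternates taken/not-taken is a useful test case: a single counter indexed by address alone keeps mispredicting it, while the history bits let this sketch separate the two situations into different counters.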
The following paper describes trace caching, a technique that attacks the problem of predicting and fetching multiple instructions per cycle:
Sanjay Patel, Daniel Holmes Friendly and Yale N. Patt, Evaluation of design options for the trace cache fetch mechanism, IEEE Transactions on Computers, 48(2), 1999.
The return address stack is described in the following paper by Kaeli and Emma:
David R. Kaeli and Philip G. Emma, Branch History Table Prediction of Moving Target Branches Due to Subroutine Returns, ISCA 18:34-42, 1991.
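As a rough picture of the general idea (not the specific design evaluated in the paper), a return address stack pushes the fall-through address at each call and pops it to predict the target of the matching return. The class name, the fixed depth, and the drop-oldest overflow policy below are assumptions made for illustration.

```python
class ReturnAddressStack:
    """Illustrative return address stack (hypothetical depth and overflow policy)."""

    def __init__(self, depth=8):
        self.depth = depth
        self.entries = []

    def on_call(self, return_address):
        # On overflow, discard the oldest entry (an assumed policy).
        if len(self.entries) == self.depth:
            self.entries.pop(0)
        self.entries.append(return_address)

    def predict_return(self):
        # Predict that a return goes to the most recently pushed address;
        # None means the stack is empty and no prediction is available.
        return self.entries.pop() if self.entries else None
```

With nested calls that push 0x100 and then 0x200, the first return is predicted to go to 0x200 and the second to 0x100. Calls nested deeper than `depth` lose their oldest entries, so the outermost returns in such a chain get no prediction.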
The problem of predictor interference and the "agree" predictor are discussed in the paper by Sprangle et al.:
Eric Sprangle, Robert S. Chappell, Mitch Alsup and Yale N. Patt; The Agree Predictor: A Mechanism for Reducing Negative Branch History Interference, ISCA, 24:284-291, 1997.
On Monday, Sep 13, I showed some pipeline diagrams from a paper about the IBM/Sony/Toshiba Cell Processor. That paper was:
J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maeurer, D. Shippy; Introduction to the Cell multiprocessor, IBM J. Res. & Dev, 49(4/5):589-604, 2005.
The MIPS R10000 was an early speculative superscalar microprocessor. The design was particularly "clean" and is the pipeline structure we are using as our "standard" pipeline in class. The paper is also particularly well written:
Kenneth C. Yeager, "The MIPS R10000 Superscalar Microprocessor," IEEE Micro, April 1996.
The Intel P6 architecture is probably the single most successful processor design of all time. The Pentium Pro, Pentium III and Pentium M are all versions of this same basic architecture.
D. B. Papworth, "Tuning the Pentium Pro Microarchitecture," IEEE Micro, 16(2):8-15, 1996.
Also: the two overview papers I pointed you at several weeks ago are very well written. If you're looking for more info about speculative execution, these are the best papers to start with:
B. Ramakrishna Rau and Joseph A. Fisher, "Instruction-Level Parallel Processing: History, Overview, and Perspective," The Journal of Supercomputing, 7, 9-50, 1993. (HJS:288)
James E. Smith and Gurindar S. Sohi, "The Microarchitecture of Superscalar Processors," Proceedings of the IEEE, 83(12):1609-1624, 1995.
Robert M. Keller: Look-Ahead Processors, ACM Comput. Surv. 7(4):177-195, 1975.
Shlomo Weiss, James E. Smith: Instruction Issue Logic in Pipelined Supercomputers, IEEE Trans. Computers 33(11):1013-1022, 1984.
Subbarao Palacharla, Norman P. Jouppi, James E. Smith: Complexity-Effective Superscalar Processors, ISCA 24:206-218, 1997.
The following frames the scalability problems with load/store queues extremely well:
Simha Sethumadhavan, Rajagopalan Desikan, Doug Burger, Charles R. Moore, Stephen W. Keckler: Scalable Hardware Memory Disambiguation for High ILP Processors, MICRO 36, 2003.
This paper describes "store sets," an extremely fruitful way to think about memory dependence problems:
George Z. Chrysos, Joel S. Emer: Memory Dependence Prediction using Store Sets, ISCA, 1998.
Here is a thoughtful and carefully done piece on the memory disambiguation problem:
Amir Roth: Store Vulnerability Window (SVW): Re-Execution Filtering for Enhanced Load Optimization, ISCA 32, 2005.
It turns out that it is relatively easy to avoid output dependences and anti-dependences between in-flight memory operations, allowing the store queue to be organized as a small "speculative cache":
Sam S. Stone, Kevin M. Woley, Matthew I. Frank: Address-Indexed Memory Disambiguation and Store-to-Load Forwarding, MICRO 38, 2005.