ECE 511 : Computer Architecture
Fall 2006, Readings

Recommended Text (HJS): Mark D. Hill, Norman P. Jouppi and Gurindar S. Sohi (editors), Readings in Computer Architecture, Morgan Kaufmann, 2000.

Lecture 2: Basic Notions

Ronen, R., Mendelson, A., Lai, K., Lu, S-L., Pollack, F., and Shen, J., "Coming Challenges in Microarchitecture and Architecture," Proceedings of the IEEE, 89(3), 2001.

G.E. Moore, "Cramming More Components onto Integrated Circuits," Electronics, Apr. 1965. (HJS:56)

S. Mazor, "The History of the Microcomputer - Invention and Evolution," Proceedings of the IEEE, 83(12):1601-1608, 1995. (HJS:60)

Lecture 3: Tools

David W. Wall, "Limits of Instruction-Level Parallelism," Digital Western Research Laboratory Research Report 93/6, 1993 (extended version of a paper that appeared in ASPLOS 1991; the appendix describes trace-based simulator design).

Lecture 4: Microarchitecture Overview

B. Ramakrishna Rau and Joseph A. Fisher, "Instruction-Level Parallel Processing: History, Overview, and Perspective," The Journal of Supercomputing, 7, 9-50, 1993. (HJS:288)

James E. Smith and Gurindar S. Sohi, "The Microarchitecture of Superscalar Processors," Proceedings of the IEEE, 83(12):1609-1624, 1995.

Supplemental Reading

R. Tomasulo, "An Efficient Algorithm for Exploiting Multiple Arithmetic Units," IBM Journal of Research and Development, 11:25-33, January 1967.

Y.N. Patt, W.W. Hwu, and M.C. Shebanow, "HPS, A New Microarchitecture: Rationale and Introduction," Proceedings of the 18th International Microprogramming Workshop, Asilomar, CA Dec. 1985, pp. 103-108. (HJS:238)

Lectures 5-6: Instruction Fetch

The following paper discusses many of the problems involved in predicting/fetching multiple instructions per cycle:

T. M. Conte, K. N. Menezes, P. M. Mills and B. A. Patel, "Optimization of instruction fetch mechanisms for high issue rates," ISCA 22, 1995.

This is the paper that describes both the gshare predictor and tournament prediction:

Scott McFarling, "Combining Branch Predictors," Digital Western Research Laboratory Technical Note TN-36, June 1993.
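As orientation before reading McFarling's note, here is a minimal sketch of the gshare idea: the branch address is XORed with a global history of recent branch outcomes, and the result indexes a table of 2-bit saturating counters. The class name, the table size, the initial counter value, and the `pc >> 2` alignment shift are assumptions chosen for illustration, not details taken from the paper.

```python
class GsharePredictor:
    """Illustrative gshare sketch (names and sizes are hypothetical)."""

    def __init__(self, index_bits=12):
        self.mask = (1 << index_bits) - 1
        self.history = 0                          # global branch history register
        self.counters = [1] * (1 << index_bits)   # 2-bit counters, start weakly not-taken

    def _index(self, pc):
        # Assumes 4-byte aligned instructions, so the low two PC bits are dropped.
        return ((pc >> 2) ^ self.history) & self.mask

    def predict(self, pc):
        # Counter values 2 and 3 mean "predict taken".
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        # Shift the actual outcome into the global history.
        self.history = ((self.history << 1) | int(taken)) & self.mask


def accuracy(predictor, pc, outcomes):
    """Fraction of outcomes predicted correctly, updating after each branch."""
    correct = 0
    for taken in outcomes:
        if predictor.predict(pc) == taken:
            correct += 1
        predictor.update(pc, taken)
    return correct / len(outcomes)
```

Because the history is part of the index, a single branch that strictly alternates between taken and not-taken maps to two different counters and becomes predictable once the predictor has warmed up. A tournament predictor, also covered in the note, would add a chooser table that selects between two such component predictors per branch.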

Supplemental Reading

The following paper describes trace caching, a technique that attacks the problem of predicting and fetching multiple instructions per cycle:

Sanjay Patel, Daniel Holmes Friendly and Yale N. Patt, "Evaluation of design options for the trace cache fetch mechanism," IEEE Transactions on Computers, 48(2), 1999.

Other Papers I mentioned in class:

The return address stack is described in the following paper by Kaeli and Emma:

David R. Kaeli and Philip G. Emma, "Branch History Table Prediction of Moving Target Branches Due to Subroutine Returns," ISCA 18:34-42, 1991.

The problem of predictor interference and the "agree" predictor are discussed in the paper by Sprangle et al.:

Eric Sprangle, Robert S. Chappell, Mitch Alsup and Yale N. Patt, "The Agree Predictor: A Mechanism for Reducing Negative Branch History Interference," ISCA 24:284-291, 1997.

On Monday, Sep 13 I showed some pipeline diagrams from a paper about the IBM/Sony/Toshiba Cell Processor. That paper was:

J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maeurer and D. Shippy, "Introduction to the Cell multiprocessor," IBM J. Res. & Dev., 49(4/5):589-604, 2005.

You still want to read MORE?

Here is a page I put together with even more information about high throughput instruction fetch.

Lectures 7,8: Speculative Execution

The MIPS R10000 was an early speculative superscalar microprocessor. Its design is particularly "clean," and its pipeline is the structure we use as our "standard" pipeline in class. The paper is also particularly well written:

Kenneth C. Yeager, "The MIPS R10000 Superscalar Microprocessor," IEEE Micro, April 1996.

The Intel P6 architecture is probably the single most successful processor design of all time. The Pentium Pro, Pentium III and Pentium M are all versions of this same basic architecture.

D. B. Papworth, "Tuning the Pentium Pro Microarchitecture," IEEE Micro, 16(2):8-15, 1996.

Also: the two overview papers I pointed you at several weeks ago are very well written. If you're looking for more info about speculative execution, these are the best papers to start with:

B. Ramakrishna Rau and Joseph A. Fisher, "Instruction-Level Parallel Processing: History, Overview, and Perspective," The Journal of Supercomputing, 7, 9-50, 1993. (HJS:288)

James E. Smith and Gurindar S. Sohi, "The Microarchitecture of Superscalar Processors," Proceedings of the IEEE, 83(12):1609-1624, 1995.

Lectures 9,10: Out-of-order issue

Here are three very well written classics:

Robert M. Keller: Look-Ahead Processors, ACM Comput. Surv. 7(4):177-195, 1975.

Shlomo Weiss, James E. Smith: Instruction Issue Logic in Pipelined Supercomputers, IEEE Trans. Computers 33(11):1013-1022, 1984.

Subbarao Palacharla, Norman P. Jouppi, James E. Smith: Complexity-Effective Superscalar Processors, ISCA 24:206-218, 1997.

Click here to go to a page with an excruciatingly large number of references about out-of-order instruction scheduling.

Lectures 11,12: Memory Dataflow

The following paper frames the scalability problems with load/store queues extremely well:

Simha Sethumadhavan, Rajagopalan Desikan, Doug Burger, Charles R. Moore, Stephen W. Keckler: Scalable Hardware Memory Disambiguation for High ILP Processors, MICRO 36, 2003.

This paper describes "store sets," an extremely fruitful way to think about memory dependence problems:

George Z. Chrysos, Joel S. Emer: Memory Dependence Prediction using Store Sets, ISCA, 1998.

Here is a thoughtful and carefully done piece on the memory disambiguation problem:

Amir Roth: Store Vulnerability Window (SVW): Re-Execution Filtering for Enhanced Load Optimization, ISCA 32, 2005.

It turns out that it is relatively easy to avoid output and anti-dependences between in-flight memory operations, which allows the store queue to be organized as a small "speculative cache":

Sam S. Stone, Kevin M. Woley, Matthew I. Frank: Address-Indexed Memory Disambiguation and Store-to-Load Forwarding, MICRO 38, 2005.
