David A. Patterson, "Reduced Instruction Set Computers," Communications of the ACM, 28(1), pp. 8-21, January 1985.
R. Ronen, A. Mendelson, K. Lai, S.-L. Lu, F. Pollack, and J. Shen, "Coming Challenges in Microarchitecture and Architecture," Proceedings of the IEEE, 89(3), March 2001.
Johnny K.F. Lee and Alan Jay Smith, "Branch Prediction Strategies and Branch Target Buffer Design," IEEE Computer, January 1984. (This is a 17MB file; it may take a while to download and print.)
Scott McFarling, "Combining Branch Predictors," Digital Western Research Laboratory Technical Note TN-36, June 1993.
Mark D. Hill and Alan Jay Smith, Evaluating Associativity in CPU Caches, IEEE Transactions on Computers, 38(12), 1989. (This paper introduced the "3 Cs" model of cache misses: compulsory, capacity and conflict. Coherence misses were added later as a fourth "C".)
Norman P. Jouppi, Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers, ISCA, 1990.
James E. Smith and Gurindar S. Sohi, The Microarchitecture of Superscalar Processors, Proceedings of the IEEE, vol. 83, pp. 1609-1624, Dec. 1995.
Kenneth C. Yeager, The MIPS R10000 Superscalar Microprocessor, IEEE Micro, April 1996.
R. M. Tomasulo, An Efficient Algorithm for Exploiting Multiple Arithmetic Units, IBM Journal of Research and Development, 11(1), 1967.
Robert M. Keller, Look-Ahead Processors, ACM Computing Surveys, 7(4), 1975.
B. Ramakrishna Rau and Joseph A. Fisher, Instruction-Level Parallel Processing: History, Overview, and Perspective, The Journal of Supercomputing, 7, 9-50, 1993.
David I. August, Daniel A. Connors, Scott A. Mahlke, John W. Sias, Kevin M. Crozier, Ben-Chung Cheng, Patrick R. Eaton, Qudus B. Olaniran, and Wen-mei W. Hwu, Integrated Predicated and Speculative Execution in the IMPACT EPIC Architecture, Proceedings of the 25th International Symposium on Computer Architecture, July 1998.
Intel's page for the Intel Itanium Architecture Software Developer's Manual.
Dan Ernst, Andrew Hamel, and Todd Austin, Cyclone: A Broadcast-Free Dynamic Instruction Scheduler with Selective Replay, ACM/IEEE 30th Annual International Symposium on Computer Architecture (ISCA-2003), June 2003.
David F. Bacon, Susan L. Graham and Oliver J. Sharp, Compiler Transformations for High-Performance Computing, ACM Computing Surveys, 26(4), 1994. (Sections 6.2.1, 6.2.7 and 6.4.1.)
Dean M. Tullsen, Susan J. Eggers, Joel S. Emer, Henry M. Levy, Jack L. Lo and Rebecca L. Stamm, Exploiting choice: instruction fetch and issue on an implementable simultaneous multithreading processor, ISCA, 1996.
Sanjay Patel, Daniel Holmes Friendly and Yale N. Patt, Evaluation of design options for the trace cache fetch mechanism, IEEE Transactions on Computers, 48(2), 1999.