Contributor: Charles Clancy tclancy@uiuc.edu
Q: Which of the following dependencies can cause potential problems for an
out-of-order processor, but not an in-order pipelined processor?
a. write-after-write
b. read-after-write
c. write-after-read
d. read-after-read
Q: Refer to "Coming Challenges in Microarchitecture and Architecture".
What is instruction reuse?
a. an attempt to increase the instruction supply by internally buffering
instructions that could possibly be part of a loop
b. caching the operands and results of long-latency instructions for
later table lookup, which avoids repeating the long computations
c. a compiler optimization to decrease code size by locating blocks of
identical instructions and replacing them with function calls
d. a compiler optimization to increase efficiency by unrolling loops,
preventing the execution of extra branch instructions.
Contributor: Weining Gu wgu@uiuc.edu
Q: Which of the following statements are true?
a. Memory traffic is the sum of the instruction traffic plus the data
traffic. [Flynn 1987] shows that in order to reduce the relative
traffic, it is undesirable to increase both the register set size
and the instruction complexity.
b. [Colwell 1985] claims that as chip technology improves, the
tradeoffs between instruction set complexity and architecture/
implementation features become more constrained.
c. The Branch Target Buffer (BTB) [Lee 1984] functions partly as follows:
during an instruction's execution stage, the instruction's address
is compared against the instruction addresses in the BTB. If there is a
match, the branch target address in the BTB can be used, depending on the
result of branch prediction.
d. In the course of an instruction's execution, the branch history
table is updated when the branch predicate is resolved (usually in
the decode or execute stage), which is when we find out whether the
branch was actually taken or not.
Q: Which of the following statements are true?
a. A higher set associativity in a cache improves the hit rate because
it reduces conflict misses, but it might also increase the hit time.
b. The Tomasulo algorithm does not need to check for RAW dependencies
because these dependencies are eliminated by the register renaming
mechanism.
c. In [Tomasulo 1967], the Common Data Bus (CDB) is fed by many sources. The
source that first requests the CDB gets the right to outgate its
result and broadcast the tag to all reservation stations. No central
priority circuit is needed.
d. When dynamic instruction scheduling is used, a considerable
investment in hardware is required to analyze data dependencies
while the program is running.
Contributor: Daniel Herring dherring@uiuc.edu
Q: Given a banked cache design of M banks and an N-bit data/instruction fetch per cycle, what is the minimum bank width in bits? Justify.
1.) N/M
2.) (N-1)/M
3.) N/(M-1)
4.) (N-1)/(M-1)
Q: Do the following arguments validly support a single-chip-single-processor design? Explain.
1.) Memory bandwidth is a limiting factor in computer design. [y/n]
2.) As technology improves, each chip will hold more transistors. [y/n]
3.) This design type is easier than chip multiprocessor designs. [y/n]
Contributor: Geoff Kent gkent@uiuc.edu
Q: Given an N-bit global history register and a 2^N entry counter table
for branch prediction, what is used to index the table if g-share is the
chosen indexing method?
a) A concatenation of the M most recent bits of the GHR with the lower N-M bits of the program counter
b) An XOR of the N-bit GHR and the lower N bits of the PC
c) An XOR of the N-bit GHR and the middle N bits of the PC
d) The N bits of the GHR
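The indexing choices in this question can be compared in code. The following is a minimal sketch of XOR-style indexing as g-share is usually described, next to history-only indexing. The function names are hypothetical, and the sketch assumes 0 < n_bits < 32 and that the PC is used as given (real designs often drop the low alignment bits first).

```c
#include <stdint.h>

/* Index into a 2^n_bits-entry counter table by XORing the global
 * history register with the PC and keeping the low n_bits.
 * Assumes 0 < n_bits < 32; ghr and pc are hypothetical inputs. */
uint32_t gshare_index(uint32_t ghr, uint32_t pc, unsigned n_bits)
{
    uint32_t mask = (UINT32_C(1) << n_bits) - 1u;
    return (ghr ^ pc) & mask;
}

/* For comparison: indexing with the history alone ignores the PC,
 * so two branches that see the same history share one counter. */
uint32_t ghr_only_index(uint32_t ghr, unsigned n_bits)
{
    uint32_t mask = (UINT32_C(1) << n_bits) - 1u;
    return ghr & mask;
}
```

With the same history value, two different PCs map to different entries under gshare_index but to the same entry under ghr_only_index.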
Q: If a 3-bit history table is used in a two-level local history branch
prediction scheme, which of the following pairs of loop-branch taken
patterns will cause branch prediction interference in the counter table?
a) (0110)^n - (1110)^n
b) (0001)^n - (1101)^n
c) (1011)^n - (0101)^n
d) (1011)^n - (1101)^n
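One way to check a pair by hand is to list, for each pattern, which outcome follows each 3-bit history, and then look for a history that both patterns produce but that is followed by different outcomes. The sketch below mechanizes that check. It assumes both branches share a single counter table indexed only by the 3-bit local history, and all names in it are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

#define HIST_BITS 3
#define NUM_HIST (1 << HIST_BITS)

/* Fill next[h] with the outcome that follows history h when the branch
 * repeats `pattern` (array of 0/1, length len) forever:
 * -1 = history never occurs, 0/1 = outcome, 2 = both outcomes occur. */
static void next_outcomes(const int *pattern, size_t len, int next[NUM_HIST])
{
    for (int h = 0; h < NUM_HIST; h++)
        next[h] = -1;
    for (size_t start = 0; start < len; start++) {
        int h = 0;
        /* History = the HIST_BITS outcomes beginning at start, oldest first. */
        for (size_t k = 0; k < HIST_BITS; k++)
            h = (h << 1) | pattern[(start + k) % len];
        int outcome = pattern[(start + HIST_BITS) % len];
        if (next[h] == -1)
            next[h] = outcome;
        else if (next[h] != outcome)
            next[h] = 2;
    }
}

/* True if some 3-bit history occurs in both patterns but is followed by
 * different outcomes, so the two branches fight over one counter. */
bool patterns_interfere(const int *a, size_t len_a, const int *b, size_t len_b)
{
    int na[NUM_HIST], nb[NUM_HIST];
    next_outcomes(a, len_a, na);
    next_outcomes(b, len_b, nb);
    for (int h = 0; h < NUM_HIST; h++)
        if (na[h] != -1 && nb[h] != -1 && (na[h] != nb[h] || na[h] == 2))
            return true;
    return false;
}
```

Each option can be tested by writing its two patterns as arrays of 0s and 1s (for example {0,1,1,0} and {1,1,1,0}) and passing them with their lengths.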
Contributor: Tyler Ralston tralston@uiuc.edu
Q: What are the advantages of Dynamic Scheduling?
a. It enables handling some cases when dependences are unknown at compile time
b. It allows the compiler to have a more complicated design.
c. It allows code compiled for one pipeline to run efficiently on a different
pipeline.
d. It tries to minimize stalls by separating dependent instructions so that
they will not lead to hazards.
Q: Some of the challenges of multiple issue with Tomasulo's algorithm are:
a. issuing of the instructions
b. monitoring the CDBs for instruction completion.
c. hazards of speculating when there are data-dependent branches.
d. maintaining a throughput of more than one instruction per cycle.
Contributor: Amit Patel ajpatel2@uiuc.edu
Q: Memory-memory instruction set architectures, such as VAX, are rarely used or implemented today for the following reasons:
A. large memory operand address specifiers force code size to be larger and
less compact in general compared to register-memory architectures.
B. a high number of memory accesses creates a memory bottleneck.
C. potentially more work per instruction forces a higher CPI compared to
both register-memory and register-register architectures, thus
making these CPUs non-marketable.
D. instruction format tends to be restrictive and difficult to encode for
assembly programmers and compiler writers.
Q: Negative interference among Global History table entries is a major
problem. Which of the following approaches decreases negative interference
or avoids its effect?
A. use a set-associative BTB cache design as opposed to a direct-mapped BTB cache design.
B. increase the recorded history and pattern history table size.
C. increase the BTB size to at least the pattern history table size.
D. map instructions to the pattern history table based on current history and
the PC.
Contributor: Galen Rasche rasche@uiuc.edu
Q: Which of the following statements about cache performance is/are true?
A. Bus turnaround plays a more significant role in performance than
the number of banks attached to the memory channel.
B. Optimal burst size is primarily dependent on channel width.
C. A set-associative cache incurs fewer conflict misses than a direct-mapped cache.
D. Increasing line size always leads to increased performance, but is
generally not used because it decreases the feasible number of banks.
Q: Which of the following directly influences the number of stalls in a
pipeline?
A. Simultaneous multithreading
B. Soft error rate
C. Memory-level parallelism
D. The use of SIMD extensions
Contributor: David Schmitt deschmit@uiuc.edu
Q: Intel's trace-based instruction caching patent describes how cached
instructions are placed in a data array. In the example given in the patent,
the predecessor and successor trace segments are placed in sets
((x-1) modulo S) and ((x+1) modulo S) respectively (x is the set number and
S is the total number of sets). The predecessor and successor segments may
also be placed in any one of 4 possible ways (or banks). To do this, the tag
must hold four extra bits: two bits point to the predecessor way and two
bits point to the successor way. Why not save space and restrict the
predecessor and successor ways to ((current_way - 1) modulo 4) and
((current_way + 1) modulo 4) respectively?
A) The control for this logic is more complicated.
B) If ((current_way + 1) modulo 4) is already full, the current trace
cannot grow unless the whole new trace is moved to all-new
locations or the old trace is destroyed.
C) Restricting the predecessor and successor ways decreases the
probability of finding large runs of free trace segments. This is
much like the fragmentation we get when main memory is not
paged.
D) Intel's engineers were out drinking the night before they came
up with the design.
Q: If DRAM addresses are divided into four groups of bits, what is the order?
A)
----------------------------------------------
| page offset | bank | channel | row |
----------------------------------------------
B)
----------------------------------------------
| row | bank | channel | page offset |
----------------------------------------------
C)
----------------------------------------------
| row | channel | bank | page offset |
----------------------------------------------
D)
----------------------------------------------
| bank | channel | row | page offset |
----------------------------------------------
Contributor: Nicholas Wang nwang@crhc.uiuc.edu
Q: What kind of branch predictor (local or global history) are each of the
branches in the code fragment below best suited for? In other words,
local and global history branch predictors are designed to take advantage of
certain properties of branches. Which branch predictor is designed to take
advantage of the properties exhibited by the branches below? Assume that all "if", "for", and "while" statements are branches.
d = 1;
do {
    for (i = 0; i < 5; i++) {            /* branch (a) */
        random = Generate_Random_Number();
        d = d + array[random];
        if (array[random] < 0)
            printf("Negative!\n");
        if (array[random] >= 0)          /* branch (b) */
            printf("Not Negative!\n");
    }
    if (d < 0)
        printf("Not done yet\n");
    if (0 <= d && d < 20)
        printf("Getting closer\n");
    if (d >= 20)
        printf("We're done!!!!!\n");
    if (i == 5)                          /* branch (c) */
        printf("I am highly predictable\n");
} while (d < 20);                        /* branch (d) */
Q: Increasing the length of the global history register in a global branch
prediction scheme will eventually decrease performance. Why? Assume that all the bits in the GHR are used to index the pattern history table.
a) the pattern history table will take longer to warm up
b) speculative update of the global history register is less effective as the
history length is increased
c) speculative update of a larger pattern history table creates interference
d) the additional global history information obtained generally isn't highly
correlated to the branch in question
Contributor: Ilyas Ayub ayub@uiuc.edu
Q: We have two processors, A and B. B's frequency is twice that of A,
and B's operating voltage is half of A's. Everything else about the two
processors is the same. What is the power consumption of B?
a. half of A's power consumption
b. twice A's power consumption
c. the same as A's power consumption
d. four times A's power consumption
Contributor: Lee Baugh leebaugh@students.uiuc.edu
Q: Between two adjacent levels of cache, Lk and Lk+1, what differences
*could* affect the inclusion properties of the memory system?
1. cache line size
2. number of ports
3. set associativity
4. number of blocks
Q: Given a 4-way set-associative unified cache, which of the following
statements is true?
1. a 4-entry fully associative miss cache may improve memory performance
2. a 2-entry fully associative victim cache may improve memory performance
3. dividing the cache into two banks may improve memory performance
4. dividing the cache into two banks may improve fetch bandwidth
Contributor: Steve Lindemann slindema@students.uiuc.edu
Q: In "And Now a Case for More Complex Instruction Sets", Flynn et al.
conclude that adding support for (some) register-memory instructions and
half-length instructions is useful. What are the primary reasons they give
for this?
1) The number of cycles required to execute the program is decreased due
to more powerful instructions.
2) Memory traffic (bytes transferred, cache misses) is reduced due to
shorter program length.
3) Register-memory operations which operate in conjunction with branches
can save cycles.
4) Compilation time is reduced due to a richer instruction set.
Q: Characteristics of a "Billion Transistor Processor" include:
1) Over half of its transistors devoted to a second-level cache.
2) A multilevel, pipelined trace cache.
3) A Multi-Hybrid branch predictor with high enough accuracy that
compiler "hints" are unnecessary.
4) A first-level cache which is replicated to support multiporting
requirements.
Contributor: Rob Mihalko rjmihalk@students.uiuc.edu
Q: A 1.80 GHz Intel Pentium 4 processor operates at 1.75V, is manufactured
using a 0.18-micron process, and consumes 66.1W of power. Find the savings in
power consumption if the operating voltage were reduced to 1.65V.
a) 9.83W
b) 7.34W
c) 3.78W
d) 12.34W
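The arithmetic behind this kind of question can be sketched with the usual dynamic-power model, in which switching power is proportional to C * V^2 * f. The sketch below assumes capacitance and switching activity are unchanged and ignores leakage, none of which the question states explicitly. The function names are hypothetical.

```c
/* Dynamic (switching) power model: P is proportional to C * V^2 * f.
 * Assumption: capacitance and activity are unchanged and leakage is
 * ignored, so only the voltage and frequency ratios matter. */
double scaled_dynamic_power(double p_old, double v_old, double v_new,
                            double f_old, double f_new)
{
    double v_ratio = v_new / v_old;
    double f_ratio = f_new / f_old;
    return p_old * v_ratio * v_ratio * f_ratio;
}

/* Power saved by moving from (v_old, f_old) to (v_new, f_new). */
double dynamic_power_savings(double p_old, double v_old, double v_new,
                             double f_old, double f_new)
{
    return p_old - scaled_dynamic_power(p_old, v_old, v_new, f_old, f_new);
}
```

For the question above, the call would be dynamic_power_savings(66.1, 1.75, 1.65, 1.8, 1.8), which holds the clock frequency fixed while the voltage drops.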
Contributor: Naveen Neelakantam neelakan@students.uiuc.edu
Q: Which of the following design philosophies characterize a RISC?
a) Maintaining binary compatibility between processor generations.
b) Creating an ISA composed of a small set of simple instructions.
c) Providing a large number of architectural registers to the
compiler/programmer.
d) Defining instruction format so that decoding is simplified in hardware.
Q: You are charged with designing the memory interface for a new
shared-bus multiprocessor computer system. Your goal is to tune performance
for a given set of applications. What application characteristics will help
you make your decision? Assume that total memory bandwidth is fixed.
a) The read/write patterns of the applications.
b) The number of processors in the system.
c) The size of the application working sets.
d) Data locality in the applications.
Contributor: Russell Schreiber rschreib@uiuc.edu
Q: Increasing the length of cache lines has the following results:
a) increase compulsory misses
b) decrease compulsory misses
c) increase conflict misses
d) decrease conflict misses
Q: Ways to improve trace cache performance (effective fetch rate) by a significant amount are:
a) increase the speed at which the fill unit assembles trace
cache lines
b) allow the processor to issue instructions from the trace
cache that are not on the predicted path
c) allow the processor to issue part of a trace cache line
if certain segments of the line do not match the
predicted execution path
d) make the trace cache lines shorter so that fewer branches need to be predicted each cycle
Contributor: Fabrice Stevens
Q: What's the difference between a 1-address machine and a 2-address
machine?
Q: If we have two loop branches with taken patterns (1001)^n and (0011)^n,
what is the smallest number of history bits we have to use in order to avoid interference and conflicts in the counter table? (We're working with local history tables.)
Contributor: Charles Vitu cvitu@students.uiuc.edu
Q: According to David Patterson, why did CISCs dominate early architectures?
A. CISCs are easier to design than RISCs
B. CISC compilers are easier to design than RISC compilers
C. Control Memory (ROM) is more expensive, in terms of chip area, than Register Memory
D. Program execution time was thought to be proportional to program size, therefore smaller CISC programs would run faster
Q: Given equally sized counter tables, a global history branch prediction
scheme is better than a local history branch prediction scheme in which
of the following ways?
A. Global history prediction is more accurate
B. A global history prediction scheme takes less chip area
C. The "warm-up" time of a global history prediction scheme is not as long
D. Global history is less likely to be affected by interference
Contributor: Ritu Gupta rgupta5@crhc.uiuc.edu
Q: In general, the size of the I-cache is determined by the cost of
implementation for a particular application. However, can the I-cache
size be increased indefinitely? Is the same trend seen with respect to
the trace cache?
Q: The three main factors that affect the delivery of the fetch mechanism
are the effective fetch rate, the total cache miss penalty, and the total
branch misprediction penalty. Are all three affected positively by
introducing partial fetch and inactive issue? (This question can also
be formulated as a multiple-choice question.)
Contributor: Chao Huang chuang10@uiuc.edu
Q: In gshare, what is the advantage of XORing part of the branch
address (PC) with part of the history buffer to index the Pattern
History Table?
Q: Which of the following is/are true about a victim cache?
a. It adds associativity to a direct-mapped cache
b. It performs better on a D-cache than on an I-cache
c. As cache line size increases, more misses can be removed
d. An L1 victim cache can reduce L2 conflict misses
Contributor: Brian Lam lam1@students.uiuc.edu
Q: Instruction supply can be improved by reducing the number of
stalls or by increasing the fetch bandwidth. Which of the following
techniques reduce the number of stalls by reducing the branch
misprediction penalty?
a) Decoded Instruction Cache
b) Trace Cache
c) Eager Execution
d) Stream Buffers
Q: Which of the following are common characteristics associated
with RISC?
a) Minimal Amount of Addressing modes
b) High Instruction Traffic
c) High CPI
d) Easy to Pipeline
Contributor: Qilun Liu qilunliu@students.uiuc.edu
Q: 2-bit branch prediction can: ____
A. Reduce pipeline stalls in loops
B. Reduce pipeline stalls for a branch
C. Detect illegal instructions
D. Be a mechanism used in the IF stage
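To make the mechanism concrete, here is a minimal sketch of a 2-bit saturating counter predictor and a helper that counts mispredictions over a sequence of branch outcomes. The state encoding (0 and 1 predict not taken, 2 and 3 predict taken) is the common textbook convention and is assumed here. The function names are hypothetical.

```c
#include <stddef.h>

/* 2-bit saturating counter: states 0,1 predict not taken; 2,3 predict taken. */
static int predict_taken(int state) { return state >= 2; }

/* Move one step toward 3 on a taken branch, toward 0 otherwise. */
static int update(int state, int taken)
{
    if (taken)
        return state < 3 ? state + 1 : 3;
    return state > 0 ? state - 1 : 0;
}

/* Count mispredictions over a sequence of outcomes (1 = taken,
 * 0 = not taken), starting from counter state init (0..3). */
size_t count_mispredictions(const int *outcomes, size_t n, int init)
{
    int state = init;
    size_t misses = 0;
    for (size_t i = 0; i < n; i++) {
        if (predict_taken(state) != outcomes[i])
            misses++;
        state = update(state, outcomes[i]);
    }
    return misses;
}
```

Feeding it the outcomes of a loop branch, such as four taken iterations followed by one not-taken exit, repeated, shows how the counter behaves across successive executions of the loop.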
Q: The key differences between the scoreboard and Tomasulo's algorithm are:
A. The scoreboard is a static scheduling algorithm, but Tomasulo's is a
dynamic scheduling algorithm.
B. The scoreboard stalls execution to resolve WAR and WAW hazards, but
Tomasulo's uses register renaming to avoid them.
C. Tomasulo's uses reservation stations to indicate the status of
registers and functional units, but the scoreboard doesn't.
D. The scoreboard is a centralized control scheme, but Tomasulo's is a
distributed one.
Contributor: Esther Resendiz eresendi@uiuc.edu
Q: Which of the following is(are) TRUE of Register Renaming?
A. Register Renaming eliminates register Write After Read
(WAR) hazards.
B. Register Renaming eliminates register Read After Write
(RAW) hazards.
C. The number of physical register names can be larger
than the total number of logical register names.
D. Suppose a logical and physical register correspond to
each other. As soon as the logical register is
renamed by an instruction, the physical register
is freed and can be reclaimed by a subsequent
instruction.
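The renaming mechanism these statements refer to can be sketched as a logical-to-physical map that is consulted for source operands and updated for destinations. The sketch below is a simplification under stated assumptions: the register counts are arbitrary, all type and function names are hypothetical, and physical registers are never reclaimed, so it does not model when a register may be freed.

```c
#define NUM_LOGICAL  4
#define NUM_PHYSICAL 8

/* Hypothetical rename state: a logical-to-physical map plus a bump
 * allocator standing in for a free list (registers are never reclaimed). */
typedef struct {
    int map[NUM_LOGICAL];
    int next_free;
} RenameTable;

typedef struct { int dst, src1, src2; } Instr;

void rename_init(RenameTable *rt)
{
    for (int r = 0; r < NUM_LOGICAL; r++)
        rt->map[r] = r;            /* logical r starts in physical r */
    rt->next_free = NUM_LOGICAL;
}

/* Rename one instruction in program order. Sources read the current
 * mapping, which preserves true RAW dependences. The destination gets
 * a fresh physical register, which removes WAR and WAW name conflicts.
 * Returns 0 on success, -1 if no physical register is free. */
int rename_instr(RenameTable *rt, const Instr *in, Instr *out)
{
    if (rt->next_free >= NUM_PHYSICAL)
        return -1;
    out->src1 = rt->map[in->src1];
    out->src2 = rt->map[in->src2];
    out->dst = rt->next_free++;
    rt->map[in->dst] = out->dst;
    return 0;
}
```

Renaming a short sequence in which a later instruction overwrites a register that an earlier one reads or writes shows which dependences survive renaming and which disappear.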
Q: Which of the following is(are) TRUE of DRAMs?
A. Asynchronous DRAMs provide tighter timings than
synchronous DRAMs.
B. Most DRAMs used today are synchronous.
C. A write to DRAM is more of a design constraint than a
read.
D. Direct RDRAM is fast, but at the expense of a high pin
count.
Contributor: Jeff Stine jstine@uiuc.edu
Q: The advantages of using a first level victim cache are:
a) Increased cache performance without increasing the critical
path.
b) Better miss prevention when using longer cache lines.
c) Reduction of second level conflict misses.
d) Satisfies the inclusion property.
Q: Well-established methods for reducing microprocessor power
consumption include:
a) Trading lower clock frequency for increased IPC.
b) Waterfalling.
c) Lowering supply voltage.
d) Increasing the amount of speculative execution.
Contributor: Ying Wang
Q: Choose the correct statement about two-level adaptive branch prediction.
A) PAg and PAp have their own PHT for each static conditional branch.
B) PAp can completely avoid interference.
C) PAp has the highest prediction rate for a given history register length.
D) PAg needs the least hardware for 97% accuracy.
Q: Which of the following statements about the BTB are reasonable?
A) A BTB entry can consist of a branch tag, a branch target address, and
prediction information.
B) The BTB is accessed using the current instruction address.
C) The BTB provides only one prediction per entry.
D) The P6 family has twice as many BTB entries as the PPC620.
Contributor: Anand Shukla ashukla@uiuc.edu
Q: Superscalars and VLIW processors have their own advantages and
pitfalls. As a designer, one might have to choose one over the other
based on certain constraints:
1. For programs with high inherent ILP, a superscalar would end up
consuming more power and using more complex logic than a VLIW.
2. For programs with irregular (and intensive) control flow, a
superscalar can extract better performance than a simple VLIW.
3. Given the same number of ALUs, a VLIW will take less chip area than a
superscalar.
4. A program compiled for a VLIW will occupy more memory than if
it were compiled for a superscalar.
Q: Bank conflict problem: Consider a non-blocking cache having 4 banks,
where the bank for a given address is decided by looking at the last 2
bits in the address. Now consider a two dimensional array A with
dimensions 4x8. The elements A[0][0], A[1][0], A[2][0], A[3][0] would
end up hitting the same bank. This problem will not arise if:
1. We use (2^n)+1 dimensions in the array
2. We use odd-numbered dimensions for the array and an even number of
banks
3. We use an odd number of memory banks, so the above problem will never
occur
4. We use a prime number of memory banks, which gives unique
(Bank, Offset) coordinates for any address
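The bank mapping in this question can be explored with a small helper that counts how many distinct banks the elements A[0][0] through A[rows-1][0] land in. This is a sketch under assumptions not stated in the question: the array is stored in row-major order starting at address 0, addresses are counted in elements rather than bytes, and the bank is the address modulo the number of banks (which matches "the last 2 bits" for 4 banks). The function name is hypothetical.

```c
#include <stddef.h>

/* Count how many distinct banks are touched by column 0 of a row-major
 * 2-D array, A[0][0] .. A[rows-1][0], when bank = element address mod
 * num_banks. Assumptions: the array starts at address 0, addresses are
 * in units of elements, and num_banks <= 64. */
size_t banks_touched_by_column(size_t rows, size_t row_len, size_t num_banks)
{
    int seen[64] = {0};
    size_t distinct = 0;
    for (size_t i = 0; i < rows; i++) {
        size_t bank = (i * row_len) % num_banks;
        if (!seen[bank]) {
            seen[bank] = 1;
            distinct++;
        }
    }
    return distinct;
}
```

Calling it with the 4x8 array and 4 banks from the question, and then varying either the row length or the number of banks, shows how each proposed remedy changes the spread of accesses across banks.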
Q: Modern processors need a supply of instructions at a high rate. This
is accomplished by:
1. Multi-level predictors and Branch Target buffers, as they aid in
direction and target prediction
2. Generating an instruction sequence continuous in time despite control
flow transfers using Trace Cache
3. Using multiple sequencers to fetch more aggregate instructions per
cycle, using Polypath execution and Speculative threads
4. Using direct-mapped, wide, heavily sub-banked L1 and L2 caches, so
that there are fewer conflict misses