The Journal of Instruction-Level Parallelism
Description of the Simulation Infrastructure

The provided simulation framework is based on the CMP$im simulator. The framework models a simple out-of-order core with the following basic parameters:

o 128-entry instruction window with no scheduling restrictions (i.e., any instruction that is ready in the window can be scheduled out of order).
o The processor has a 15-stage, 4-wide pipeline. A maximum of two loads and a maximum of one store can be issued every cycle.
o Perfect branch prediction (i.e., no front-end or fetch hazards).
o All instructions have a one-cycle latency except for cache misses. L1 cache misses (L2 cache hits) take 20 cycles; L2 cache misses (memory requests) take 200 cycles. Hence, the total round-trip time from processor to memory is 220 cycles.
o The memory model is a two-level cache hierarchy consisting of an L1 data cache and an L2 data cache. The L1 data cache is 32 KB, 8-way set-associative with LRU replacement. The L2 data cache is 16-way set-associative with LRU replacement; its size depends on the configurations specified later in this document.
o Each cache is coupled with a queue for storing outstanding requests to the next level in the hierarchy. L1 misses (L2 requests) are queued in an L1 queue; L2 misses (memory requests) are queued in an L2 queue. Both demand and prefetch requests to the L1 and the L2 are placed in these queues in FIFO order.
o Prefetches are not prioritized over demand requests, except when they are issued in the same cycle, in which case demand requests are inserted earlier in the queue. The bandwidth of these queues can be restricted or infinite, as specified later in this document.
o Prefetches are issued for whole cache blocks.
o When a block comes into the cache, it replaces the LRU block, whether the request is a demand request or a prefetch request.
o Physical addresses are not used by the prefetcher. Only virtual addresses can be requested.
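For reference, the parameters above can be summarized as constants, as in the sketch below. The identifier names are illustrative only and are not the names used by the framework's header file.

    /* Illustrative constants summarizing the modeled core and memory hierarchy.
     * These names are hypothetical; the framework's own header may differ. */
    #define INSTR_WINDOW_SIZE     128          /* entries, no scheduling restrictions   */
    #define PIPELINE_STAGES        15
    #define ISSUE_WIDTH             4          /* instructions per cycle                */
    #define MAX_LOADS_PER_CYCLE     2
    #define MAX_STORES_PER_CYCLE    1

    #define L1D_SIZE_BYTES   (32 * 1024)       /* 32 KB, 8-way, LRU                     */
    #define L1D_ASSOC               8
    #define L2_ASSOC               16          /* L2 size depends on the configuration  */

    #define L1_MISS_LATENCY        20          /* cycles: L1 miss that hits in the L2   */
    #define L2_MISS_LATENCY       200          /* cycles: L2 miss serviced by memory    */
    /* Round trip from the processor to memory: 20 + 200 = 220 cycles. */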
We will distribute the framework as a library that includes a header file to which the contestant will add his/her prefetcher code. The prefetcher API will have access to a processor state structure that reflects the processor state from the last cycle. The following is a list of data fields that will be updated every cycle and can be polled by the prefetcher API.

An L1 state structure is updated with the following information about events in the last cycle:

o The last cycle at which the L1 data cache received a request. This could be the current cycle, or a previous cycle if there are no requests in the current cycle.
o The virtual address (program counter) of the instruction making the last L1 data cache request.
o A unique sequence number of the dynamic instruction making the last L1 data cache request.
o The instruction type (load, store, or other) of the instruction making the last L1 data cache request.
o The virtual address of the data being requested.
o The L1 data cache response (hit/miss).
o A single bit of additional storage associated with the cache line, for the contestant to read and/or modify.

Up to four such records can be given in a single cycle, depending on the number of requests (up to four) issued to the L1 data cache.
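To make this concrete, the following is a minimal C sketch of what one such L1 record might look like. The type and field names are assumptions made here for illustration; the actual structure is defined by the distributed header file.

    /* Hypothetical sketch of one per-request L1 record; the real structure
     * and field names are those in the framework's header file. */
    typedef enum { INSTR_LOAD, INSTR_STORE, INSTR_OTHER } InstrType;

    typedef struct {
        unsigned long long cycle;        /* last cycle the L1 data cache saw a request */
        unsigned long long pc;           /* virtual address (PC) of the instruction    */
        unsigned long long seq_num;      /* unique dynamic instruction sequence number */
        InstrType          type;         /* load, store, or other                      */
        unsigned long long data_vaddr;   /* virtual address of the requested data      */
        int                hit;          /* L1 response: 1 = hit, 0 = miss             */
        int                extra_bit;    /* one contestant-managed bit on the line     */
    } L1StateRecord;

    /* Up to four such records may be populated in a single cycle. */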
If a requested address misses in the L1 data cache, a request is made to the L2 data cache. Each cycle, an L2 state structure is updated with the following information:

o A unique sequence number of the dynamic instruction that originally made the request. The sequence numbers of instructions subsequently requesting the L2 data cache block are not provided.
o The virtual address of the requested L2 data cache block.
o The L2 data cache response (hit/miss).
o A single bit of additional storage associated with the cache line, for the contestant to read and/or modify.
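Analogously, a hypothetical sketch of the per-cycle L2 record (again, the names are illustrative only and not those of the framework's header):

    /* Hypothetical sketch of the L2 request record exposed each cycle. */
    typedef struct {
        unsigned long long seq_num;      /* sequence number of the original requester;
                                            later requesters of the block are not given */
        unsigned long long block_vaddr;  /* virtual address of the requested L2 block   */
        int                hit;          /* L2 response: 1 = hit, 0 = miss              */
        int                extra_bit;    /* one contestant-managed bit on the line      */
    } L2StateRecord;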
In addition to this information, the contestant is provided with 32 Kbits of storage to implement the state machines that manage the prefetching algorithm. This storage limit includes all the state required for the L1 and the L2 prefetchers. The limit applies only to the storage available to implement the algorithm; the contestant will not be penalized for the complexity of the algorithm logic. The prefetching algorithm will consist of prefetchers for the L1 data cache and the L2 data cache, and the 32-Kbit storage can be allocated between them as the contestant sees fit.
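As an example of how the 32-Kbit budget might be accounted for, consider a hypothetical stride-prefetcher table; the structure and field widths below are assumptions for illustration and are not part of the framework.

    /* Example budget check for a hypothetical 256-entry stride table.
     * Per entry: 16-bit PC tag + 32-bit last-address field (low address bits)
     *            + 16-bit stride + 2-bit confidence = 66 bits.
     * 256 entries * 66 bits = 16,896 bits, which fits within the
     * 32,768-bit (32 Kbit) limit and leaves room for an L2 prefetcher. */
    #define TABLE_ENTRIES   256
    #define BITS_PER_ENTRY  (16 + 32 + 16 + 2)
    #define TABLE_BITS      (TABLE_ENTRIES * BITS_PER_ENTRY)   /* 16,896 bits */
    #define STORAGE_LIMIT   (32 * 1024)                        /* 32,768 bits */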
Configurations Used for Evaluation

Submissions will be evaluated based on the measured performance of their prefetching algorithms across three configurations, which differ in L2 data cache size and cache/memory bandwidth as follows:

o Configuration 1: L2 size is 2 MB. L1 and L2 queues have almost-infinite bandwidth (realistically, this means that the queues can process up to 1000 requests in parallel).
o Configuration 2: L2 size is 2 MB. L1 and L2 queues have limited bandwidth (a maximum of one request per cycle from the L1 to the L2, and a maximum of one request every ten cycles from the L2 to memory).
o Configuration 3: L2 size is 512 KB. L1 and L2 queues have limited bandwidth, as in Configuration 2.
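The three evaluation configurations can be summarized as in the sketch below. This is an illustrative summary only; the framework selects the configuration itself, and these names are not part of the API.

    /* Illustrative summary of the three evaluation configurations. */
    typedef struct {
        unsigned l2_size_kb;   /* L2 data cache size in KB                           */
        int      limited_bw;   /* 1 = limited queue bandwidth, 0 = almost infinite   */
    } EvalConfig;

    static const EvalConfig kConfigs[3] = {
        { 2048, 0 },  /* Config 1: 2 MB L2, almost-infinite queue bandwidth          */
        { 2048, 1 },  /* Config 2: 2 MB L2, 1 req/cycle L1->L2, 1 req/10 cycles L2->memory */
        {  512, 1 },  /* Config 3: 512 KB L2, same limited bandwidth as Config 2     */
    };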