The Journal of Instruction-Level Parallelism

Description of the Simulation Infrastructure

The
provided simulation framework is based on the CMP$im
simulator. The framework models a simple out-of-order core with the following
basic parameters:

o 128-entry instruction window with no scheduling restrictions (i.e., any
instruction that is ready in the window can be scheduled out of order).

o The processor has an 8-stage, 4-wide pipeline. At most two loads and one
store can be issued every cycle.

o Perfect branch prediction (i.e., no front-end or fetch hazards).

o All
instructions have a one-cycle latency except for cache misses: an L1 miss
that hits in the L2 takes 10 cycles, an L2 miss that hits in the L3 takes
30 cycles, and an L3 miss (a memory request) takes 200 cycles. Hence, the
total round-trip time from the processor to memory is 10 + 30 + 200 = 240
cycles.

o The memory model is a 3-level cache
hierarchy: split L1 instruction and data caches, an L2, and an L3 (last-level)
cache. All caches use 64-byte lines. The L1 instruction cache is a 32KB,
4-way set-associative cache with LRU replacement. The L1 data cache is 32KB,
8-way set-associative with true LRU replacement. The L2 data cache is 256KB,
8-way set-associative with true LRU replacement. The Last Level Cache (LLC)
is 16-way set-associative, with true LRU replacement implemented as the
default configuration. This algorithm uses an additional four bits per cache
line that reflect LRU order (0 being the most recently used line and 15 the
least recently used). The LLC is non-inclusive (and non-exclusive). The LLC
size depends on the configurations specified later in this document.

We
will distribute the framework as a library that includes a header file to
which the contestant will add his or her replacement algorithm code. The
replacement API has access to a structure that reflects the L2 misses that
cause LLC references. The following data fields are updated every cycle and
can be polled by the replacement algorithm API. The LLC state structure is
updated with the following information about each LLC event:

o Event type (Instruction Fetch, Load, or Store) for the instruction that
caused the LLC reference.

o Virtual address (program counter) of the instruction that caused the LLC
reference.

o Virtual address of the data being requested by the LLC reference.

o Thread number of the instruction that caused the LLC reference. For
single-core simulation, the thread number is always zero; for multi-core
simulation, it ranges from 0 to Ncores-1.

In addition to this information, contestants are
provided with the following storage budget: (1) an additional four bits per
cache line on top of the four bits used by LRU (for a total of eight bits
per cache line); (2) an additional 1K bits for the whole cache, to be used
for common data structures that are not associated with individual cache
lines. Note that contestants can choose to use some of the bits associated
with cache lines as part of the common data structures. For example, a
contestant may use only 5 bits with each cache line and devote the remaining
storage (3 bits/line x number of lines + 1K bits) to common structures. This
storage limit includes all the state required for the LLC replacement
algorithm, but it applies only to storage: the contestant will not be
penalized for the complexity of the algorithm logic.

Configurations Used for Evaluation

Submissions
will be evaluated based on the measured performance of their replacement
algorithms over the following configurations:

o Configuration 1: a single-core, single-thread configuration with a 1 MB
LLC. Total storage that can be used by the replacement algorithm is 129K
bits (16K lines x 8 bits per line + 1K bits for the whole cache), including
the storage currently used for LRU.
o Configuration 2: a four-core configuration with a 4 MB shared LLC. Total
storage that can be used by the replacement algorithm is 513K bits (64K
lines x 8 bits per line + 1K bits for the whole cache), including the
storage currently used for LRU.