Memory Hierarchy and Cache


Computer RAM Hierarchy and Cache

This is a summary created from course notes and other information online.

Access methods

Memory in a computer can be organised by four main types of access: sequential, direct, random and associative.

Sequential access means that the memory can normally only be read in order, as with tape drives, where jumping to an arbitrary location carries a large cost in access time.

Random access means that each addressable location in the memory device has a unique address, and thus all accesses have fairly equal access time (such as RAM).

Direct access means that each block of data has a unique address, but reaching it involves moving to a general vicinity first and then searching sequentially, as with hard disk drives.

Associative memory uses a random-access mechanism so that data can be retrieved rapidly, but lookup only matches part of the address. This means that the same slot could hold data from several different memory addresses (at different times). This is how cache memory is accessed.

Performance

From the outside, each of these types of access has different performance, and this is characterised by access time, memory cycle time, and transfer rate.

Access time is the time that it takes to perform the read or write cycle to that memory device.

Memory cycle time is mainly applicable to the system RAM and is the access time plus any extra time before the next operation can start. This can include time spent on refresh operations or other housekeeping.

Transfer rate is the overall rate of data transfer into or out of the memory device.

Memory Design

Memory is designed with a trade-off between the cost per byte, total device capacity and performance. Because of this, we often have a minimum of 3 levels of memory in a system, each working to hide the latency of the level below. This three-way relationship can be seen between main program storage (SSD or HDD), RAM and the CPU cache: the cache hides the RAM latency, and RAM is used to hide the main storage latency.

On-Chip Memory

The fastest memories in the entire system sit on the main CPU itself. These are often built out of SRAM, as it is extremely fast and needs no refresh cycles.

Each CPU core has its own register file, where all of its results are worked on; this runs at the same speed as the core itself (often in the GHz range). Register files are kept very small because of the high speed they run at, as such fast storage takes up a large amount of silicon area.

The next level is the TLB and L1 cache for the core, which is usually slightly slower but much larger. This uses the associative access pattern.

There are often further levels of cache (L2, L3) that are increasingly larger and slower (but still faster than RAM).

System RAM

The main computer RAM then holds the working data of the executing programs, as well as acting as a cache for the main storage of the system.

Main storage

This is your hard drive or solid-state disk: slow in the realm of computers, but it holds a huge amount of data.

Locality

Most programs are built around a code base that is moderately linear in structure, i.e. if the program accesses address A, it is more likely to access A+1 next than A+20000. This applies to the program code as well as the data accessed by the program. Because most programs were written in this linear fashion when hardware acceleration first appeared, the acceleration methods were designed around it. That made linear access the optimal data access pattern, which in turn has driven further optimisation of both programs and hardware towards it.

Taking advantage of this locality, programs and nearby files can be preemptively cached into the system RAM from the disk. This can hide the latency of fetching files from the hard disk.

This concept is then used a second time to pre-emptively cache data from the system RAM into the caches of the CPU.
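
As a rough illustration (not from the notes), the sketch below sums the same elements of a large list in sequential order and then in a shuffled order. The exact timings depend heavily on the machine and language, but the sequential walk is typically faster because it is cache friendly.

```python
# A rough illustration of spatial locality (assumed example, not from the notes):
# visiting the same elements sequentially vs in a random order.
import random
import time

data = list(range(5_000_000))
shuffled_indices = data[:]
random.shuffle(shuffled_indices)

start = time.perf_counter()
total = sum(data[i] for i in data)              # sequential indices: 0, 1, 2, ...
print("sequential:", time.perf_counter() - start)

start = time.perf_counter()
total = sum(data[i] for i in shuffled_indices)  # same elements, random order
print("shuffled:  ", time.perf_counter() - start)
```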

Management

Registers <-> Memory

The movement of data between the CPU registers and the system RAM is managed by the compiler (or the user).

Cache <-> Memory

The movement of data between the caches and the system memory is managed transparently by hardware in the processor. This means that programmers can treat the system memory as being as fast as possible and let the hardware deal with moving the data. This adds a fair amount of complexity to the CPU itself, but it is largely a one-time cost at the design stage and a small cost for each processor made.

This is a better trade-off, as it means that programs do not need to be heavily optimised for memory access by hand. However, as has been found in recent years, this process creates side channels, because the cache needs to be managed across multiple cores of the processor to ensure that dirty data is handled correctly.

Memory <-> Disks

The operating system handles the bulk of the movement of this data using Virtual memory. This is assisted by the TLB that maps the virtual addresses to the physical addresses used in the hardware. Files are managed by the programmer, but these still ultimately use the Operating System to handle the actual implementation details.

Disk access

Hard drives are still prevalent even with the common use of solid-state media. Disks offer the lowest cost for storage that is mostly random-access (tape can be cheaper again). However, they suffer from the physical movement of the drive mechanism, which can result in rapidly growing request queues.

Worked example

An example to illustrate the issues the mechanical aspect can raise:

Given a disk with the following specifications:

  • 512 Byte sectors
  • 15000 rpm
  • 4ms average seek time
  • 100MB/sec transfer rate
  • 0.2ms controller overhead
  • The disk starts idle

The average time to read a sector can be calculated as follows (the same arithmetic appears as a code sketch after the breakdown):

  • 4ms seek time (time to move the head to the target track) +
  • 0.5 rotation / (15000 rpm / 60 s) = 2ms average rotational latency (1) +
  • 512 Bytes / 100MB/sec ≈ 0.005ms transfer time +
  • 0.2ms controller latency
  • Total: around 6.2ms average read time for any sector on this disk.

(1) This comes from the rotation speed of the disk and the fact that the target sector could be anywhere on the track, so on average the head waits for half a rotation.
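
As a sketch, the same arithmetic in code, using only the values from the specification above:

```python
# Average time to read one sector from the example disk (milliseconds).
seek_ms = 4.0                                   # average seek time
rpm = 15000
rotational_ms = 0.5 / (rpm / 60) * 1000         # half a rotation on average = 2 ms
transfer_ms = 512 / 100e6 * 1000                # 512 bytes at 100 MB/s ~= 0.005 ms
controller_ms = 0.2

total_ms = seek_ms + rotational_ms + transfer_ms + controller_ms
print(round(total_ms, 3))                       # ~6.205 ms
```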

This is further complicated as most hard drives have an onboard cache to hide some of this latency for bulk transfers (reads of sequential data). This works very similarly in concept to the caches on a CPU, with tweaks to suit the medium.

Error correction

On any medium it is possible for errors to show up. These can be detected using CRC or Hamming codes. The idea is that the data stored on the medium carries extra bits for error detection and correction, so for small errors the data can be transparently repaired and passed on to the host without it even knowing.

Hamming codes are looked at here as they are an easier example to understand and solve. If we have M bits of data to protect and we use K bits of error-correcting code space, then so long as (2^K - 1) >= M + K, any single-bit error can be detected and repaired transparently. For example, protecting M = 8 data bits requires K = 4 check bits, since 2^4 - 1 = 15 >= 8 + 4 = 12.
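
A minimal sketch of this check in code; parity_bits_needed is a hypothetical helper for illustration, not something from the course notes:

```python
# Smallest number of Hamming parity bits K that protects M data bits,
# using the condition (2**K - 1) >= M + K.
def parity_bits_needed(m: int) -> int:
    k = 1
    while (2 ** k - 1) < m + k:
        k += 1
    return k

print(parity_bits_needed(8))    # 4 parity bits for one byte (12-bit code)
print(parity_bits_needed(64))   # 7 parity bits for a 64-bit word
```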

Many modern error codes can detect more bit errors than they can repair, which allows the host to at least know that the data is invalid.

Hamming Distance

The Hamming distance between two values is the number of bit positions in which they differ.
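
For two integers this is just the population count of their XOR; a minimal sketch:

```python
# Hamming distance: count the bit positions in which two values differ.
def hamming_distance(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

print(hamming_distance(0b1011, 0b1001))  # 1
print(hamming_distance(0b1111, 0b0000))  # 4
```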

Hamming Codes

This section looks at the common (12,8) Hamming code (8 data bits plus 4 parity bits), as it is versatile and allows for error correction of 8-bit data.

To calculate the code by hand (in software these can all be streaming XOR operations), the data to be encoded is laid out as below, with each parity bit computed as the XOR of the data bits it covers (a code sketch follows the list):

  • [P1][P2][D0][P4][D1][D2][D3][P8][D4][D5][D6][D7]
  • P1 = D0 ^ D1 ^ D3 ^ D4 ^ D6
  • P2 = D0 ^ D2 ^ D3 ^ D5 ^ D6
  • P4 = D1 ^ D2 ^ D3 ^ D7
  • P8 = D4 ^ D5 ^ D6 ^ D7
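
A sketch of this encoding in software, assuming D0 is the least significant bit of the data byte. The syndrome check that follows uses the standard trick of XOR-ing the positions of the set bits, which is consistent with the parity layout above:

```python
# (12,8) Hamming encoder: positions 1..12 hold
# [P1][P2][D0][P4][D1][D2][D3][P8][D4][D5][D6][D7].
def hamming_encode_12_8(data: int) -> list[int]:
    d = [(data >> i) & 1 for i in range(8)]   # D0..D7, D0 = least significant bit
    p1 = d[0] ^ d[1] ^ d[3] ^ d[4] ^ d[6]
    p2 = d[0] ^ d[2] ^ d[3] ^ d[5] ^ d[6]
    p4 = d[1] ^ d[2] ^ d[3] ^ d[7]
    p8 = d[4] ^ d[5] ^ d[6] ^ d[7]
    return [p1, p2, d[0], p4, d[1], d[2], d[3], p8, d[4], d[5], d[6], d[7]]

def hamming_check(code: list[int]) -> int:
    # XOR of the 1-based positions of all set bits: 0 means no error detected,
    # otherwise the value is the position of the single flipped bit.
    syndrome = 0
    for pos, bit in enumerate(code, start=1):
        if bit:
            syndrome ^= pos
    return syndrome

code = hamming_encode_12_8(0b10110010)
print(hamming_check(code))   # 0: codeword is consistent
code[6] ^= 1                 # flip position 7 (D3)
print(hamming_check(code))   # 7: points straight at the flipped bit
```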

Cache Memory

Cache memory is the memory closest to the CPU in the system. It works to have the data the CPU is most likely to need already available, which avoids the long wait for data to arrive from the RAM.

Direct Mapped Cache

A direct-mapped cache is the simplest implementation: the cache is set up so that each slot can only hold data from memory addresses whose least significant few bits match that slot's index. For example, if the cache consisted of 8 slots, these would have indexes in the range 0b000 - 0b111, and the slot with index 0b101 could hold data from main memory addresses {0b00101, 0b01101, 0b10101, 0b11101}.

Each slot often has extra "tag" bits that store the remaining bits of the address, so the controller can check whether the wanted data is actually there. Alongside this there are usually also a "valid" bit and a "dirty" bit, indicating whether the slot holds valid data and whether that data needs to be written back to main RAM before it can be evicted, respectively.
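
A toy sketch of a direct-mapped lookup; the 8-slot size and the mem[...] placeholder data are assumptions for illustration:

```python
# Direct-mapped cache: the low bits of the block address pick the slot,
# the remaining high bits are kept as the tag.
NUM_SLOTS = 8
cache = [{"valid": False, "tag": None, "data": None} for _ in range(NUM_SLOTS)]

def lookup(block_address: int):
    index = block_address % NUM_SLOTS    # least significant bits select the slot
    tag = block_address // NUM_SLOTS     # remaining bits stored as the tag
    slot = cache[index]
    if slot["valid"] and slot["tag"] == tag:
        return ("hit", slot["data"])
    # Miss: fetch from (pretend) main memory and fill the slot.
    slot["valid"], slot["tag"], slot["data"] = True, tag, f"mem[{block_address:#07b}]"
    return ("miss", slot["data"])

print(lookup(0b00101))   # miss, fills slot 0b101
print(lookup(0b00101))   # hit
print(lookup(0b11101))   # miss, evicts the previous occupant of slot 0b101
```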


Block Size

For each slot in the cache, there is the actual data associated with it. The amount of data stored per slot is called the block size; in cache systems, a block is the smallest chunk of data stored. Having a larger block size means that a cache hit returns more data, and therefore the miss rate should be lower.

However, with larger block sizes the cost of a miss is significantly higher. The cache is also usually kept at a fixed total size, so larger blocks dramatically reduce the number of entries available, further exacerbating the issue.

Associativity

Associativity is a measure of how the cache is organised: a fully associative cache means that a block could be stored anywhere, while a 1-way (direct-mapped) cache has only one possible slot for each memory block.

Fully Associative

A fully associative cache can store a block in any location, so when looking up a block the hardware must check the tag of each and every block it holds. This dramatically increases the hardware requirement, as a comparator is needed on every single tag to test it against the full memory address. This is usually only done for small memories, but it does allow a larger percentage of relevant data to be kept in the cache (i.e. only the oldest data is evicted, rather than whatever happens to map to the same slot).

n-way set associative

n-way set associative means that there are n entries for each possible slot (set). A 1-way set-associative memory is a direct-mapped cache, whereas a 2-way set-associative memory has half the number of sets but can store 2 different blocks in each (assuming the same total size).

This allows an adjustable trade-off between hardware complexity and cache utilisation.


Selecting the associativity

The choice of the associativity depends on the cost of a miss vs the cost of the implementation. The largest gains are in going from 1-way to 2-way (20%+ reduction in miss rate).

For further reductions, multiple levels of caching are used.

Cache replacement policy

When data in the cache needs to be evicted to make room for new data coming in, different policies can be used to select the victim slot.

In a direct-mapped (1-way associative) cache there is no choice, as each memory location can only map to one slot. For higher values of n there are options for how to choose which entry to evict.

Preference is given to a non-valid entry if there is one, as it is better to not have to evict at all.

Otherwise, choose among the entries in the set using one of two options: least recently used or random.

Least recently used (LRU) evicts the entry that has gone unused for the longest. This is simple for 2-way and fairly manageable for 4-way, but above that it becomes hard to implement.

Random simply picks an entry at random; at higher values of n this gives similar performance to LRU and is a much simpler task to implement in hardware.
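
A toy sketch of a 2-way set-associative cache using LRU replacement; the 4-set size and the access pattern are assumptions for illustration:

```python
# Each set holds at most two tags; the least recently used one is evicted.
from collections import OrderedDict

NUM_SETS = 4
sets = [OrderedDict() for _ in range(NUM_SETS)]   # per set: tag -> data, oldest first

def access(block_address: int) -> str:
    index = block_address % NUM_SETS
    tag = block_address // NUM_SETS
    ways = sets[index]
    if tag in ways:
        ways.move_to_end(tag)                     # mark as most recently used
        return "hit"
    if len(ways) >= 2:                            # set full: evict the LRU entry
        ways.popitem(last=False)
    ways[tag] = f"mem[{block_address}]"
    return "miss"

# Addresses 0, 4 and 8 all map to set 0; LRU keeps the recently reused block 0.
for addr in (0, 4, 0, 8, 0):
    print(addr, access(addr))   # miss, miss, hit, miss, hit
```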

Multi-level caches

In modern CPU design there are often L1, L2 and L3 caches. These successive levels of cache work to lower the overall miss rate and also provide higher performance for a lower cost.

  • L1 cache is often built as a 1-way associative, small but extremely fast cache.
  • L2 cache is larger and slower, but still faster than L3.
  • L3 is the largest cache and is often shared across the entire die, but is still dramatically faster than main RAM.

The primary cache (L1) is focused on minimal hit time (as fast a response as possible), whereas the L2 cache is designed for a low miss rate, to avoid main memory accesses as much as possible, so hit time is of less importance in its design.

Therefore, L1 is usually much smaller to keep the speed high, and L1 has a smaller block size than L2.

Example

Given a CPU with:

  • a base CPI (clocks per instruction) of 1
  • a clock speed of 4GHz
  • Miss rate per instruction of 2%
  • Main memory access time of 100ns

For a single level of cache:

  • Miss penalty = 100ns / 0.25ns = 400 cycles of waiting
  • Therefore the effective CPI = 1 (ideal) + (2% * 400) = 9

Adding an L2 cache:

  • Access time = 5ns
  • Global miss rate to the main memory of 0.5%

Primary miss with L2 hit

  • 5ns/0.25ns = 20 cycles wasted

Primary miss and L2 miss

  • A penalty of 400 cycles

CPI = 1 + (2%*20) + (0.5% * 400) = 3.4

Performance increase of 9/3.4 = 2.6 times faster.
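
The same arithmetic as a sketch in code, using only the values from the example above:

```python
clock_period_ns = 1 / 4.0                             # 4 GHz clock -> 0.25 ns per cycle
base_cpi = 1.0
l1_miss_rate = 0.02
main_memory_penalty = 100 / clock_period_ns           # 100 ns -> 400 cycles

# Single level of cache: every miss pays the full main-memory penalty.
cpi_l1_only = base_cpi + l1_miss_rate * main_memory_penalty
print(cpi_l1_only)                                    # 9.0

# With an L2: L1 misses pay the L2 access time; only global misses go to RAM.
l2_penalty = 5 / clock_period_ns                      # 5 ns -> 20 cycles
global_miss_rate = 0.005
cpi_with_l2 = base_cpi + l1_miss_rate * l2_penalty + global_miss_rate * main_memory_penalty
print(cpi_with_l2)                                    # 3.4
print(cpi_l1_only / cpi_with_l2)                      # ~2.6x faster
```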

Handling reads and writes

Cache Misses

A cache miss is when the processor asks for a memory location that is not present in the cache. Misses are often measured separately for each level of the cache.

The control unit detects these misses and will then fetch the corresponding data from memory.

Cache miss handling is often done in collaboration between the processor's control unit and a separate controller that manages memory access and refills the cache.

When the cache misses and the processor is waiting for memory, the pipeline is stalled (rather than raising an interrupt).

Cache writes

When writing data back out of the CPU, it can be written to the main memory and cache (write-through) or only to the cache (write-back).

When data is written to both the cache and main memory (write-through), the CPU is halted until the main system memory has been updated as well, so these writes are rather slow. It is preferable to use write-back and let the memory controller handle synchronising the write to the main system memory.

With write-back memory, write buffers are used to try and prevent stalls: the CPU pushes the write into the buffer and the memory controller performs the actual write while the CPU keeps working. If the CPU attempts a write while the buffer is full, it stalls just as if there were no buffer, which means the write-back cache degrades to write-through performance in the worst case.
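
A toy sketch of the write-buffer behaviour described above; the buffer depth and the function names are assumptions for illustration, not from the notes:

```python
from collections import deque

BUFFER_DEPTH = 4
write_buffer = deque()

def cpu_write(address: int, value: int) -> str:
    if len(write_buffer) >= BUFFER_DEPTH:
        return "stall"                        # full buffer behaves like no buffer
    write_buffer.append((address, value))     # CPU carries on immediately
    return "buffered"

def memory_controller_drain() -> None:
    # Runs in the background while the CPU keeps working.
    while write_buffer:
        address, value = write_buffer.popleft()
        # ... the actual write to main system memory would happen here ...

for i in range(6):
    print(i, cpu_write(0x1000 + i, i))        # the 5th and 6th writes stall
```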

Often a dirty block is not written back until a new memory block needs that slot. This lets adjacent writes coalesce without hardware overhead, and the latency of having the wrong block in that slot would be incurred anyway.

Write cache miss

If the CPU attempts to write to a block of memory that is not present in the cache, the memory controller must first fetch the matching block from system memory, apply the change in the cache, and then follow through with the write as described above.

Associative cache bit mapping

When a memory address is passed to the cache controller, it is decoded into 4 key sections:

  • [Tag][Index][BlockOffset][Byte Offset]

The Tag is stored alongside the cached data and is used to check whether that cache line holds the wanted address.

The Index is used to select which slots of the memory the data can be stored in.

The Block offset selects the word inside that block of memory.

The Byte offset selects the byte in that word of memory.

Increasing the associativity decreases the Index size and increases the tag size.
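
A sketch of splitting an address into these fields; the widths used (4-byte words, 4-word blocks, 256 sets) are assumptions purely for illustration:

```python
BYTE_OFFSET_BITS = 2     # 4 bytes per word
BLOCK_OFFSET_BITS = 2    # 4 words per block
INDEX_BITS = 8           # 256 sets

def split_address(address: int) -> dict:
    byte_offset = address & ((1 << BYTE_OFFSET_BITS) - 1)
    address >>= BYTE_OFFSET_BITS
    block_offset = address & ((1 << BLOCK_OFFSET_BITS) - 1)
    address >>= BLOCK_OFFSET_BITS
    index = address & ((1 << INDEX_BITS) - 1)
    tag = address >> INDEX_BITS
    return {"tag": tag, "index": index,
            "block_offset": block_offset, "byte_offset": byte_offset}

print(split_address(0x1234ABCD))
```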

Summary

Overall, caching is constructed as one or more layers designed to hide the latency of the layers further out. Each layer is built as an n-way set-associative cache, which involves design trade-offs between implementation cost and runtime cost.

With multiple levels of caching, each level hides the latency of the one behind it:

  • L1 hides L2
  • L2 hides L3
  • L3 hides system RAM
  • RAM hides HDD

All of this is built on the two concepts of locality: that programs tend to access the same data multiple times (temporal locality), and that accesses tend to be in linear order (spatial locality).