

Most of the slides in this lecture are either from or adapted from slides provided by the authors of the textbook "Computer Systems: A Programmer's Perspective," 2nd Edition and are provided from the website of Carnegie Mellon University, course 15-213, taught by Randy Bryant and David O'Hallaron in Fall 2010. These slides are indicated "Supplied by CMU" in the notes section of the slides.

This is the first of two lectures on the memory hierarchy. The second, covering secondary storage (disk, etc.), will be given in a few weeks.



## **SRAM vs DRAM Summary**





Note that the chip in the slide contains 16 supercells of 8 bits each. The supercells are organized as a 4x4 array.







The memory controller pulls in eight supercells from eight DRAM modules and transfers them to the processor over the memory bus.



Adapted from a slide supplied by CMU.



This slide is based on figures from **What Every Programmer Should Know About Memory** (http://www.akkadia.org/drepper/cpumemory.pdf), by Ulrich Drepper. It's an excellent article on memory and caching.

Making DRAM cell arrays run faster is costly. So, rather than speeding up the individual modules, they are organized to transfer data in parallel. All that then needs to be sped up is the bus that carries the data, which is relatively inexpensive to do.

With SDR (Single Data-Rate DRAM), the DRAM cell array produces data at the same frequency as the memory bus, sending data on the rising edge of the signal.

With DDR1 (double data-rate), data is sent twice as fast by "double-pumping" the bus: sending data on both the rising and falling edges of the signal. To get data out of the cell array at this speed, data from two adjacent supercells are produced at once. These are buffered so that one doubleword at a time can be transmitted over the bus.

With DDR2, the frequency of the memory bus is doubled, and four supercells are produced at once. DDR3 takes this one step further, with eight supercells being produced at once. DDR4 takes this a step further and delivers 16 supercells at once.

Note that the processor fetches and stores 64 bytes of data at a time (for reasons having to do with caching, which we cover later in this lecture).



DDR4 memory became available in 2015. It's 16 times as fast as SDRAM, but transfers 64 consecutive bytes at a time, the same as DDR3. DDR5 is currently being discussed.

## **Quiz 2**

**A program is loading randomly selected bytes from memory. These bytes will be delivered to the processor on a DDR4 system at a speed that's n times that of an SDR system, where n is:**

**a) 8**

**b) 4**

**c) 2**

**d) 1**

**CS33 Intro to Computer Systems XVII–11** Copyright © 2024 Thomas W. Doeppner. All rights reserved.

## **A Mismatch**

- **A processor clock cycle is ~0.3 nsecs**
	- **Older SunLab machines (Intel Core i5-4690) run at 3.5 GHz**
- **Basic operations take 1 – 10 clock cycles**
	- **.3 – 3 nsecs**
- **Accessing memory takes 70-100 nsecs**
- **How is this made to work?**
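To make the mismatch concrete, the figures on this slide can be converted into clock cycles. The following is a minimal sketch; the helper names `cycle_ns`, `ns_to_cycles`, and `print_mismatch` are made up for this illustration.

```c
#include <stdio.h>

/* Length of one clock cycle in nanoseconds, for a clock rate given in GHz. */
double cycle_ns(double ghz) {
    return 1.0 / ghz;
}

/* Number of clock cycles that elapse during a delay of `ns` nanoseconds. */
double ns_to_cycles(double ns, double ghz) {
    return ns * ghz;
}

/* Print the slide's memory-access latency in units of clock cycles. */
void print_mismatch(void) {
    double ghz = 3.5;  /* clock rate quoted on the slide */
    printf("cycle time: %.2f ns\n", cycle_ns(ghz));
    printf("memory access: %.0f to %.0f cycles\n",
           ns_to_cycles(70.0, ghz), ns_to_cycles(100.0, ghz));
}
```

At 3.5 GHz, a 70-100 nsec memory access corresponds to 245 to 350 clock cycles, time in which the processor could otherwise have completed many basic operations.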



Sitting between the processor and RAM are one or more caches. (They are actually on the same chip as the processor.) Items recently accessed by the processor reside in the cache, where they can be accessed much more quickly than from memory. The processor also does a certain amount of prefetching to get things from RAM before they are needed. This involves some guesswork, but works reasonably well for well-behaved programs.



"ALU" (arithmetic and logic unit) is a traditional term for the instruction and execution units of a processor.



















Note that the cache holds two rows of the matrix; each cache block holds four doubles. When a[0][0] is read, so are a[0][1] through a[0][3]. Thus, after one cache miss, we get three hits.



For each reference to an element of the matrix, its entire row is brought into the cache, even though the rest of the row is not immediately used.
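The difference between row-by-row and column-by-column traversal can be checked with a toy simulation. This is only a sketch under stated assumptions: the matrix is taken to be 4x4 doubles, the cache holds two blocks of four doubles each, and the cache is modeled as direct-mapped and initially empty (the notes do not say how the slide's cache is organized). The name `count_misses` is hypothetical.

```c
#define N 4              /* matrix is N x N doubles (assumed) */
#define BLOCK_DOUBLES 4  /* doubles per cache block */
#define NBLOCKS 2        /* blocks in the toy cache */

/* Count cache misses when every element of a[N][N] is read once.
   by_column == 0: row by row       (a[0][0], a[0][1], ...)
   by_column != 0: column by column (a[0][0], a[1][0], ...) */
int count_misses(int by_column) {
    long cached[NBLOCKS];  /* block number held in each slot, -1 if empty */
    int misses = 0;

    for (int s = 0; s < NBLOCKS; s++)
        cached[s] = -1;

    for (int outer = 0; outer < N; outer++) {
        for (int inner = 0; inner < N; inner++) {
            int i = by_column ? inner : outer;
            int j = by_column ? outer : inner;
            long block = (i * N + j) / BLOCK_DOUBLES;  /* block holding a[i][j] */
            int slot = (int)(block % NBLOCKS);         /* direct-mapped placement */
            if (cached[slot] != block) {
                cached[slot] = block;  /* miss: load the whole block */
                misses++;
            }
        }
    }
    return misses;
}
```

In this model, `count_misses(0)` reports 4 misses for the 16 row-order accesses (one miss followed by three hits per row), while `count_misses(1)` reports 16: every column-order access misses, because each block is evicted before its remaining elements are used.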



If arrays x and y have the same alignment, i.e., both start in the same cache set, then each access to an element of y replaces the cache line containing the corresponding element of x, and vice versa. The result is that the loop executes very slowly: each access to either array results in a conflict miss.



However, if the two arrays start in different cache sets, then the loop executes quickly: there is a cache miss on only every fourth access to each array.
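The effect of alignment on conflict misses can be sketched with a small simulation of a loop that reads x[i] and then y[i] for each i. The parameters are assumptions chosen for illustration: eight 4-byte elements per array, a direct-mapped cache with two sets, and 16-byte lines (four elements each). The name `dot_misses` is hypothetical; the arguments are the starting byte addresses of the two arrays.

```c
#define ELEM_BYTES 4    /* size of one array element (assumed) */
#define BLOCK_BYTES 16  /* four elements per cache line */
#define NSETS 2         /* direct-mapped cache with two sets */
#define N 8             /* elements in each array */

/* Count cache misses for a loop that reads x[i] then y[i], i = 0..N-1,
   given the starting byte addresses of x and y. */
int dot_misses(long x_addr, long y_addr) {
    long cached[NSETS];  /* block number held in each set, -1 if empty */
    int misses = 0;

    for (int s = 0; s < NSETS; s++)
        cached[s] = -1;

    for (int i = 0; i < N; i++) {
        long addr[2];
        addr[0] = x_addr + (long)i * ELEM_BYTES;  /* address of x[i] */
        addr[1] = y_addr + (long)i * ELEM_BYTES;  /* address of y[i] */
        for (int k = 0; k < 2; k++) {
            long block = addr[k] / BLOCK_BYTES;
            int set = (int)(block % NSETS);
            if (cached[set] != block) {
                cached[set] = block;  /* miss: evict whatever was there */
                misses++;
            }
        }
    }
    return misses;
}
```

With x at address 0 and y at address 32, both arrays start in the same set and `dot_misses(0, 32)` counts 16 misses in 16 accesses. Shifting y by one line, as in `dot_misses(0, 48)`, puts the arrays in different sets and the count drops to 4, i.e., one miss per four accesses to each array.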






The L3 cache is known as the *last-level cache* (LLC) in the Intel documentation.

One concern is whether what's contained in, say, the L1 cache is also contained in the L2 cache. If so, caching is said to be **inclusive**. If what's contained in the L1 cache is definitely not contained in the L2 cache, caching is said to be **exclusive**. An advantage of exclusive caches is that the total cache capacity is the sum of the sizes of the levels, whereas for inclusive caches, the total capacity is just that of the largest level. An advantage of inclusive caches is that what's been brought into the cache hierarchy by one core is available to the other cores.

AMD processors tend to have exclusive caches; Intel processors tend to have inclusive caches.



Most current processors use the write-back/write-allocate approach. This causes some (surmountable) difficulties for multi-core processors that have a separate cache for each core.



This slide describes accessing memory on Intel Core i5 and i7 processors.

If the processor determines that a program is accessing memory sequentially (because the past few accesses have been sequential), then it begins the load of the next block from memory before it is requested. If this determination was correct, then the memory will be in the cache (or well on its way) before it's needed.













"Stride n" reference patterns are sequences of memory accesses in which every nth element is accessed in memory order. Thus stride 1 means that every element is accessed, starting at the beginning of a memory area, continuing to its end.
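A stride-n reference pattern can be written as a simple loop. The sketch below is illustrative only; the function name `sum_stride` is made up.

```c
#include <stddef.h>

/* Sum every stride-th element of a[0..len-1], in memory order.
   stride == 1 touches every element; stride == 2 every other one, etc. */
double sum_stride(const double *a, size_t len, size_t stride) {
    double sum = 0.0;

    if (stride == 0)
        return sum;  /* a stride of 0 would never advance */

    for (size_t i = 0; i < len; i += stride)
        sum += a[i];
    return sum;
}
```

With stride 1, consecutive accesses fall in the same cache line most of the time. As the stride grows, fewer of the elements brought in with each line are used before the loop moves on.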



Based on slides supplied by CMU.



Adapted from a slide by CMU.





Assume we are multiplying matrices of doubles. Each element is eight bytes long, so a 64-byte cache line holds eight matrix elements. The slide shows a straightforward implementation of multiplying A and B to produce C.
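A straightforward implementation in i-j-k loop order might look like the following sketch. It is not copied from the slide; the fixed dimension `N` and the name `matmul_ijk` are assumptions made for illustration.

```c
#define N 4  /* matrix dimension; kept small here, assumed for illustration */

/* c = a * b, with the loops in i-j-k order. */
void matmul_ijk(double a[N][N], double b[N][N], double c[N][N]) {
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += a[i][k] * b[k][j];  /* a: along a row; b: down a column */
            c[i][j] = sum;
        }
    }
}
```

In the inner loop, `a` is read with stride 1, so with eight doubles per cache line it misses about once every eight iterations. `b` is read down a column, so for matrices too large to fit in the cache it misses on essentially every iteration.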



If we reverse the order of the two outer loops, there's no change in results or performance.



Moving the loop on k to be the outer loop does not affect the result, but it improves performance.
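A version with the loop on k outermost might look like the following sketch. The fixed dimension `N` and the name `matmul_kij` are assumptions made for illustration, and the caller is assumed to have zeroed `c` beforehand.

```c
#define N 4  /* matrix dimension; kept small here, assumed for illustration */

/* c += a * b, with the loops in k-i-j order. The caller must zero c first. */
void matmul_kij(double a[N][N], double b[N][N], double c[N][N]) {
    for (int k = 0; k < N; k++) {
        for (int i = 0; i < N; i++) {
            double r = a[i][k];  /* fixed for the whole inner loop */
            for (int j = 0; j < N; j++)
                c[i][j] += r * b[k][j];  /* b and c: both along a row */
        }
    }
}
```

Here the inner loop walks both `b` and `c` along rows with stride 1, so each misses only about once per cache line, and `a[i][k]` stays in a register.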



Switching the two outer loops affects neither results nor performance.



Moving the loop on i to be the inner loop makes performance considerably worse.
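A version with the loop on i innermost might look like the following sketch. The fixed dimension `N` and the name `matmul_jki` are assumptions made for illustration, and the caller is assumed to have zeroed `c` beforehand.

```c
#define N 4  /* matrix dimension; kept small here, assumed for illustration */

/* c += a * b, with the loops in j-k-i order. The caller must zero c first. */
void matmul_jki(double a[N][N], double b[N][N], double c[N][N]) {
    for (int j = 0; j < N; j++) {
        for (int k = 0; k < N; k++) {
            double r = b[k][j];  /* fixed for the whole inner loop */
            for (int i = 0; i < N; i++)
                c[i][j] += a[i][k] * r;  /* a and c: both down a column */
        }
    }
}
```

The inner loop now walks both `a` and `c` down columns, so for matrices too large to fit in the cache each of the two accesses per iteration lands in a different cache line and is likely to miss.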



The poor performance is not improved by reversing the outer loops.





















## **Which Instructions Should Be Privileged?**

- **I/O instructions**
- **Those that affect how memory is mapped**
- **Halt instruction**
- **Some others ...**


## **Who Is Privileged?**

- **No one**
	- **user code always runs in user mode**
- **The operating-system kernel runs in privileged mode**
	- **nothing else does**
	- **not even super user on Unix or administrator on Windows**


## **Entering Privileged Mode**

- **How is OS invoked?**
	- **very carefully ...**
	- **strictly in response to interrupts and exceptions**
	- **(booting is a special case)**

## **Interrupts and Exceptions**

- **Things don't always go smoothly ...**
	- **I/O devices demand attention**
	- **timers expire**
	- **programs demand OS services**
	- **programs demand storage be made accessible**
	- **programs have problems**
- **Interrupts**
	- **demand for attention from external sources**
- **Exceptions**
	- **executing program requires attention**


## **Exceptions**

- **Traps**
	- **"intentional" exceptions**
		- **execution of special instruction to invoke OS**
	- **after servicing, execution resumes with next instruction**
- **Faults**
	- **a problem condition that is normally corrected**
	- **after servicing, instruction is re-tried**
- **Aborts**
	- **something went dreadfully wrong ...**
	- **not possible to re-try instruction, nor to go on to next instruction**

These definitions follow those given in "Intel® 64 and IA-32 Architectures Software Developer's Manual" and are generally accepted even outside of Intel.
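As an illustration of a trap, a user program can ask the OS for a service by issuing a system call. The sketch below assumes Linux with glibc, where the `syscall()` wrapper executes the special instruction that invokes the OS; the function name `raw_getpid` is made up for this example.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>  /* SYS_getpid */
#include <sys/types.h>    /* pid_t */
#include <unistd.h>       /* syscall(), getpid() */

/* Ask the kernel for this process's ID by issuing the system call directly.
   The trap enters the kernel, which services the request in privileged mode;
   execution then resumes in user mode with the instruction after the trap. */
pid_t raw_getpid(void) {
    return (pid_t)syscall(SYS_getpid);
}
```

The value returned should match what the ordinary library call `getpid()` reports, since both obtain the process ID from the kernel.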







A separate stack is needed in privileged mode because the OS must be guaranteed a valid stack whenever it executes: the stack pointer must point to a region of memory that the OS can use as a stack. While the program was running in user mode, any value could have been put into the stack-pointer register. So, when the OS is invoked, it switches to a pre-allocated stack set up just for it.



When a trap or interrupt occurs, the current processor state (registers, including RIP, condition codes, etc.) is saved on the kernel stack. When the system returns to the interrupted program, this state is restored.

## **Quiz 3**

**If an interrupt occurs, which general-purpose registers must be pushed onto the kernel stack?**

**a) all**

**b) none**

**c) callee-save registers**

**d) caller-save registers**