The memory of your computer lag every 7.8 ms

The memory of your computer lag every 7.8 ms 3r33450. 3r33333.  

Modern DDR3 SDRAM. Source: 3r311. BY-SA /4.0 by Kjerish
3r33333.  
3r33333.  
During a recent visit to r3r319. Computer History Museum
at Mountain View, my attention was drawn to an ancient pattern of 3r-321. ferrite memory
. 3r33333.  
3r33333.  
3r33450. 3r33333.  

Source: 3r3355. BY-SA /3.0 by Konstantin Lanzet
3r33333.  
3r340.
3r33333.  
I quickly came to the conclusion that I have no idea how such things work. Do the rings rotate (no), and why three wires pass through each ring (I still don’t understand how they work). More importantly, I realized that I very poorly understand the principle of operation of modern dynamic RAM! 3r33333.  
3r33333.  
3r33450. 3r33333.  

Source: 3r3355. The Ulrich Drepper cycle of memory
3r33333.  
3r33333.  
I was particularly interested in one of the consequences of how dynamic RAM works. It turns out that every bit of data is stored in charge (or its absence) on a tiny capacitor in a RAM chip. But these capacitors gradually lose charge over time. To avoid the loss of stored data, they should be regularly updated to restore the charge (if any) to the original level. This update process 3r3439. includes reading each bit, and then writing it back. In the process of such an “update”, the memory is occupied and cannot perform normal operations, such as writing or storing bits. 3r33333.  
3r33333.  
It worried me for a long time, and I wondered can I notice a delay in updating at the program level? 3r33333.  
3r33333.  

Training base for updating dynamic RAM

3r33333.  
Each DIMM consists of “cells” and “rows”, “columns”, “sides”, and /or “ranks”.
This presentation is
from the University of Utah explains the nomenclature. The configuration of the computer's memory can be checked with the command decode-dimms . Here is an example: 3r33333.  
3r33333.  
3r33333. $ decode-dimms
Size 4096 MB
Banks x Rows x Columns x Bits 8 x 15 x 10 x 64
Ranks 2
3r33333.  
We do not need to understand the whole DDR DIMM scheme, we want to understand the operation of only one cell storing one bit of information. Or rather, we are only interested in the update process. 3r33333.  
3r33333.  
Consider two sources:
 
3r33333.  
3r33434.  
DRAM update tutorial from University of Utah
 
And excellent documentation gigabit chip from Micron: “Designing TN-46-09 for 1Gb DDR SDRAM” 3r3439. 3r33440.  
3r3442. 3r33333.  
Every bit in the dynamic memory should be updated: it usually happens every 64 ms (the so-called static update). This is quite an expensive operation. To avoid one major stop every 64 ms, the process is divided into 8192 smaller update operations. In each of them, the computer's memory controller sends update commands to the DRAM chips. After receiving the instructions, the chip will update 1/8192 cells. If you count, then 64 ms /8192 = 7812.5 ns or ??? μs. This means the following:
 
3r33333.  
3r33434.  
The refresh command is executed every 7812.5 ns. It is called tREFI. 3r33440.  
The update and restore process takes some time, so the chip can perform normal read and write operations again. The so-called tRFC is either 75 ns or 120 ns (as in the mentioned Micron documentation). 3r33440.  
3r3442. 3r33333.  
If the memory is hot (more than 85 ° C), then the storage time of the data in the memory drops, and the static update time is halved to 32 ms. Accordingly, the tREFI falls to ??? ns. 3r33333.  
3r33333.  
A typical memory chip is busy updating during a significant part of its operation time: from 0.4% to 5%. In addition, the memory chips are responsible for the non-trivial share of the power consumption of a typical computer, and most of this power is spent on upgrades. 3r33333.  
3r33333.  
At the time of the update, the entire memory chip is blocked. That is, every bit in memory is locked for more than 75 ns every 7812 ns. Let's measure. 3r33333.  
3r33333.  

Preparing the experiment 3r33388. 3r33333.  
To measure nanosecond precision operations, we need a very dense cycle, perhaps at C. It looks like this:
 
3r33333.  
3r33333. for (i = 0; i < ; i++) {
//Load into memory.
* (volatile int *) 3) one_global_var;
.3 ////Clear the cache of the CPU. It is relatively slow
/we need a memory barrier, otherwise sometimes the cycle can run
//very quickly (25 ns instead of about 160). 3r3449. //I think because of the reordering. 3r3449. asm volatile ("mfence");
3r3r944. //Measuring and recording time
Clock_gettime (CLOCK_MONOTONIC, & ts);
} 3r3-33298. 3r3380.
 
3r33170. The full code is available on github.
3r33333.  
3r33333.  
The code is very simple. We execute reading of memory. We reset the data from the CPU cache. We measure time. 3r33333.  
3r33333.  
(Note: in r3-33180. The second experiment, r3r3439. I tried to use MOVNTDQA to load data, but this requires a special non-cached memory page and root rights). 3r33333.  
3r33333.  
On my computer, the program gives the following dаta: 3r33333.  
3r33333.  
3r33333. # timestamp, cycle time
310189573? 134
310189586? 132
310189600? 137
310189613? 132
310189626? 134
310189640? 135
310189676? 359
310189690? 139
310189703? 137
3r33333.  
Usually it turns out a cycle with a duration of about 140 ns, periodically time jumps to about 360 ns. Sometimes strange results are popping up more than 3200 ns. 3r33333.  
3r33333.  
Unfortunately, the data is too noisy. It is very difficult to see if there is a noticeable delay associated with update cycles. 3r33333.  
3r33333.  
Fast Fourier transform

3r33333.  
At some point it dawned on me. Since we want to find an event with a fixed interval, we can submit the data to the FFT algorithm (fast Fourier transform), which will decode the basic frequencies. 3r33333.  
3r33333.  
I'm not the first to think about it: Mark Seborn with the famous vulnerability 3r-3219. Rowhammer
implemented this very technique back in 2015. Even looking at the Mark code, making FFT work was harder than I expected. But in the end I put all the pieces together. 3r33333.  
3r33333.  
First you need to prepare the data. FFT requires input data with a constant sampling interval. We also want to trim the data to reduce noise. Through trial and error, I found that the best result is achieved after preprocessing the dаta: 3r33333.  
3r33333.  
3r33434.  
Small values ​​(less than 1.8 average) loop iterations are cut off, ignored and replaced with zeros. We really do not want to introduce noise. 3r33440.  
All other readings are replaced by units, since we really do not care about the amplitude of the delay caused by some noise. 3r33440.  
I stopped at a sampling interval of 100 ns, but any number up to 3r32338 will do. Nyquist frequencies (double expected frequency) 3r3439. . 3r33440.  
Data must be selected with a fixed time before submission to the FFT. All reasonable sampling methods work fine, I stopped at the basic linear interpolation. 3r33440.  
3r3442. 3r33333.  
An algorithm like this:
 
3r33333.  
3r33333. UNIT = 100ns
A =[(timestamp, loop_duration),]
p = 1
for curr_ts in frange (fist_ts, last_ts, UNIT):
while not (A[p-1].timestamp <= curr_ts applying FFT on something like a square wave . 3r33333.  
3r33333.  
To facilitate the experiments, we did command line version . You can run the code yourself. Here is an example of running on my server:
 
3r33333.  
3r33333. ~ /2018-11-memory-refresh $ make
gcc -msse4.1 -ggdb -O3 -Wall -Wextra measure-dram.c -o measure-dram
./measure-dram | python3 ./analyze-dram.py
  • Verifying ASLR: main = 0x555555554890 stack = 0x7fffffefe2ec
    []Fun fact. I did 40663553 clock_gettime () 's per second
  • Measuring MOVQ + CLFLUSH time. Running 131072 iterations.
  • Writing out data
  • Input dаta: min = 117 avg = 176 med = 167 max = 8172 items = 131072
  • Cutoff range 212-inf
    []127849 items below cutoff, 0 items above cutoff, 3223 items non-zero
  • Running FFT
  • Top frequency above 2kHz below 250kHz has 3133 r3449 has a magnitude of 7716.[+]Top frequency spikes above 2kHZ are at:
    127906Hz 7716
    255813Hz 7947
    383720Hz 7460
    511626Hz 7141
    3r33333.  
    I have to admit, the code is not completely stable. In case of problems, it is recommended to disable Turbo Boost, CPU frequency scaling and performance optimization. 3r33333.  
    3r33333.  

    Conclusion

    3r33333.  
    From this work there are two main conclusions. 3r33333.  
    3r33333.  
    We have seen that low-level data is rather difficult to analyze and seems rather noisy. Instead of evaluating with the naked eye, you can always use the good old FFT. When preparing the data, it is necessary in some sense to take what was desired. 3r33333.  
    3r33333.  
    Most importantly, we have shown that it is often possible to measure subtle hardware behavior from a simple process in user space. This kind of thinking led to the discovery of 3r-3399. Rowhammer
    original vulnerability. , it is implemented in the Meltdown /Specter attacks and is again shown in the recent Rowhammer's reincarnation for ECC memory . 3r33333.  
    3r33333.  
    Much remains beyond the scope of this article. We barely touched the internal workings of the memory subsystem. For further reading I recommend:
     
    3r33333.  
    3r33434.  
    3r33414. L3 cache mapping on Sandy Bridge processors
    3r33440.  
    How the physical address is mapped to strings and banks in DRAM 3r3439. 3r33440.  
    3r33424. Hannu Hartikainen cracked the DDR3 SO-DIMM and made it work slower 3r3439. 3r33440.  
    3r3442. 3r33333.  
    Finally, here is a good description of old ferrite memory:
     
    3r33333.  
    3r33434.  
    PDP-11 ferrite memory explanation from the University of Sydney
    3r33440.  
    3r3442. 3r33450.
    ! function (e) {function t (t, n) {if (! (n in e)) {for (var r, a = e.document, i = a.scripts, o = i.length; o-- ;) if (-1! == i[o].src.indexOf (t)) {r = i[o]; break} if (! r) {r = a.createElement ("script"), r.type = "text /jаvascript", r.async =! ? r.defer =! ? r.src = t, r.charset = "UTF-8"; var d = function () {var e = a.getElementsByTagName ("script")[0]; e.parentNode.insertBefore (r, e)}; "[object Opera]" == e.opera? a.addEventListener? a.addEventListener ("DOMContentLoaded", d ): d ()}}} t ("//mediator.mail.ru/script/2820404/"""_mediator") () (); 3r33448.
    3r33450.
  • + +1 -

    Add comment