Memory Benchmark

Memory performance comparison on different memory regions

Note: the old hugepage comparison results have been moved here

About memory benchmark

A memory benchmark program called memlatency measures memory latency with a simple benchmark routine, sketched as meta-code in the following figure. It invokes this routine several times to calculate the average latency and the measurement error (standard deviation). The basic parameters of the benchmark are the memory size and the number of iterations. By default, the benchmark program allocates a fairly large amount of memory, such as 64MB or 128MB. With 4KB paging on the x86 architecture, a 128MB memory region, for instance, is covered by 32,768 PTEs, which consume 128KB if Physical Address Extension (PAE) is disabled (4 bytes per PTE) and 256KB in PAE mode (8 bytes per PTE). Since the memory region is allocated anew on every invocation, latency at small iteration counts is higher for the random access patterns because of page faults, each of which installs a PTE in main memory. Once all PTEs are installed, the measured latency gradually drops. On this page, results are presented as 2D graphs, where the X-axis is "n_iterations", the Y-axis is "average latency", and each measured point has a vertical error bar showing the measurement error.

    allocate_memory(size);
    start_time = get_time();
    for (i = 0; i < n_iterations; i++) {
        [ memory operation ]
    }
    end_time = get_time();
    free_memory();
    latency = (end_time - start_time) / n_iterations;
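The meta-code above could be turned into C roughly as follows. This is only a sketch under stated assumptions, not the actual memlatency source: the names mem_op_fn, measure_once, store_op and mean_variance are hypothetical, the timer is the C11 timespec_get wall clock (a monotonic clock would be preferable where available), and the variance is the population variance.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical callback: performs n_iterations memory operations on buf. */
typedef void (*mem_op_fn)(double *buf, size_t n_elems, size_t n_iterations);

/* Wall-clock time in seconds, using C11 timespec_get (an assumption;
 * the original timer is not specified). */
double get_time(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Placeholder for "[ memory operation ]": one store per iteration.
 * volatile keeps the compiler from deleting the stores. */
void store_op(double *buf, size_t n_elems, size_t n_iterations)
{
    volatile double *p = buf;
    size_t k = 0;
    for (size_t i = 0; i < n_iterations; i++) {
        p[k] = (double)i;
        if (++k == n_elems)
            k = 0;
    }
}

/* One measurement: allocate, time the loop, free.
 * Returns seconds per iteration, or -1.0 if the allocation fails.
 * The buffer is deliberately left untouched before timing, so page
 * faults are taken inside the timed loop. */
double measure_once(mem_op_fn op, size_t size_bytes, size_t n_iterations)
{
    assert(n_iterations > 0 && size_bytes >= sizeof(double));
    size_t n_elems = size_bytes / sizeof(double);
    double *buf = malloc(n_elems * sizeof *buf);
    if (buf == NULL)
        return -1.0;
    double start_time = get_time();
    op(buf, n_elems, n_iterations);
    double end_time = get_time();
    free(buf);
    return (end_time - start_time) / (double)n_iterations;
}

/* Mean and population variance of n latency samples. */
void mean_variance(const double *x, size_t n, double *mean, double *variance)
{
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += x[i];
    *mean = sum / (double)n;
    double sq = 0.0;
    for (size_t i = 0; i < n; i++)
        sq += (x[i] - *mean) * (x[i] - *mean);
    *variance = sq / (double)n;
}
```

A caller would collect several measure_once results in an array and pass them to mean_variance. The standard deviation used for the error bar is the square root of the variance (sqrt from <math.h>, which usually requires linking with -lm).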

Memory access pattern

Hardware (or system software) may behave differently depending on the memory access pattern, so we picked three distinctive memory access patterns, which are shown below. The stream copy and random copy access patterns each perform one load and one store operation per iteration. The random read access pattern performs two load operations per iteration, so that every access pattern performs the same amount of memory operations. By default, each element is of type double, which is 8 bytes, so each iteration performs 16 bytes of memory operations.

The first pattern is a simple memory-copy-like pattern: it divides the allocated memory region in half and copies from one half to the other, element by element. This type of memory access pattern clearly shows the characteristics of hardware prefetching.
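A minimal sketch of this stream copy pattern in C might look like the following. The function name stream_copy is hypothetical, and wrapping back to the start of each half when n_iterations exceeds the half size is an assumption, since the description does not say what happens in that case.

```c
#include <assert.h>
#include <stddef.h>

/* Stream copy: one load from the first half and one store to the
 * second half per iteration, walking both halves sequentially. */
void stream_copy(double *buf, size_t n_elems, size_t n_iterations)
{
    assert(n_elems >= 2);
    size_t half = n_elems / 2;
    /* volatile keeps the compiler from removing or merging accesses */
    volatile double *src = buf;
    volatile double *dst = buf + half;
    size_t k = 0;
    for (size_t i = 0; i < n_iterations; i++) {
        dst[k] = src[k];
        if (++k == half)   /* assumed behavior: wrap around */
            k = 0;
    }
}
```

With 8-byte doubles, each iteration moves 16 bytes: 8 loaded and 8 stored.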

The second pattern randomly picks two positions, loads data from the first position, and stores it to the second position. This access pattern may expose the TLB miss penalty.
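The random copy pattern could be sketched in C as shown here. How the original benchmark generates its random positions is not stated, so the inline xorshift64 generator is an assumption chosen to keep the per-iteration overhead small. The names random_copy and xorshift64 are illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One step of a xorshift64 pseudo-random generator.
 * The state must be nonzero. */
static uint64_t xorshift64(uint64_t *state)
{
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    return x;
}

/* Random copy: per iteration, load from one random position and
 * store to another random position in the same array. */
void random_copy(double *buf, size_t n_elems, size_t n_iterations,
                 uint64_t seed)
{
    assert(n_elems > 0 && seed != 0);
    /* volatile keeps the compiler from removing the accesses */
    volatile double *p = buf;
    uint64_t state = seed;
    for (size_t i = 0; i < n_iterations; i++) {
        size_t src = (size_t)(xorshift64(&state) % n_elems);
        size_t dst = (size_t)(xorshift64(&state) % n_elems);
        p[dst] = p[src];
    }
}
```

Note that the index computation is part of the timed loop in this sketch, so the measured latency includes the cost of generating the two positions.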

The last pattern performs only load operations, whereas the previous two patterns combine a load and a store. Each element in the memory array contains a random number that is less than the size of the array. The content of an element is used as the index for the next load operation.
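The random read pattern amounts to index chasing, which might be sketched in C as follows. The function names are hypothetical, and storing the indices as 8-byte uint64_t values instead of double is an assumption made to keep the element size at 8 bytes while avoiding float-to-integer conversions in the loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One step of a xorshift64 pseudo-random generator (nonzero state). */
static uint64_t xorshift64(uint64_t *state)
{
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    return x;
}

/* Fill idx[] so that every element holds a random index < n_elems. */
void fill_random_indices(uint64_t *idx, size_t n_elems, uint64_t seed)
{
    assert(n_elems > 0 && seed != 0);
    uint64_t state = seed;
    for (size_t i = 0; i < n_elems; i++)
        idx[i] = xorshift64(&state) % n_elems;
}

/* Random read: two dependent loads per iteration. The value loaded
 * is the index of the next load. Returning the final position gives
 * the caller a result to use, so the loads are not dead code. */
uint64_t random_read(const uint64_t *idx, size_t n_iterations)
{
    uint64_t pos = 0;
    for (size_t i = 0; i < n_iterations; i++) {
        pos = idx[pos];   /* first load */
        pos = idx[pos];   /* second load, address depends on the first */
    }
    return pos;
}
```

Because each address depends on the previously loaded value, the loads cannot overlap. One caveat of a purely random fill is that the chain can fall into a cycle much shorter than the array; filling the array with a single-cycle permutation would avoid this.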

Kazutomo Yoshii <[email protected]>

Last modified: Fri Aug 25 15:06:19 CDT 2006