Locality of reference, also called the principle of locality, is a common idea in the area of compiler optimization. In a nutshell, it means that programs execute faster when they are written so that each memory access stays within a small range of the previous one. This matters because reading from the CPU cache is far faster than reading from main memory.
# Introduction
In a modern CPU, when the processor accesses a memory location, it also caches a handful of neighbouring memory locations to gain better performance. Suppose that at some instant the CPU accesses the memory location `0x55ac388b49d0`. Then, depending on the size of the cache, the processor will also cache the contents of `0x55ac388b49d1`, `0x55ac388b49d2`, `0x55ac388b49d3`, and so on.
This means that if the processor now accesses `0x55ac388b49d1`, it does not have to do anything to fetch the data: the content is already in the cache and is used directly. But if, after accessing `0x55ac388b49d1`, the program instead reads the memory at location `0xffe34445ffff`, that address probably won't be in the cache (on an average laptop, cache memory is only around 6-12 MB). So to do any operation on that location, the CPU has to step away from the cache and access main memory, which is much slower than reading from the cache or a register.
As a quick refresher, here are the components in increasing order of access latency:

CPU register < L1 cache < L2 cache < main memory < secondary storage (SSD/HDD)

Their monetary cost runs in the opposite order: adding more L1 cache to your PC is far more expensive than adding the same amount of main memory. That's why a typical PC has only a few MB of CPU cache, tens of GB of main memory, and hundreds of GB of secondary storage. They come in increasing capacity, decreasing cost, and decreasing read/write speed.
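To get a feel for this gap, here is a rough micro-benchmark sketch (not from the original text; the buffer size and LCG constants are illustrative choices) that sums the same buffer twice, once sequentially and once in a pseudo-random order. On most machines the scattered pass is several times slower purely because of cache misses; compile in release mode, and expect the exact ratio to vary by CPU:

```rust
use std::hint::black_box;
use std::time::Instant;

fn main() {
    const N: usize = 1 << 24; // 16 Mi bytes, far larger than typical CPU caches
    let data = vec![1u8; N];

    // Sequential pass: each access is right next to the previous one,
    // so almost every read is served from the cache.
    let start = Instant::now();
    let mut sum: u64 = 0;
    for i in 0..N {
        sum += data[i] as u64;
    }
    black_box(sum);
    println!("sequential: {:?}", start.elapsed());

    // Scattered pass: a simple pseudo-random walk over the same buffer,
    // so most reads miss the cache and must go to main memory.
    let start = Instant::now();
    let mut sum: u64 = 0;
    let mut state: u64 = 1;
    for _ in 0..N {
        // Step a 64-bit linear congruential generator (Knuth's constants).
        state = state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        sum += data[state as usize % N] as u64;
    }
    black_box(sum);
    println!("scattered:  {:?}", start.elapsed());
}
```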
Having this in mind, let's go through a quick example.

# Understanding with an example

Have a quick glance at the example below:
```rust
fn main() {
    let mut array_one = [true; 1_000_000]; // allocate 1,000,000 bytes of memory, 1 for each bool value
    let mut array_two = [false; 1_000_000]; // allocate another 1,000,000 bytes of memory

    array_one[0] = false; // location: 0x7ffc93f560a8
    array_one[1] = false; // location: 0x7ffc93f560a9
    array_one[2] = false; // location: 0x7ffc93f560aa
    // ............
    // location: 0x7ffc9404a2e7 for the last element

    array_two[0] = true; // location: 0x7ffc9404a2e8
    array_two[1] = true; // location: 0x7ffc9404a2e9
    array_two[2] = true; // location: 0x7ffc9404a2ea
    // .............
    // location: 0x7ffc9413e527 for the last element
}
```
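Those addresses are only illustrative; they will differ on every run and every machine. If you want to see real addresses, Rust's `{:p}` format specifier prints the address behind a reference. A minimal sketch (note that in release mode the optimizer may lay things out differently or remove unused arrays entirely):

```rust
fn main() {
    let array_one = [true; 1_000_000];
    let array_two = [false; 1_000_000];

    // `{:p}` prints the address that a reference points to. Consecutive
    // elements of an array sit at consecutive addresses.
    println!("array_one[0] at {:p}", &array_one[0]);
    println!("array_one[1] at {:p}", &array_one[1]);
    println!("array_two[0] at {:p}", &array_two[0]);
}
```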
Now let's look at this program line by line:

1. Declaration of the `main` function.
2. Allocate an array named `array_one` holding 1 million `true` boolean values. Rust uses 1 byte to represent a single boolean, so this statement allocates 1 million bytes of memory, say from `0x7ffc93f560a8` to `0x7ffc9404a2e7`.
3. Allocate another 1 million bytes of memory for 1 million `false` boolean values in the variable `array_two`, from `0x7ffc9404a2e8` to `0x7ffc9413e527`.
4. Access the first element of `array_one`, which is at memory location `0x7ffc93f560a8`.
5. Access the second element of `array_one`, at memory location `0x7ffc93f560a9`. Note that by this time, the CPU has probably already cached the contents of `array_one[2]`, `array_one[3]`, `array_one[4]`, and so on.
6. Access the third element of `array_one`, at memory location `0x7ffc93f560aa`. As this value is already in the cache, this statement executes very quickly. If the program had also accessed `array_one[3]`, that would be nearly instant too, since that value is already in the cache as well.
7. Access the first element of `array_two`, at memory location `0x7ffc9404a2e8`. Before this point, the CPU cache was holding the contents of `0x7ffc93f560aa` and the following memory locations. But here the program has to access the content of `0x7ffc9404a2e8`, which is about 1 million bytes away. As CPU caches are not big enough to hold a million bytes of cached data from a single program, the processor now has to fetch the content from main memory, which is slow compared to reading from the cache.
8. Access the second element of `array_two`. At this point, depending on the caching policy of your CPU, it may need to evict the previous cache entries from `array_one` and cache new entries from `array_two`.
9. Access the third element of `array_two`, which is hopefully in the cache by now, so it may be nearly instant.

# What will the compiler do?
Before talking about how a compiler optimizes a program so that it better aligns with the principle of locality: if the logic in your code is too complex for the compiler to analyze, it is free to do absolutely nothing. Optimizations also differ across compilers; for example, a compiler for a custom embedded processor may not perform this kind of optimization at all.
In addition, the compiler must preserve the behaviour of the program: in its eagerness to optimize the code, it must never make the program behave differently from what the programmer wrote. As someone correctly put it:

> Correctness of a program should be prioritized over its performance.

OK, now let's look at another program, which most compilers will hopefully optimize with the principle of locality in mind:
```rust
fn main() {
    let mut a = vec![2; 1000];
    let mut b = vec![1; 1000];
    let mut c = vec![3; 1000];

    // Set all elements to 0
    for i in 0..1000 {
        a[i] = 0;
        b[i] = 0;
        c[i] = 0;
    }
}
```
Suppose the elements of vector `a` are stored in contiguous memory starting from address Q. They are then arranged in memory as Q -> Q+1 -> Q+2 -> ... -> Q+998 -> Q+999 (treating one element as one address step, for simplicity). Similarly, the elements of `b` run from R -> R+1 -> ... -> R+999, and the elements of `c` from S -> S+1 -> ... -> S+999.

Then, as the program is written, memory will be accessed in the order:

Q -> R -> S -> Q+1 -> R+1 -> S+1 -> ... -> Q+999 -> R+999 -> S+999

Now, if the CPU caches Q+1, Q+2, and so on while reading Q, that work is completely wasted, because the next access is made to location R. Again, when R is accessed, the CPU may cache the addresses R+1, R+2, and so on. This is still useless, because the next memory access goes to location S. This repeats again and again. Now we can see how an innocent-looking program can be so bad in terms of caching. If we were asked to write a program with exactly the same behaviour, but in a way that maximizes cache utilization, it might look something like this:
```rust
fn main() {
    let mut a = vec![2; 1000];
    let mut b = vec![1; 1000];
    let mut c = vec![3; 1000];

    for i in 0..1000 {
        a[i] = 0;
    }
    for i in 0..1000 {
        b[i] = 0;
    }
    for i in 0..1000 {
        c[i] = 0;
    }
}
```
Now the memory accesses happen in the order:

Q -> Q+1 -> Q+2 -> ... -> Q+999 -> R -> R+1 -> ... -> R+999 -> S -> S+1 -> S+2 -> ... -> S+999
Again, breaking down the accesses: while accessing locations Q, Q+1, and Q+2, the CPU will have cached the contents of Q+3, Q+4, and so on, up to some number of elements depending on the size of the cache. When the cached run is exhausted, say around Q+31, it needs to fetch Q+32, Q+33, ... from main memory, but by that point the other nearby locations will already have been cached. Hence this version utilizes the cache far better than the previous one.
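The "around Q+31" above is really about cache lines: the CPU moves data between memory and cache in fixed-size lines, commonly 64 bytes. Here is a hedged sketch of one way to observe this (the 64-byte line size is an assumption; it varies by CPU). Touching every 64th byte performs 64x fewer reads than touching every byte, yet it still pulls roughly the same number of cache lines from memory, so it is nowhere near 64x faster:

```rust
use std::hint::black_box;
use std::time::Instant;

fn main() {
    const N: usize = 1 << 26; // 64 MiB buffer
    let data = vec![1u8; N];

    // Touch every byte: consecutive accesses share a cache line,
    // so only about 1 in 64 reads actually goes to main memory.
    let start = Instant::now();
    let mut sum: u64 = 0;
    for i in 0..N {
        sum += data[i] as u64;
    }
    black_box(sum);
    println!("stride 1:  {:?}", start.elapsed());

    // Touch one byte per assumed 64-byte cache line: 64x fewer reads,
    // but roughly the same number of cache lines fetched from memory.
    let start = Instant::now();
    let mut sum: u64 = 0;
    let mut i = 0;
    while i < N {
        sum += data[i] as u64;
        i += 64;
    }
    black_box(sum);
    println!("stride 64: {:?}", start.elapsed());
}
```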
The compiler's optimizer is expected to perform exactly this kind of loop transformation itself. So as long as your compiler has an optimizer, it may generate assembly for the first version of this program that is equivalent to the assembly of the second one, as long as the behaviour of the program is preserved.
So if you are writing code for a modern, mature compiler backend like LLVM (which rustc and Clang are built on) and you compile with the optimizer enabled (release mode, in the context of Rust), you can usually leave this kind of task to the optimizer.
But if your program logic is complex and you don't think the compiler can do it for you, it is of course a better idea to restructure the code yourself, as long as doing so does not make your codebase a mess that is harder to reason about.
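One practical way to make life easy for the optimizer in Rust is to express such loops with slice helpers instead of manual indexing. A sketch of the same zeroing program in that style (using the standard `slice::fill` method):

```rust
fn main() {
    let mut a = vec![2; 1000];
    let mut b = vec![1; 1000];
    let mut c = vec![3; 1000];

    // Each `fill` is one sequential sweep over a contiguous buffer:
    // exactly the cache-friendly pattern of the second version above,
    // with no per-index bounds checks for the optimizer to remove.
    a.fill(0);
    b.fill(0);
    c.fill(0);
}
```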
# Further Readings

- **Temporal locality**: the assumption that the same memory address will be accessed again in the near future.
- **Spatial locality**: the assumption that if a memory location is accessed, some nearby location will be accessed next. The example above relies on this type of locality: the processor assumes that if address P is accessed, something near it, such as P+1, will be accessed next. (See the sketch after this list.)
- **Branch locality**: the assumption that control flow will keep taking one of only a few possible paths, as in loops with simple structure, so upcoming jump targets can be predicted.
- **Equidistant locality**: the assumption that memory accesses follow a regular stride (for example, every fourth address), sitting halfway between spatial and branch locality.
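To tie spatial locality back to code, here is a small illustrative sketch (not from the original article) of the classic row-major vs. column-major traversal of a 2-D grid. Both loops compute the same sum, but the row-major one visits consecutive addresses while the column-major one jumps N elements between accesses:

```rust
fn main() {
    const N: usize = 1024;
    // A 2-D grid stored row by row in one contiguous allocation.
    let grid = vec![1u32; N * N];

    // Row-major traversal: consecutive iterations touch consecutive
    // addresses, exploiting spatial locality.
    let mut row_major: u64 = 0;
    for row in 0..N {
        for col in 0..N {
            row_major += grid[row * N + col] as u64;
        }
    }

    // Column-major traversal of the same data: consecutive iterations
    // are N elements apart, defeating spatial locality.
    let mut col_major: u64 = 0;
    for col in 0..N {
        for row in 0..N {
            col_major += grid[row * N + col] as u64;
        }
    }

    // Same result, very different memory access pattern.
    assert_eq!(row_major, col_major);
}
```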