As software engineers we want to make code run faster, though performance often only becomes a concern once it's already a problem in production. When we write software we look for the usual suspects that make it run slow, like:
- Creating objects in loops
- Recomputing values multiple times
- Database cache misses
Something we rarely think about is the hardware implications of our code. Look at the following two methods: what would you expect their runtimes to be, relative to each other?
void loopOverSecondIndex() {
    for(int i = 1; i < SIZE; i++) {
        for(int j = 1; j < SIZE; j++) {
            largeArray[i][j] += largeArray[i][j-1];
        }
    }
}

void loopOverFirstIndex() {
    for(int j = 1; j < SIZE; j++) {
        for(int i = 1; i < SIZE; i++) {
            largeArray[i][j] += largeArray[i][j-1];
        }
    }
}
From a quick glance I would say the ratio is 1:1. However, running the following program produced some surprising results.
#include <stdio.h>
#include <time.h>

#define SIZE 2000
#define TEST_SIZE 100

int largeArray[SIZE][SIZE];

void setup();
long runTest(void (*funPointer)(void));
void loopOverSecondIndex();
void loopOverFirstIndex();
void loopOverAndModifyFirstIndex();
void printStats(char* testRun, long* results);

int main() {
    long firstIndexResults[TEST_SIZE];
    long firstIndexAndModifyResults[TEST_SIZE];
    long secondIndexResults[TEST_SIZE];

    for(int i = 0; i < TEST_SIZE; i++) {
        firstIndexResults[i] = runTest(loopOverFirstIndex);
    }
    printStats("First Index [i+1][j]", firstIndexResults);

    for(int i = 0; i < TEST_SIZE; i++) {
        firstIndexAndModifyResults[i] = runTest(loopOverAndModifyFirstIndex);
    }
    printStats("First Index [i+1][j] += [i][j]", firstIndexAndModifyResults);

    for(int i = 0; i < TEST_SIZE; i++) {
        secondIndexResults[i] = runTest(loopOverSecondIndex);
    }
    printStats("Second Index [i][j+1]", secondIndexResults);
}

void printStats(char* testRun, long* results) {
    long long averageTime = 0;
    long min = results[0];
    long max = results[0];
    for(int i = 0; i < TEST_SIZE; i++) {
        averageTime += results[i];
        if( results[i] < min ) {
            min = results[i];
        }
        if( results[i] > max ) {
            max = results[i];
        }
    }
    printf("%s\n", testRun);
    printf("---- Stats for run ----\n");
    printf("Max Runtime: %06ld\n", max);
    printf("Average Runtime: %06ld\n", (long)(averageTime / TEST_SIZE));
    printf("Min Runtime: %06ld\n\n", min);
}

void loopOverSecondIndex() {
    for(int i = 1; i < SIZE; i++) {
        for(int j = 1; j < SIZE; j++) {
            largeArray[i][j] += largeArray[i][j-1];
        }
    }
}

void loopOverFirstIndex() {
    for(int j = 1; j < SIZE; j++) {
        for(int i = 1; i < SIZE; i++) {
            largeArray[i][j] += largeArray[i][j-1];
        }
    }
}

void loopOverAndModifyFirstIndex() {
    for(int j = 1; j < SIZE; j++) {
        for(int i = 1; i < SIZE; i++) {
            largeArray[i][j] += largeArray[i-1][j];
        }
    }
}

long runTest( void (*funPointer)(void) ) {
    setup();
    clock_t startTime, endTime;
    startTime = clock();
    (*funPointer)();
    endTime = clock();
    //return (endTime - startTime)/(double)CLOCKS_PER_SEC;
    return (endTime - startTime);
}

void setup() {
    for(int i = 0; i < SIZE; i++) {
        for(int j = 0; j < SIZE; j++) {
            largeArray[i][j] = 1;
        }
    }
}
To build and run it, we run (from _include/software/hardware-runtime):
$> gradle mainExecutable && ./build/binaries/mainExecutable/main
:mainCExtractHeaders UP-TO-DATE
:compileMainExecutableMainC
:linkMainExecutable
:mainExecutable
BUILD SUCCESSFUL
Total time: 0.671 secs
First Index [i+1][j]
---- Stats for run ----
Max Runtime: 038741
Average Runtime: 032946
Min Runtime: 031661
Second Index [i][j+1]
---- Stats for run ----
Max Runtime: 013336
Average Runtime: 009877
Min Runtime: 009419
The loopOverSecondIndex method is roughly 3.3 times as fast as loopOverFirstIndex (comparing average runtimes: 32946 / 9877 ≈ 3.3). Wow… That's a surprising result.
Dig into the results
So we need to go over some definitions:
- Cache
  - A processor has several of these. Let's call them L1 and L2. To get data into L1 it must first be in L2, and to get into L2 it has to be in RAM. The process by which memory moves from RAM into a cache is hardware-driven; when there is a cache miss, the processor falls through to the next level to pull in the value.
  - Caches dramatically reduce memory access time. Some documentation can be found on Wikipedia.
- Locality
  - The tendency of programs to access memory locations near recently accessed ones (spatial locality), or to reuse the same locations (temporal locality). Caches are built to exploit this.
So what does this mean to us as engineers?
The faster example ([i][j+1]) uses locality to improve performance. When you access a variable like foo[0][0], that isn't the only data that gets pulled into the cache; the surrounding memory comes with it. For example, you will have foo[0][0] through foo[0][3] in the cache, so out of every 4 requests only 1 is a cache miss.
In the slower example ([i+1][j]) you still pull foo[0][0] through foo[0][3] into the cache, but instead of the next request being foo[0][1], it's foo[1][0]. This means there will be a cache miss EVERY time, forcing the processor out to L2 or RAM to get the value.
Something else processors do for you is prefetch values from memory, in an attempt to speed up predictable access patterns.
Sooo…. What does this mean?
If you care about runtime, you should think about how data is laid out in memory and HOW you access it. If you have to loop over a column instead of a row, you may want to transform the data so that you can run along the row instead.