Discussion:
AMD CPU funny
Vir Campestris
2023-12-20 17:17:23 UTC
Permalink
This is not the right group for this - but I don't know where is.
Suggestions on a postcard please...

For reasons I won't go into I've been writing some code to evaluate
memory performance on my AMD Ryzen 5 3400G.

It says in the stuff I've found that each core has an 8-way set
associative L1 data cache of 128k (and an L1 instruction cache); an L2
cache of 512k, also set associative; and there's an L3 cache of 4MB.
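As a quick cross-check of those numbers on Linux, something like this (illustrative commands) reports what the kernel thinks the cache sizes are:

$ getconf -a | grep -i CACHE
$ lscpu | grep -i cache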

To measure the performance I have three nested loops.

The innermost one goes around a loop incrementing a series of 64 bit
memory locations. The length of the series is set by the outermost loop.

The middle one repeats the innermost loop so that the number of memory
accesses is constant regardless of the series length.

The outermost one sets the series length. It starts at 1, and doubles it
each time.

I _thought_ what would happen is that as I increase the length of the
series after a while the data won't fit in the cache, and I'll see a
sudden slowdown.

What I actually see is:

With a series length of 56 to 128 bytes I get the highest speed.

With a series length of 500B to 1.5MB, I get a consistent speed of about
2/3 the highest speed.

Once the series length exceeds 1.5MB the speed drops, and is consistent
from then on. That I can see is main memory speed, and is about 40% of
the highest.

OK so far.

But...
Series length 8B is about the same as the 56 to 128 speed. Series length
16B is a bit less. Series length 32B is a lot less. Not as slow as main
memory, but not much more than half the peak speed. My next step up is
the peak speed. Series length 144 to 448 is slower still - slower in
fact than the main memory speed.

WTF?

I can post the code (C++, but not very complex) if that would help.

Andy
Theo
2023-12-20 18:08:47 UTC
Permalink
Post by Vir Campestris
This is not the right group for this - but I don't know where is.
Suggestions on a postcard please...
I'm crossposting this to comp.arch, where they may have some ideas.
Post by Vir Campestris
For reasons I won't go into I've been writing some code to evaluate
memory performance on my AMD Ryzen 5 3400G.
It says in the stuff I've found that each core has an 8-way set
associative L1 data cache of 128k (and an L1 instruction cache); an L2
cache of 512k, also set associative; and there's an L3 cache of 4MB.
To measure the performance I have three nested loops.
The innermost one goes around a loop incrementing a series of 64 bit
memory locations. The length of the series is set by the outermost loop.
The middle one repeats the innermost loop so that the number of memory
accesses is constant regardless of the series length.
The outermost one sets the series length. It starts at 1, and doubles it
each time.
I _thought_ what would happen is that as I increase the length of the
series after a while the data won't fit in the cache, and I'll see a
sudden slowdown.
With a series length of 56 to 128 bytes I get the highest speed.
With a series length of 500B to 1.5MB, I get a consistent speed of about
2/3 the highest speed.
Once the series length exceeds 1.5MB the speed drops, and is consistent
from then on. That I can see is main memory speed, and is about 40% of
the highest.
OK so far.
But...
Series length 8B is about the same as the 56 to 128 speed. Series length
16B is a bit less. Series length 32 is a lot less. Not as slow as main
memory, but not much more than half the peak speed. My next step up is
the peak speed. Series length 144 to 448 is slower still - slower in
fact than the main memory speed.
WTF?
I can post the code (C++, but not very complex) if that would help.
For 'series length 8B/16B/32B' do you mean 8 bytes? ie 8B is a single 64
bit word transferred?

What instruction sequences are being generated for the 8/16/32/64 byte
loops? I'm wondering if the compiler is using different instructions,
eg using MMX, SSE, AVX to do the operations. Maybe they are having
different caching behaviour?
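One way to check is to dump the generated code, e.g. (assuming the source
is in cache.cpp; adjust the names):

$ g++ -O3 -S -o cache.s cache.cpp   # the compiler's assembly output
$ objdump -d ./cache | less         # disassembly of the built binary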

It would help if you could tell us the compiler and platform you're using,
including version.

Theo
MitchAlsup
2023-12-20 18:58:35 UTC
Permalink
Post by Theo
Post by Vir Campestris
This is not the right group for this - but I don't know where is.
Suggestions on a postcard please...
I'm crossposting this to comp.arch, where they may have some ideas.
Post by Vir Campestris
For reasons I won't go into I've been writing some code to evaluate
memory performance on my AMD Ryzen 5 3400G.
It says in the stuff I've found that each core has an 8-way set
associative L1 data cache of 128k (and an L1 instruction cache); an L2
cache of 512k, also set associative; and there's an L3 cache of 4MB.
To measure the performance I have three nested loops.
The innermost one goes around a loop incrementing a series of 64 bit
memory locations. The length of the series is set by the outermost loop.
The middle one repeats the innermost loop so that the number of memory
accesses is constant regardless of the series length.
The outermost one sets the series length. It starts at 1, and doubles it
each time.
I _thought_ what would happen is that as I increase the length of the
series after a while the data won't fit in the cache, and I'll see a
sudden slowdown.
With a series length of 56 to 128 bytes I get the highest speed.
With a series length of 500B to 1.5MB, I get a consistent speed of about
2/3 the highest speed.
Once the series length exceeds 1.5MB the speed drops, and is consistent
from then on. That I can see is main memory speed, and is about 40% of
the highest.
OK so far.
But...
Series length 8B is about the same as the 56 to 128 speed. Series length
16B is a bit less. Series length 32 is a lot less. Not as slow as main
memory, but not much more than half the peak speed. My next step up is
the peak speed. Series length 144 to 448 is slower still - slower in
fact than the main memory speed.
WTF?
I can post the code (C++, but not very complex) if that would help.
For 'series length 8B/16B/32B' do you mean 8 bytes? ie 8B is a single 64
bit word transferred?
What instruction sequences are being generated for the 8/16/32/64 byte
loops? I'm wondering if the compiler is using different instructions,
eg using MMX, SSE, AVX to do the operations. Maybe they are having
different caching behaviour?
It would help if you could tell us the compiler and platform you're using,
including version.
Theo
Can we see the code ??

Can you present a table of the timing results ??
Vir Campestris
2023-12-21 14:11:12 UTC
Permalink
Post by Theo
Post by Vir Campestris
This is not the right group for this - but I don't know where is.
Suggestions on a postcard please...
I'm crossposting this to comp.arch, where they may have some ideas.
<snip>
Post by Theo
For 'series length 8B/16B/32B' do you mean 8 bytes? ie 8B is a single 64
bit word transferred?
Yes. My system has a 64 bit CPU and 64 bit main memory.
Post by Theo
What instruction sequences are being generated for the 8/16/32/64 byte
loops? I'm wondering if the compiler is using different instructions,
eg using MMX, SSE, AVX to do the operations. Maybe they are having
different caching behaviour?
It's running the same loop for each time, but with different values for
the loop sizes.
Post by Theo
It would help if you could tell us the compiler and platform you're using,
including version.
g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Which of course tells you I'm running Ubuntu!
Post by Theo
Can we see the code ??
Can you present a table of the timing results ??
I've run this with more detailed increments on the line size, but here
are my results for powers of 2.

Size 1 gave 3.82242e+09 bytes/second.
Size 2 gave 3.80533e+09 bytes/second.
Size 4 gave 2.68017e+09 bytes/second.
Size 8 gave 2.33751e+09 bytes/second.
Size 16 gave 2.18424e+09 bytes/second.
Size 32 gave 2.10243e+09 bytes/second.
Size 64 gave 1.99371e+09 bytes/second.
Size 128 gave 1.98475e+09 bytes/second.
Size 256 gave 2.01653e+09 bytes/second.
Size 512 gave 2.00884e+09 bytes/second.
Size 1024 gave 2.02713e+09 bytes/second.
Size 2048 gave 2.01803e+09 bytes/second.
Size 4096 gave 3.26472e+09 bytes/second.
Size 8192 gave 3.85126e+09 bytes/second.
Size 16384 gave 3.85377e+09 bytes/second.
Size 32768 gave 3.85293e+09 bytes/second.
Size 65536 gave 2.06793e+09 bytes/second.
Size 131072 gave 2.06845e+09 bytes/second.


The program carries on to larger sizes, but the results are roughly stable from there on.

The code I have put in a signature block; there's no point in risking
someone posting it again. I've commented it, but no doubt not in all the
right places! I'd be interested to know what results other people get.

Thanks
Andy
--
#include <chrono>
#include <cstdint>
#include <iostream>
#include <vector>

int main()
{
    // If your computer is much slower or faster than mine
    // you may need to adjust this value.
    constexpr size_t NextCount = 1 << 28;

    std::vector<uint64_t> CacheStore(NextCount, 0);

    // Get a raw pointer to the vector.
    // On my machine (Ubuntu, g++) this improves
    // performance. Using vector's operator[]
    // results in a function call.
    uint64_t *CachePtr = &CacheStore[0];

    // SetSize is the count of the uint64_t items to be tested.
    // I assume that when this is too big for a cache the data
    // will overflow to the next level.
    // Each loop it doubles in size. I've run it with smaller
    // increments too, and the behaviour is still confusing.
    for (size_t SetSize = 1; SetSize < NextCount; SetSize <<= 1)
    {
        size_t loopcount = 0;
        size_t j = NextCount / SetSize;
        auto start = std::chrono::steady_clock::now();

        // The outer loop repeats enough times so that the data
        // written by the inner loops of various sizes is
        // approximately constant.
        for (size_t k = 0; k < j; ++k)
        {
            // The inner loop modifies data within a set of words.
            for (size_t l = 0; l < SetSize; ++l)
            {
                // Read-modify-write some data. Comment this out
                // to confirm that the looping is not the cause
                // of the anomaly.
                ++CachePtr[l];

                // This counts the actual number of memory accesses.
                // Rounding errors mean that for different SetSize
                // values the count is not completely consistent.
                ++loopcount;
            }
        }

        // Work out how long the loops took in microseconds,
        // then scale to seconds.
        auto delta =
            std::chrono::duration_cast<std::chrono::microseconds>
                (std::chrono::steady_clock::now() - start).count()
            / 1e6;

        // Calculate how many bytes per second, and print.
        std::cout << "Size " << SetSize << " gave "
                  << (double)loopcount * (double)sizeof(uint64_t) / delta
                  << " bytes/second." << std::endl;
    }

    return 0;
}
Scott Lurndal
2023-12-21 14:56:44 UTC
Permalink
Post by Vir Campestris
Post by Theo
Post by Vir Campestris
This is not the right group for this - but I don't know where is.
Suggestions on a postcard please...
I'm crossposting this to comp.arch, where they may have some ideas.
<snip>
Post by Theo
For 'series length 8B/16B/32B' do you mean 8 bytes? ie 8B is a single 64
bit word transferred?
Yes. My system has a 64 bit CPU and 64 bit main memory.
Surely your main memory is just a sequence of 8-bit bytes (+ECC).
Vir Campestris
2023-12-21 15:32:10 UTC
Permalink
Post by Scott Lurndal
Post by Vir Campestris
Post by Theo
Post by Vir Campestris
This is not the right group for this - but I don't know where is.
Suggestions on a postcard please...
I'm crossposting this to comp.arch, where they may have some ideas.
<snip>
Post by Theo
For 'series length 8B/16B/32B' do you mean 8 bytes? ie 8B is a single 64
bit word transferred?
Yes. My system has a 64 bit CPU and 64 bit main memory.
Surely your main memory is just a sequence of 8-bit bytes (+ECC).
AIUI I have two DIMMs, each of which has a 32-bit bus. I'm not up on
hardware these days, but it used to be if you wanted to write a byte
into your 16-bit memory you had to read both bytes, then write one back.

And
Michael S
2023-12-21 16:09:33 UTC
Permalink
On Thu, 21 Dec 2023 15:32:10 +0000
Post by Vir Campestris
Post by Scott Lurndal
Post by Vir Campestris
Post by Theo
Post by Vir Campestris
This is not the right group for this - but I don't know where is.
Suggestions on a postcard please...
I'm crossposting this to comp.arch, where they may have some ideas.
<snip>
Post by Theo
For 'series length 8B/16B/32B' do you mean 8 bytes? ie 8B is a
single 64 bit word transferred?
Yes. My system has a 64 bit CPU and 64 bit main memory.
Surely your main memory is just a sequence of 8-bit bytes (+ECC).
AIUI I have two DIMMs, each of which has a 32-bit bus. I'm not up on
hardware these days, but it used to be if you wanted to write a byte
into your 16-bit memory you had to read both bytes, then write one back.
And
DIMMs have 64-bit data buses - both these days and in the previous
millennium. Now, these days a DDR5 DIMM splits its 64-bit data bus into a
pair of independent 32-bit channels, but the total is still 64 bits.

That's a physical perspective. From a logical perspective, DDR3 and DDR4
DIMMs exchange data with the controller in 512-bit chunks. On a DDR5 DIMM
each channel exchanges data with the controller in 512-bit chunks.

From a signaling perspective, it is still possible (at least on non-ECC
gear) to tell the memory to write just 8 bits out of 512 and to ignore
the rest. In PC-class hardware this ability is used very rarely, if used
at all.
MitchAlsup
2023-12-21 18:36:42 UTC
Permalink
Post by Michael S
On Thu, 21 Dec 2023 15:32:10 +0000
Post by Vir Campestris
Post by Scott Lurndal
Post by Vir Campestris
Post by Theo
Post by Vir Campestris
This is not the right group for this - but I don't know where is.
Suggestions on a postcard please...
I'm crossposting this to comp.arch, where they may have some ideas.
<snip>
Post by Theo
For 'series length 8B/16B/32B' do you mean 8 bytes? ie 8B is a
single 64 bit word transferred?
Yes. My system has a 64 bit CPU and 64 bit main memory.
Surely your main memory is just a sequence of 8-bit bytes (+ECC).
AIUI I have two DIMMs, each of which has a 32-bit bus. I'm not up on
hardware these days, but it used to be if you wanted to write a byte
into your 16-bit memory you had to read both bytes, then write one back.
And
DIMMs have 64-bit data buses. Both these days and previous millennium.
Now, these days DDR5 DIMM splits 64-bit data bus into pair of
independent 32-bit channels, but the total is still 64 bits.
All DDR DIMMs have 64-bit busses (200 pins on the DIMM).
Some SDR (pre 2000) DIMMs had 32-bit busses, some ancient laptop memory
carriers had 32-bit busses.
Post by Michael S
That's a physical perspective. From logical perspective, DDR3 and DDR4
bits exchange data with controller in 512-bit chunks. On DDR5 DIMM each
channel exchanges data with controller in 512-bit chunks.
Note:: 512 bits is 64 bytes.
Post by Michael S
From signaling perspective, it is still possible (at least on non-ECC
gear) to tell to memory to write just 8 bits out of 512 and to ignore
the rest. In PC-class hardware this ability is used very rarely if used
at all.
Never justified when the CPU uses a cache. So, only unCacheable (sized)
requests can use this. AMD memory controllers hid this from the DRAMs
so we at least had the opportunity to recompute ECC on ECC-carrying
DIMMs. The MC would ask the DRC for the line of data, check (and repair)
ECC, then write the unCacheable data, and place the written data in the
outbound memory queue with its new ECC.
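In rough C++ the flow is something like this (a toy sketch only: the check
byte below is a stand-in XOR, not real SECDED ECC):

#include <cstddef>
#include <cstdint>
#include <cstring>

struct Line { uint8_t data[64]; uint8_t check; };

static uint8_t toy_check(const uint8_t *d, std::size_t n)
{
    uint8_t c = 0;
    for (std::size_t i = 0; i < n; ++i)
        c ^= d[i];
    return c;
}

// Fetch the whole line, verify it, merge the partial (sub-line) write,
// recompute the check code, and hand the whole line back for the
// outbound memory queue.
Line merge_partial_store(Line line, const uint8_t *src,
                         std::size_t offset, std::size_t len)
{
    if (toy_check(line.data, sizeof line.data) != line.check)
    {
        // a real controller would correct (or at least report) the error here
    }
    std::memcpy(line.data + offset, src, len);           // the unCacheable write
    line.check = toy_check(line.data, sizeof line.data); // new check code
    return line;
}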
Michael S
2023-12-21 22:55:07 UTC
Permalink
On Thu, 21 Dec 2023 18:36:42 +0000
Post by MitchAlsup
Post by Michael S
On Thu, 21 Dec 2023 15:32:10 +0000
Post by Vir Campestris
Post by Scott Lurndal
Post by Vir Campestris
Post by Theo
Post by Vir Campestris
This is not the right group for this - but I don't know where
is. Suggestions on a postcard please...
I'm crossposting this to comp.arch, where they may have some ideas.
<snip>
Post by Theo
For 'series length 8B/16B/32B' do you mean 8 bytes? ie 8B is a
single 64 bit word transferred?
Yes. My system has a 64 bit CPU and 64 bit main memory.
Surely your main memory is just a sequence of 8-bit bytes (+ECC).
AIUI I have two DIMMs, each of which has a 32-bit bus. I'm not up
on hardware these days, but it used to be if you wanted to write a
byte into your 16-bit memory you had to read both bytes, then
write one back.
And
DIMMs have 64-bit data buses. Both these days and previous
millennium. Now, these days DDR5 DIMM splits 64-bit data bus into
pair of independent 32-bit channels, but the total is still 64
bits.
All DDR DIMMs have 64-bit busses (200 pins on the DIMM).
Some SDR (pre 2000) DIMMs had 32-bit busses,
Are you sure? My impression always was that the word DIMM was invented
to clearly separate 64-bit DIMMs from 32-bit SIMMs.
Post by MitchAlsup
some ancient laptop
memory carriers had 32-bit busses.
But were they called just DIMM or SO-DIMM?
The latter indeed had a 72-pin variant with a 32-bit bus.
Post by MitchAlsup
Post by Michael S
That's a physical perspective. From logical perspective, DDR3 and
DDR4 bits exchange data with controller in 512-bit chunks. On DDR5
DIMM each channel exchanges data with controller in 512-bit chunks.
Note:: 512-bis is 64-bytes.
Which *not* coincidentally matches the cache line size of the majority
of x86 CPUs.
Post by MitchAlsup
Post by Michael S
From signaling perspective, it is still possible (at least on
non-ECC gear) to tell to memory to write just 8 bits out of 512 and
to ignore the rest. In PC-class hardware this ability is used very
rarely if used at all.
Never justified when CPU uses a cache. So, only unCacheable (sized)
requests can use this. AMD memory controllers hid this from the DRAMs
so we at least had the opportunity to recompute ECC on ECC carrying
DIMMs. MC would ask DRC for the line of data, check (and repair) ECC
then write the unCacheable data, and place the written data in the
outbound memory queue with its new ECC.
MitchAlsup
2023-12-21 23:49:20 UTC
Permalink
Post by Michael S
On Thu, 21 Dec 2023 18:36:42 +0000
Post by MitchAlsup
Post by Michael S
On Thu, 21 Dec 2023 15:32:10 +0000
Post by Vir Campestris
Post by Scott Lurndal
Post by Vir Campestris
Post by Theo
Post by Vir Campestris
This is not the right group for this - but I don't know where
is. Suggestions on a postcard please...
I'm crossposting this to comp.arch, where they may have some ideas.
<snip>
Post by Theo
For 'series length 8B/16B/32B' do you mean 8 bytes? ie 8B is a
single 64 bit word transferred?
Yes. My system has a 64 bit CPU and 64 bit main memory.
Surely your main memory is just a sequence of 8-bit bytes (+ECC).
AIUI I have two DIMMs, each of which has a 32-bit bus. I'm not up
on hardware these days, but it used to be if you wanted to write a
byte into your 16-bit memory you had to read both bytes, then
write one back.
And
DIMMs have 64-bit data buses. Both these days and previous
millennium. Now, these days DDR5 DIMM splits 64-bit data bus into
pair of independent 32-bit channels, but the total is still 64
bits.
All DDR DIMMs have 64-bit busses (200 pins on the DIMM).
Some SDR (pre 2000) DIMMs had 32-bit busses,
Are you sure? My impression always was that the word DIMM was invented
to clearly separate 64-bit DIMMs from 32-bit SIMMs.
Dual In-Line Memory Module means they have pins on both sides of the
board where the pins make contact with the connector. They put pins on
both sides so they could get ~200 pins {Vdd, Gnd, clock, reset, control,
and data}; this was in the early 1990s.
Post by Michael S
Post by MitchAlsup
some ancient laptop
memory carriers had 32-bit busses.
But were they called just DIMM or SO-DIMM?
The later indeed had 72-pin variant with 32-bit bus.
That sounds right.
Post by Michael S
Post by MitchAlsup
Post by Michael S
That's a physical perspective. From logical perspective, DDR3 and
DDR4 bits exchange data with controller in 512-bit chunks. On DDR5
DIMM each channel exchanges data with controller in 512-bit chunks.
Note:: 512-bis is 64-bytes.
Which *not* co-incidentally matches cache line size of majority of x86
CPUs.
Given that x86 represented 90% of all computers being sold at the time
the Std committee was doing the first one, and that the people on the
committee wanted to keep it that way, this is unsurprising.
Post by Michael S
Post by MitchAlsup
Post by Michael S
From signaling perspective, it is still possible (at least on
non-ECC gear) to tell to memory to write just 8 bits out of 512 and
to ignore the rest. In PC-class hardware this ability is used very
rarely if used at all.
Never justified when CPU uses a cache. So, only unCacheable (sized)
requests can use this. AMD memory controllers hid this from the DRAMs
so we at least had the opportunity to recompute ECC on ECC carrying
DIMMs. MC would ask DRC for the line of data, check (and repair) ECC
then write the unCacheable data, and place the written data in the
outbound memory queue with its new ECC.
Andy Burns
2023-12-21 18:05:20 UTC
Permalink
Post by Vir Campestris
Post by Scott Lurndal
Surely your main memory is just a sequence of 8-bit bytes (+ECC).
AIUI I have two DIMMs, each of which has a 32-bit bus. I'm not up on
hardware these days, but it used to be if you wanted to write a byte
into your 16-bit memory you had to read both bytes, then write one back.
I thought intel x64 machines work in cache lines of 64 bytes per memory
transaction ... maybe AMD processors are different?
Chris M. Thomasson
2023-12-21 21:20:59 UTC
Permalink
Post by Andy Burns
Post by Vir Campestris
Post by Scott Lurndal
Surely your main memory is just a sequence of 8-bit bytes (+ECC).
AIUI I have two DIMMs, each of which has a 32-bit bus. I'm not up on
hardware these days, but it used to be if you wanted to write a byte
into your 16-bit memory you had to read both bytes, then write one back.
I thought intel x64 machines work in cache lines of 64 bytes per memory
transaction ... maybe AMD processors are different?
Remember when Intel had a false sharing problem when they had 128-byte
cache lines split into two 64-byte sectors? IIRC, it was on some of their
first hyperthreaded processors. A workaround from Intel was to offset the
threads using alloca... I remember it.
Chris M. Thomasson
2023-12-22 02:28:43 UTC
Permalink
Post by Chris M. Thomasson
Post by Andy Burns
Post by Vir Campestris
Post by Scott Lurndal
Surely your main memory is just a sequence of 8-bit bytes (+ECC).
AIUI I have two DIMMs, each of which has a 32-bit bus. I'm not up on
hardware these days, but it used to be if you wanted to write a byte
into your 16-bit memory you had to read both bytes, then write one back.
I thought intel x64 machines work in cache lines of 64 bytes per
memory transaction ... maybe AMD processors are different?
Remember when Intel had a false sharing problem when they had 128 cache
lines split into two 64 regions? Iirc, it was on some of their first
hyperthreaded processors. A work around from intel was to offset threads
using alloca... I remember it.
It was a nightmare; however, at least the workaround did help a bit wrt
performance.
MitchAlsup
2023-12-21 18:30:47 UTC
Permalink
Post by Vir Campestris
Post by MitchAlsup
Can we see the code ??
Can you present a table of the timing results ??
I've run this with more detailed increments on the line size, but here
are my results for powers of 2.
{ A
Post by Vir Campestris
Size 1 gave 3.82242e+09 bytes/second.
Size 2 gave 3.80533e+09 bytes/second.
Size 4 gave 2.68017e+09 bytes/second.
Size 8 gave 2.33751e+09 bytes/second.
Size 16 gave 2.18424e+09 bytes/second.
Size 32 gave 2.10243e+09 bytes/second.
Size 64 gave 1.99371e+09 bytes/second.
Size 128 gave 1.98475e+09 bytes/second.
Size 256 gave 2.01653e+09 bytes/second.
Size 512 gave 2.00884e+09 bytes/second.
Size 1024 gave 2.02713e+09 bytes/second.
Size 2048 gave 2.01803e+09 bytes/second.
} A
{ B
Post by Vir Campestris
Size 4096 gave 3.26472e+09 bytes/second.
Size 8192 gave 3.85126e+09 bytes/second.
Size 16384 gave 3.85377e+09 bytes/second.
Size 32768 gave 3.85293e+09 bytes/second.
Size 65536 gave 2.06793e+09 bytes/second.
Size 131072 gave 2.06845e+09 bytes/second.
} B

A) Here we have a classical sequence pipelines often encounter, where
a simple loop starts off fast (4 bytes/cycle) and progressively slows
down by a factor of 2 (2 bytes/cycle) over some interval of
complexity (size). What is important is that factor of 2: something
that took 1 cycle early on starts taking 2 cycles later.

B) Here we have a second classical sequence where the performance at
some boundary (4096) reverts back to the 1-cycle pipeline performance,
only to degrade again (in basically the same sequence as A). {{Side
note: size=4096 has "flutter" in the stairstep. size={8192..32768}
has peak performance--probably something to do with sets in the cache,
and 4096 is the size of a page (TLB effects)}}.
I suspect the flutter has something to do with your buffer crossing
a page boundary.
Post by Vir Campestris
The code will continue, but the results are roughly stable for larger sizes.
The code I have put in a signature block; there's no point in risking
someone posting it again. I've commented it, but no doubt not in all the
right places! I'd be interested to know what results other people get.
Thanks
Andy
Vir Campestris
2023-12-22 13:42:39 UTC
Permalink
Post by MitchAlsup
Post by Vir Campestris
I've run this with more detailed increments on the line size, but here
are my results for powers of 2.
{ A
Post by Vir Campestris
Size 1 gave 3.82242e+09 bytes/second.
Size 2 gave 3.80533e+09 bytes/second.
Size 4 gave 2.68017e+09 bytes/second.
Size 8 gave 2.33751e+09 bytes/second.
Size 16 gave 2.18424e+09 bytes/second.
Size 32 gave 2.10243e+09 bytes/second.
Size 64 gave 1.99371e+09 bytes/second.
Size 128 gave 1.98475e+09 bytes/second.
Size 256 gave 2.01653e+09 bytes/second.
Size 512 gave 2.00884e+09 bytes/second.
Size 1024 gave 2.02713e+09 bytes/second.
Size 2048 gave 2.01803e+09 bytes/second.
} A
{ B
Post by Vir Campestris
Size 4096 gave 3.26472e+09 bytes/second.
Size 8192 gave 3.85126e+09 bytes/second.
Size 16384 gave 3.85377e+09 bytes/second.
Size 32768 gave 3.85293e+09 bytes/second.
Size 65536 gave 2.06793e+09 bytes/second.
Size 131072 gave 2.06845e+09 bytes/second.
} B
A) Here we have a classical sequence pipelines often encounter where
a simple loop starts off fast 4 bytes/cycle and progressively slows
down by a factor of 2 (2 bytes per cycle) over some interval of
complexity (size). What is important is that factor of 2 something
that took 1 cycle early starts taking 2 cycles later on.
B) here we have a second classical sequence where the performance at
some boundary (4096) reverts back to the 1 cycle pipeline of performance
only to degrade again (in basically the same sequence as A). {{Side
note: size=4096 has "flutter" in the stairstep. size={8192..32768}
has peak performance--probably something to do with sets in the cache
and 4096 is the size of a page (TLB effects)}}.
I suspect the flutter has something to do with your buffer crossing a
page boundary.
I ran it with the memory access commented out. When I do that the
overall time is consistent regardless of how many times it goes around
the inner loop, and how many times the outer one.

But pipelines are funny things.

64k x 64-bit words is, I think, my L3 cache size. It's not very big on my
CPU; they had to fit a GPU on the die.

I'd be interested to see the results from anyone else's computer.

Andy
Theo
2023-12-22 18:12:36 UTC
Permalink
Post by Vir Campestris
I'd be interested to see the results from anyone else's computer.
$ neofetch --off
OS: Kubuntu 23.10 x86_64
Host: 20QBS03N00 ThinkPad X1 Titanium Gen 1
Kernel: 6.5.0-14-generic
CPU: 11th Gen Intel i5-1130G7 (8) @ 4.000GHz
GPU: Intel Tiger Lake-UP4 GT2 [Iris Xe Graphics]
Memory: 4436MiB / 15704MiB
$ gcc --version
gcc version 13.2.0 (Ubuntu 13.2.0-4ubuntu3)

$ g++ -o cache cache.c
$ ./cache
Size 1 gave 4.13643e+09 bytes/second.
Size 2 gave 4.79971e+09 bytes/second.
Size 4 gave 4.87535e+09 bytes/second.
Size 8 gave 4.8321e+09 bytes/second.
Size 16 gave 4.71703e+09 bytes/second.
Size 32 gave 3.89488e+09 bytes/second.
Size 64 gave 4.02976e+09 bytes/second.
Size 128 gave 4.15832e+09 bytes/second.
Size 256 gave 4.19562e+09 bytes/second.
Size 512 gave 4.08511e+09 bytes/second.
Size 1024 gave 4.0796e+09 bytes/second.
Size 2048 gave 4.11983e+09 bytes/second.
Size 4096 gave 4.06869e+09 bytes/second.
Size 8192 gave 4.06807e+09 bytes/second.
Size 16384 gave 4.06217e+09 bytes/second.
Size 32768 gave 4.06067e+09 bytes/second.
Size 65536 gave 4.04791e+09 bytes/second.
Size 131072 gave 4.06143e+09 bytes/second.
Size 262144 gave 4.04301e+09 bytes/second.
Size 524288 gave 4.03872e+09 bytes/second.
Size 1048576 gave 3.97715e+09 bytes/second.
Size 2097152 gave 3.97609e+09 bytes/second.
Size 4194304 gave 3.98361e+09 bytes/second.
Size 8388608 gave 3.98617e+09 bytes/second.
Size 16777216 gave 3.98376e+09 bytes/second.
Size 33554432 gave 3.98504e+09 bytes/second.
Size 67108864 gave 3.98726e+09 bytes/second.
Size 134217728 gave 3.99495e+09 bytes/second.
MitchAlsup
2023-12-22 19:35:37 UTC
Permalink
Post by Theo
$ g++ -o cache cache.c
$ ./cache
{ A
Post by Theo
Size 1 gave 4.13643e+09 bytes/second.
Size 2 gave 4.79971e+09 bytes/second.
Size 4 gave 4.87535e+09 bytes/second.
Size 8 gave 4.8321e+09 bytes/second.
Size 16 gave 4.71703e+09 bytes/second.
} A
{ B
Post by Theo
Size 32 gave 3.89488e+09 bytes/second.
Size 64 gave 4.02976e+09 bytes/second.
Size 128 gave 4.15832e+09 bytes/second.
Size 256 gave 4.19562e+09 bytes/second.
Size 512 gave 4.08511e+09 bytes/second.
Size 1024 gave 4.0796e+09 bytes/second.
Size 2048 gave 4.11983e+09 bytes/second.
Size 4096 gave 4.06869e+09 bytes/second.
Size 8192 gave 4.06807e+09 bytes/second.
Size 16384 gave 4.06217e+09 bytes/second.
Size 32768 gave 4.06067e+09 bytes/second.
Size 65536 gave 4.04791e+09 bytes/second.
Size 131072 gave 4.06143e+09 bytes/second.
Size 262144 gave 4.04301e+09 bytes/second.
Size 524288 gave 4.03872e+09 bytes/second.
Size 1048576 gave 3.97715e+09 bytes/second.
Size 2097152 gave 3.97609e+09 bytes/second.
Size 4194304 gave 3.98361e+09 bytes/second.
Size 8388608 gave 3.98617e+09 bytes/second.
Size 16777216 gave 3.98376e+09 bytes/second.
Size 33554432 gave 3.98504e+09 bytes/second.
Size 67108864 gave 3.98726e+09 bytes/second.
Size 134217728 gave 3.99495e+09 bytes/second.
} B

The B group are essentially 4.0 with noise.
The A group are essentially 4.8 with a startup flutter.
A 20% down or 25% up change at 32::64.
Theo
2023-12-22 21:58:13 UTC
Permalink
Post by MitchAlsup
Post by Theo
$ g++ -o cache cache.c
$ ./cache
{ A
Post by Theo
Size 1 gave 4.13643e+09 bytes/second.
Size 2 gave 4.79971e+09 bytes/second.
Size 4 gave 4.87535e+09 bytes/second.
Size 8 gave 4.8321e+09 bytes/second.
Size 16 gave 4.71703e+09 bytes/second.
} A
{ B
Post by Theo
Size 32 gave 3.89488e+09 bytes/second.
...
Post by MitchAlsup
Post by Theo
Size 134217728 gave 3.99495e+09 bytes/second.
} B
The B group are essentially 4.0 with noise.
The A group are essentially 4.8 with a startup flutter.
A 20% down or 25% up change at 32::64.
The nearest machine I have to the OP is this:

OS: Ubuntu 20.04.6 LTS x86_64
Host: 1U4LW-X570/2L2T
Kernel: 5.4.0-167-generic
CPU: AMD Ryzen 7 5800X (16) @ 3.800GHz
GPU: 29:00.0 ASPEED Technology, Inc. ASPEED Graphics Family
Memory: 3332MiB / 128805MiB

gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0

Size 1 gave 6.22043e+09 bytes/second.
Size 2 gave 6.35674e+09 bytes/second.
Size 4 gave 4.14766e+09 bytes/second.
Size 8 gave 5.4239e+09 bytes/second.
Size 16 gave 6.01113e+09 bytes/second.
Size 32 gave 7.75976e+09 bytes/second.
Size 64 gave 8.7972e+09 bytes/second.
Size 128 gave 9.71523e+09 bytes/second.
Size 256 gave 9.91644e+09 bytes/second.
Size 512 gave 1.00179e+10 bytes/second.
Size 1024 gave 1.0065e+10 bytes/second.
Size 2048 gave 9.78508e+09 bytes/second.
Size 4096 gave 9.76764e+09 bytes/second.
Size 8192 gave 9.86537e+09 bytes/second.
Size 16384 gave 9.90053e+09 bytes/second.
Size 32768 gave 9.91552e+09 bytes/second.
Size 65536 gave 9.84556e+09 bytes/second.
Size 131072 gave 9.78442e+09 bytes/second.
Size 262144 gave 9.80282e+09 bytes/second.
Size 524288 gave 9.81447e+09 bytes/second.
Size 1048576 gave 9.81981e+09 bytes/second.
Size 2097152 gave 9.81456e+09 bytes/second.
Size 4194304 gave 9.70057e+09 bytes/second.
Size 8388608 gave 9.55507e+09 bytes/second.
Size 16777216 gave 9.44032e+09 bytes/second.
Size 33554432 gave 9.33896e+09 bytes/second.
Size 67108864 gave 9.28529e+09 bytes/second.
Size 134217728 gave 9.25213e+09 bytes/second.

which seems to have more flutter in group A. I get similar results in a VM
running on an AMD Ryzen 9 5950X (32) @ 3.393GHz.

Theo
MitchAlsup
2023-12-22 23:24:29 UTC
Permalink
Post by Theo
gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
{ A
Post by Theo
Size 1 gave 6.22043e+09 bytes/second.
Size 2 gave 6.35674e+09 bytes/second.
Size 4 gave 4.14766e+09 bytes/second.
Size 8 gave 5.4239e+09 bytes/second.
Size 16 gave 6.01113e+09 bytes/second.
Size 32 gave 7.75976e+09 bytes/second.
Size 64 gave 8.7972e+09 bytes/second.
} A
{ B
Post by Theo
Size 128 gave 9.71523e+09 bytes/second.
Size 256 gave 9.91644e+09 bytes/second.
Size 512 gave 1.00179e+10 bytes/second.
Size 1024 gave 1.0065e+10 bytes/second.
Size 2048 gave 9.78508e+09 bytes/second.
Size 4096 gave 9.76764e+09 bytes/second.
Size 8192 gave 9.86537e+09 bytes/second.
Size 16384 gave 9.90053e+09 bytes/second.
Size 32768 gave 9.91552e+09 bytes/second.
Size 65536 gave 9.84556e+09 bytes/second.
Size 131072 gave 9.78442e+09 bytes/second.
Size 262144 gave 9.80282e+09 bytes/second.
Size 524288 gave 9.81447e+09 bytes/second.
Size 1048576 gave 9.81981e+09 bytes/second.
Size 2097152 gave 9.81456e+09 bytes/second.
} B
{ C
Post by Theo
Size 4194304 gave 9.70057e+09 bytes/second.
Size 8388608 gave 9.55507e+09 bytes/second.
Size 16777216 gave 9.44032e+09 bytes/second.
Size 33554432 gave 9.33896e+09 bytes/second.
Size 67108864 gave 9.28529e+09 bytes/second.
Size 134217728 gave 9.25213e+09 bytes/second.
} C
Post by Theo
which seems to have more flutter in group A. I get similar results in a VM
A has some kind of startup irregularity: down then up.
B is essentially constant.
C is slowly degrading (maybe from TLB effects).
Post by Theo
Theo
Vir Campestris
2023-12-23 21:34:14 UTC
Permalink
On 22/12/2023 21:58, Theo wrote:
<snip>
Post by Theo
which seems to have more flutter in group A. I get similar results in a VM
Theo
Thank you for those. Neither of them show the odd thing I was seeing -
but I compiled with -ofast. Does that make a difference?

Andy
Thomas Koenig
2023-12-23 22:50:10 UTC
Permalink
Post by Vir Campestris
<snip>
Post by Theo
which seems to have more flutter in group A. I get similar results
in a VM
Post by Theo
Theo
Thank you for those. Neither of them show the odd thing I was seeing -
but I compiled with -ofast. Does that make a difference?
That would put it in an executable named "fast" :-)

-march=native -mtune=native might make more of a difference, depending
on how good the compiler's model of your hardware is.
David Brown
2023-12-25 14:56:19 UTC
Permalink
Post by Thomas Koenig
Post by Vir Campestris
<snip>
Post by Theo
which seems to have more flutter in group A. I get similar results
in a VM
Post by Theo
Theo
Thank you for those. Neither of them show the odd thing I was seeing -
but I compiled with -ofast. Does that make a difference?
That would put it in an executable named "fast" :-)
Some programs are really fussy about capitalisation!
Post by Thomas Koenig
-march=native -mtune=native might make more of a difference, depending
on how good the compiler's model of your hardware is.
For gcc on modern x86-64 processors (IME at least), the difference
between -O0 and -O1 is often large, but the difference between -O1 and
higher levels usually makes far less difference. The processors
themselves do a good job of things like instruction scheduling and
register renaming at runtime, and are designed to be good for running
weakly optimised code. I find it makes a bigger difference on
processors that don't do as much at run-time, such as microcontroller cores.

But the "-march=native" can make a very big difference, especially if it
means the compiler can use SIMD or other advanced instructions. (You
don't need "-mtune" if you have "-march", as it is implied by "-march" -
you only need both if you want to make a build that will run on many
x86-64 variants but is optimised for one particular one.) And the
"-march=native" benefits go well with some of the higher optimisations -
thus it is good to combine "-march=native" with "-Ofast" or "-O2".

In practice, things also vary dramatically according to the type of program.
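For example (illustrative command lines, assuming the benchmark source is
in cache.cpp):

$ g++ -O2 -march=native -o cache cache.cpp      # tuned for the build machine
$ g++ -O2 -o cache_generic cache.cpp            # portable x86-64 baseline

and then comparing the two runs.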
Vir Campestris
2023-12-27 14:20:08 UTC
Permalink
Post by David Brown
Post by Thomas Koenig
Post by Vir Campestris
<snip>
which seems to have more flutter in group A.  I get similar results
in a VM
Theo
Thank you for those. Neither of them show the odd thing I was seeing -
but I compiled with -ofast. Does that make a difference?
That would put it in an executable named "fast" :-)
YKWIM :P
Post by David Brown
Some programs are really fussy about capitalisation!
Post by Thomas Koenig
-march=native -mtune=native might make more of a difference, depending
on how good the compiler's model of your hardware is.
For gcc on modern x86-64 processors (IME at least), the difference
between -O0 and -O1 is often large, but the difference between -O1 and
higher levels usually makes far less difference.  The processors
themselves do a good job of things like instruction scheduling and
register renaming at runtime, and are designed to be good for running
weakly optimised code.  I find it makes a bigger difference on
processors that don't do as much at run-time, such as microcontroller cores.
But the "-march=native" can make a very big difference, especially if it
means the compiler can use SIMD or other advanced instructions.  (You
don't need "-mtune" if you have "march", as it is implied by "-march" -
you only need both if you want to make a build that will run on many
x86-64 variants but is optimised for one particular one.)  And the
"-march=native" benefits go well with some of the higher optimisations -
thus it is good to combine "-march=native" with "-Ofast" or "-O2".
In practice, things also vary dramatically according to the type of program.
-mtune=native _does_ make a difference. Somewhat to my surprise it makes
it _slower_ - the biggest difference being about 5% with a size of 128k
uint64_t values.

I checked -O0 (really slow), -O1 (a lot faster), -O3 (quite a bit faster
too), -Ofast (mostly slightly faster than -O3 when I use a capital O) and
-O3 -march=native.

The pipeline thing made me try something different - instead of
incrementing a value I used std::copy to copy each word to the next one.

That rose to a peak rate of about 64MB/S with a size of 32k, dropped
sharply to 45MB/s and then showed a steady decline to 40MB/s at 256k. It
then dropped sharply to 10MB/S for all larger sizes.

A much more sensible result!

Andy
David Brown
2023-12-27 15:02:11 UTC
Permalink
Post by Vir Campestris
Post by David Brown
Post by Thomas Koenig
Post by Vir Campestris
<snip>
which seems to have more flutter in group A.  I get similar results
in a VM
Theo
Thank you for those. Neither of them show the odd thing I was seeing -
but I compiled with -ofast. Does that make a difference?
That would put it in an executable named "fast" :-)
YKWIM :P
Post by David Brown
Some programs are really fussy about capitalisation!
Post by Thomas Koenig
-march=native -mtune=native might make more of a difference, depending
on how good the compiler's model of your hardware is.
For gcc on modern x86-64 processors (IME at least), the difference
between -O0 and -O1 is often large, but the difference between -O1 and
higher levels usually makes far less difference.  The processors
themselves do a good job of things like instruction scheduling and
register renaming at runtime, and are designed to be good for running
weakly optimised code.  I find it makes a bigger difference on
processors that don't do as much at run-time, such as microcontroller cores.
But the "-march=native" can make a very big difference, especially if
it means the compiler can use SIMD or other advanced instructions.
(You don't need "-mtune" if you have "march", as it is implied by
"-march" - you only need both if you want to make a build that will
run on many x86-64 variants but is optimised for one particular one.)
And the "-march=native" benefits go well with some of the higher
optimisations - thus it is good to combine "-march=native" with
"-Ofast" or "-O2".
In practice, things also vary dramatically according to the type of program.
mtune=native _does_ make a difference. Somewhat to my surprise it makes
it _slower_ - the biggest difference being about 5% with a size of 128k
UINT64.
I checked O0 (really slow) O1 (a lot faster) O3 (quite a bit faster too)
Ofast (mostly slightly faster than O3 when I use a capital O) and O3
march=native.
"-Ofast" is like -O3, but additionally allows the compiler to "bend" the
rules somewhat. For example, it allows optimisations of stores that
might cause data races if another thread can see the same data, and it
enables "-ffast-math" which treats floating point as somewhat
approximate and always finite, rather than following IEEE rules
strictly. If you are okay with this, then "-Ofast" is fine, but you
should read the documentation to check that you are happy with it.

While you are there, you can see a number of other optimisation flags
that are not enabled by any -O sets. These may or may not help,
depending on your source code.
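As a rough guide, on recent GCC something like this is close to what
-Ofast means on top of -O3 (check the documentation for your exact
version):

$ g++ -O3 -ffast-math -fallow-store-data-races -o cache cache.cpp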
Post by Vir Campestris
The pipeline thing made me try something different - instead of
incrementing a value I used std::copy to copy each word to the next one.
That rose to a peak rate of about 64MB/S with a size of 32k, dropped
sharply to 45MB/s and then showed a steady decline to 40MB/s at 256k. It
then dropped sharply to 10MB/S for all larger sizes.
A much more sensible result!
Andy
Michael S
2023-12-27 16:38:42 UTC
Permalink
On Wed, 27 Dec 2023 14:20:08 +0000
Post by Vir Campestris
Post by David Brown
Post by Thomas Koenig
Post by Vir Campestris
<snip>
which seems to have more flutter in group A.  I get similar results
in a VM
Theo
Thank you for those. Neither of them show the odd thing I was
seeing - but I compiled with -ofast. Does that make a difference?
That would put it in an executable named "fast" :-)
YKWIM :P
Post by David Brown
Some programs are really fussy about capitalisation!
Post by Thomas Koenig
-march=native -mtune=native might make more of a difference,
depending on how good the compiler's model of your hardware is.
For gcc on modern x86-64 processors (IME at least), the difference
between -O0 and -O1 is often large, but the difference between -O1
and higher levels usually makes far less difference.  The
processors themselves do a good job of things like instruction
scheduling and register renaming at runtime, and are designed to be
good for running weakly optimised code.  I find it makes a bigger
difference on processors that don't do as much at run-time, such as
microcontroller cores.
But the "-march=native" can make a very big difference, especially
if it means the compiler can use SIMD or other advanced
instructions.  (You don't need "-mtune" if you have "march", as it
is implied by "-march" - you only need both if you want to make a
build that will run on many x86-64 variants but is optimised for
one particular one.)  And the "-march=native" benefits go well with
some of the higher optimisations - thus it is good to combine
"-march=native" with "-Ofast" or "-O2".
In practice, things also vary dramatically according to the type of program.
mtune=native _does_ make a difference. Somewhat to my surprise it
makes it _slower_ - the biggest difference being about 5% with a size
of 128k UINT64.
I checked O0 (really slow) O1 (a lot faster) O3 (quite a bit faster
too) Ofast (mostly slightly faster than O3 when I use a capital O)
and O3 march=native.
The pipeline thing made me try something different - instead of
incrementing a value I used std::copy to copy each word to the next one.
That rose to a peak rate of about 64MB/S with a size of 32k, dropped
sharply to 45MB/s and then showed a steady decline to 40MB/s at 256k.
It then dropped sharply to 10MB/S for all larger sizes.
A much more sensible result!
GB rather than MB, hopefully
Post by Vir Campestris
Andy
Can you tell us what you are trying to achieve?
Is it a microbenchmark intended to improve your understanding of Zen1
internals?
Or part of the code that you want to run as fast as possible?
If the latter, what exactly do you want the code to do?
Vir Campestris
2023-12-29 17:54:53 UTC
Permalink
Post by Michael S
On Wed, 27 Dec 2023 14:20:08 +0000
Post by Vir Campestris
That rose to a peak rate of about 64MB/S with a size of 32k, dropped
sharply to 45MB/s and then showed a steady decline to 40MB/s at 256k.
It then dropped sharply to 10MB/S for all larger sizes.
A much more sensible result!
GB rather than MB, hopefully
Post by Vir Campestris
Andy
Can you tell us what are you trying to achieve?
Is it a microbenchmark intended to improve your understanding of Zen1
internals?
Or part of the code that you want to run as fast as possible?
If the later, what exactly do you want to be done by the code?
It is a microbenchmark.

I was trying to understand cache performance on my system.

Over in comp.lang.c++ you'll find a thread about the Sieve of Eratosthenes.

I and another poster are trying to optimise it. Not for any good reason
of course... but I was just curious about some of the results I was getting.

Andy
Andy Burns
2023-12-29 18:19:09 UTC
Permalink
Post by Vir Campestris
It is a microbenchmark.
I was trying to understand cache performance on my system.
Over in comp.lang.c++ you'll find a thread about Sieve or Eratosthenes.
I and another poster are trying to optimise it. Not for any good reason
of course... but I was just curious about some of the results I was getting.
A few years ago this guy popped up on youtube, saying "I was the author
of Task Manager on Windows NT". I ignored the recommendation for quite a
while, thinking he just wanted people to blow smoke up his arse;
eventually I watched and he seems a decent guy ... anyway he did a drag
racing series using a prime sieve algorithm

<https://www.youtube.com/playlist?list=PLF2KJ6Gy3cZ5Er-1eF9fN1Hgw_xkoD9V1>

Others have contributed in all manner of languages

<https://github.com/PlummersSoftwareLLC/Primes>
Terje Mathisen
2023-12-30 13:57:55 UTC
Permalink
Post by Andy Burns
Post by Vir Campestris
It is a microbenchmark.
I was trying to understand cache performance on my system.
Over in comp.lang.c++ you'll find a thread about Sieve or Eratosthenes.
I and another poster are trying to optimise it. Not for any good
reason of course... but I was just curious about some of the results I
was getting.
A few years ago this guy popped-up on youtube, saying "I was the author
of Task Manager on Windows NT", I ignored the recommendation for quite a
while thinking he just wanted people to blow smoke up his arse,
eventually I watched and he seems a decent guy ... anyway he did a drag
racing series using sieve prime algorithm
<https://www.youtube.com/playlist?list=PLF2KJ6Gy3cZ5Er-1eF9fN1Hgw_xkoD9V1>
Dave is a Good Guy!

I have contributed to his "smallest windows program ever" with a small
tweak that saved 4 (or 6?) bytes.

The sieve algorithm used above is more advanced than the fixed modulo 30
program I wrote a few decades ago, but similar in approach.

I decided to use mod 30 because as soon as you get to 30+, there are
exactly 8 possible prime locations in each block, so it fits perfectly
in a byte map. :-)

1,7,11,13,17,19,23,29
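A minimal sketch of the byte-map idea (illustrative only, not my original
mod-30 code; 2, 3 and 5 are handled separately and the marking loop is
deliberately naive):

#include <cstdint>
#include <iostream>
#include <vector>

// The 8 residues mod 30 that can still be prime once 2, 3 and 5 are excluded.
static const int residue[8] = {1, 7, 11, 13, 17, 19, 23, 29};

int main()
{
    const uint64_t limit = 1000;                         // find primes below this
    std::vector<uint8_t> bitmap(limit / 30 + 1, 0xFF);   // one byte per 30 numbers

    // Which bit represents n, or -1 if n is divisible by 2, 3 or 5.
    auto bit_of = [](uint64_t n) -> int {
        int r = (int)(n % 30);
        for (int i = 0; i < 8; ++i)
            if (residue[i] == r) return i;
        return -1;
    };
    auto is_set = [&](uint64_t n) -> bool {
        int b = bit_of(n);
        return b >= 0 && (bitmap[n / 30] & (1u << b));
    };

    for (uint64_t p = 7; p * p < limit; ++p)
        if (is_set(p))
            for (uint64_t m = p * p; m < limit; m += p)   // a real wheel sieve
            {                                             // would step smarter
                int b = bit_of(m);
                if (b >= 0) bitmap[m / 30] &= ~(1u << b);
            }

    std::cout << "2 3 5";
    for (uint64_t n = 7; n < limit; ++n)
        if (is_set(n)) std::cout << " " << n;
    std::cout << "\n";
}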
Post by Andy Burns
Others have contributed in all manner of languages
<https://github.com/PlummersSoftwareLLC/Primes>
Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
BGB
2023-12-30 18:24:44 UTC
Permalink
Post by Terje Mathisen
Post by Andy Burns
Post by Vir Campestris
It is a microbenchmark.
I was trying to understand cache performance on my system.
Over in comp.lang.c++ you'll find a thread about Sieve or Eratosthenes.
I and another poster are trying to optimise it. Not for any good
reason of course... but I was just curious about some of the results
I was getting.
A few years ago this guy popped-up on youtube, saying "I was the
author of Task Manager on Windows NT", I ignored the recommendation
for quite a while thinking he just wanted people to blow smoke up his
arse, eventually I watched and he seems a decent guy ... anyway he did
a drag racing series using sieve prime algorithm
<https://www.youtube.com/playlist?list=PLF2KJ6Gy3cZ5Er-1eF9fN1Hgw_xkoD9V1>
Dave is a Good Guy!
Yes, and generally interesting videos on YouTube.

He did interviews with David Cutler and Raymond Chen as well, both
fairly big names at Microsoft, which were also kinda interesting.
Post by Terje Mathisen
I have contributed to his "smallest windows program ever" with a small
tweak that saved 4 (or 6?) bytes.
The sieve algorithm used above is more advanced than the fixed modulo 30
program I wrote a ffew decades ago, but similar in approach.
I decided to use mod 30 because as soon as you get to 30+, there are
exactly 8 possible prime locations in each block, so it fits perfectly
in a byte map. :-)
1,7,11,13,17,19,23,29
I previously used a prime sieve to test static vs dynamic type performance
in my compiler, and noted that the relative performance difference was
smaller than might otherwise be expected (intuitively, one would expect
dynamic types to be horridly slow, since nearly every operation is
effectively a runtime call).

Namely, I only saw around a 3x difference at the time, rather than
around a 10x+ difference, which is more what I would expect.


Granted, a prime sieve is also not exactly something that performs well
on my ISA design (but, slightly less "aggressively badly" than the
"recursive Fibonacci number" algorithm, *).

*, IIRC:
long rfib(long x)
{ return((x<2)?1:(rfib(x-1)+rfib(x-2))); }
Which is basically a stress test of function call and return performance...
Post by Terje Mathisen
Post by Andy Burns
Others have contributed in all manner of languages
<https://github.com/PlummersSoftwareLLC/Primes>
Terje
MitchAlsup
2023-12-30 22:05:30 UTC
Permalink
Post by BGB
long rfib(long x)
{ return((x<2)?1:(rfib(x-1)+rfib(x-2))); }
At first I thought the following looked pretty good:

rfib:
ENTER R29,R0,#0 // just 3 preserved registers.

CMP R2,R1,#2
PGE R2,TT
MOV R1,#1 // return 1
EXIT R29,R0,#0

SUB R30,R1,#2
SUB R1,R1,#1
CALL rfib // rfib(n-1)
MOV R29,R1
MOV R1,R30
CALL rfib // rfib(n-2)

ADD R1,R1,R29
EXIT R29,R0,#0 // return rfib(n-1)+rfib(n-2)

But then I moved save and restore to the rfib() calls making the last rung
on the call tree less expensive by 6 memory references each, and by saving
restoring only 2 registers rather than 3 at the cost of a MOV, so each non-
leaf level in the call tree saves/restores only 4 registers instead of 6::

rfib:
SUB R2,R1,#2
PNLT R2,TT // (n-2) < 0
MOV R1,#1 // return 1
RET

ENTER R30,R0,#0 // just 2 preserved registers.
MOV R30,R2
SUB R1,R1,#1
CALL rfib // rfib(n-1)
MOV R2,R1
MOV R1,R30
MOV R30,R2
CALL rfib // rfib(n-2)

ADD R1,R1,R30
EXIT R30,R0,#0 // return rfib(n-1)+rfib(n-2)

But then I realized that rfib(n-2) has already been computed by rfib(n-1).
Since the size of the redundant element on the stack is constant, rfib(n-1)
makes a location available to rfib(n-2) so only 1 call is required instead
of 2::

rfib:
ENTER R0,R0,#8
STD #0,[SP+8] // setup rfib[n-2]
CALL rfib1 // recurse through rfib1
EXIT R0,R0,#8

rfib1:
SUB R2,R1,#2
PNLT R2,TT // (n-2) < 0
MOV R1,#1 // return 1
RET

ENTER R0,R0,#8 // no preserved registers just return address

SUB R1,R1,#1
CALL rfib1 // rfib(n-1)
LDD R2,[SP] // save rfib(n-2)

ADD R1,R1,R30
STD R1,[SP+16] // restore rfib(n-2)
EXIT R0,R0,#8 // return rfib(n-1)+rfib(n-2)

....
BGB
2023-12-30 23:19:12 UTC
Permalink
   long rfib(long x)
     { return((x<2)?1:(rfib(x-1)+rfib(x-2))); }
     ENTER   R29,R0,#0          // just 3 preserved registers.
     CMP     R2,R1,#2
     PGE     R2,TT
     MOV     R1,#1              // return 1
     EXIT    R29,R0,#0
     SUB     R30,R1,#2
     SUB     R1,R1,#1
     CALL    rfib               // rfib(n-1)
     MOV     R29,R1
     MOV     R1,R30
     CALL    rfib               // rfib(n-2)
     ADD     R1,R1,R29
     EXIT    R29,R0,#0          // return rfib(n-1)+rfib(n-2)
Theoretically, could be done in my ISA something like:
rfib:
ADD -24, SP | MOV LR, R1
MOV 1, R2 | CMPGE 2, R4
BF .L0
MOV.X R12, (SP, 0)
MOV R4, R12 | MOV.Q R1, (SP, 16)
ADD R12, -1, R4
BSR rfib
MOV R2, R13 | ADD R12, -2, R4
BSR rfib
ADD R2, R13, R2
MOV.Q (SP, 16), R1
MOV.X (SP, 0), R12
.L0:
ADD 24, SP
JMP R1

But, my compiler isn't going to pull off anything like this.


Checking, actual BGBCC output:
rfib:
MOV LR, R1
BSR __prolog_0000_0000020000FC
ADD -208, SP
MOV RQ4, RQ13
// tk_shell_chksvar.c:234 { return((x<2)?1:(rfib(x-1)+rfib(x-2))); }
CMPQGE 2, RQ13
BT .L00802A62
MOV 1, RQ10
BRA .L00802A63

.L00802A62:
SUB RQ13, 1, RQ14
MOV RQ14, RQ4
BSR rfib
MOV RQ2, RQ12
SUB RQ13, 2, RQ14
MOV RQ14, RQ4
BSR rfib
MOV RQ2, RQ11
ADD RQ12, RQ11, RQ14
MOV RQ14, RQ10

.L00802A63:
MOV RQ10, RQ2

.L00C108CE:
ADD 208, SP
BRA __epilog_0000_0000020000FC
.balign 4

( Yes, this is more MOV's than "technically necessary", but eliminating
them from the compiler output is "easier said than done" given how some
parts of the compiler work. RQn/RDn are technically equivalent to Rn,
but the compiler distinguishes them, as early on the idea was that the
assembler might distinguish instructions based on type, like
EAX/RAX/AX/... on x86-64, but it didn't quite happen this way. )

And, fetching the prolog and epilog:
__prolog_0000_0000020000FC:
ADD -48, SP
MOV.Q R1, (SP, 40)
MOV.X R12, (SP, 16)
MOV.Q R14, (SP, 32)
MOV.X R10, (SP, 0)
RTS

__epilog_0000_0000020000FC:
MOV.Q (SP, 40), R1
MOV.X (SP, 0), R10
MOV.X (SP, 16), R12
MOV.Q (SP, 32), R14
ADD 48, SP
JMP R1


Or, a little closer to the final machine-code (BJX2 Baseline):

rfib: //@03E27C
4911 STC R1, R1
.reloc __prolog_0000_0000020000FC 0F/RELW20_BJX
F000_D000 BSR (PC, 0)
F2F1_D330 ADD -208, SP
18D4 MOV R4, R13
F2DA_C802 CMPQGE 2, R13
.reloc .L00802A62 0F/RELW20_BJX
E000_C000 BT (PC, 0)
DA01 MOV 1, R10
.reloc .L00802A63 0F/RELW20_BJX
F000_C000 BRA (PC, 0)
3000 NOP

.L00802A62: //@03E298
F2ED_11FF ADD R13, -1, R14
F04E_1089 MOV R14, R4
.reloc rfib 0F/RELW20_BJX
F000_D000 BSR (PC, 0)
F4C2_1089 MOV R2, R12 |
F2ED_11FE ADD R13, -2, R14
F04E_1089 MOV R14, R4
.reloc rfib 0F/RELW20_BJX
F000_D000 BSR (PC, 0)
F0B2_1089 MOV R2, R11
F0EC_10B0 ADD R12, R11, R14
F0AE_1089 MOV R14, R10

.L00802A63: //@03E2C0
182A MOV R10, R2

.L00C108CE: //@03E2C2
F2F0_D0D0 ADD 208, SP
.reloc __epilog_0000_0000020000FC 0F/RELW20_BJX
F000_C000 BRA (PC, 0)
3030 BRK
But then I moved save and restore to the rfib() calls making the last rung
on the call tree less expensive by 6 memory references each, and by saving
restoring only 2 registers rather than 3 at the cost of a MOV, so each non-
     SUB     R2,R1,#2
     PNLT    R2,TT              // (n-2) < 0
     MOV     R1,#1              // return 1
     RET
     ENTER   R30,R0,#0          // just 2 preserved registers.
     MOV     R30,R2
     SUB     R1,R1,#1
     CALL    rfib               // rfib(n-1)
     MOV     R2,R1
     MOV     R1,R30
     MOV     R30,R2
     CALL    rfib               // rfib(n-2)
     ADD     R1,R1,R30
     EXIT    R30,R0,#0          // return rfib(n-1)+rfib(n-2)
But then I realized that rfib(n-2) has already been computed by rfib(n-1).
Since the size of the redundant element on the stack is constant, rfib(n-1)
makes a location available to rfib(n-2) so only 1 call is required instead
     ENTER   R0,R0,#8
     STD     #0,[SP+8]        // setup rfib[n-2]
     CALL    rfib1             // recurse through rfib1
     EXIT    R0,R0,#8
     SUB     R2,R1,#2
     PNLT    R2,TT              // (n-2) < 0
     MOV     R1,#1              // return 1
     RET
     ENTER   R0,R0,#8           // no preserved registers just return
address
     SUB     R1,R1,#1
     CALL    rfib1              // rfib(n-1)
     LDD     R2,[SP]            // save rfib(n-2)
     ADD     R1,R1,R30
     STD     R1,[SP+16]         // restore rfib(n-2)
     EXIT    R0,R0,#8           // return rfib(n-1)+rfib(n-2)
....
If you cache the intermediate results of the calculations, this entirely
defeats its use as a benchmark...

Say, because it drops from ~ O(2^n) to ~ O(n)...

Granted, there are plenty of other much more efficient ways of
calculating Fibonacci numbers...
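For what it's worth, the same recurrence done iteratively (which is
roughly what memoising rfib collapses it to) is just:

// O(n) version of the same recurrence; rfib(0) = rfib(1) = 1.
long ifib(long x)
{
    long a = 1, b = 1;
    for (long i = 2; i <= x; ++i)
    {
        long c = a + b;        // rfib(i) = rfib(i-1) + rfib(i-2)
        a = b;
        b = c;
    }
    return b;
}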
MitchAlsup
2023-12-30 23:55:52 UTC
Permalink
Post by BGB
   long rfib(long x)
     { return((x<2)?1:(rfib(x-1)+rfib(x-2))); }
....
If you cache the intermediate results of the calculations, this entirely
defeats its use as a benchmark...
Exactly:: but there is talk of compilers figuring out that rfib() is a
pure function and then the compiler itself performs that transformation.

{{Oh and BTW: Fibonacci should have never been a benchmark (post 1985)}}
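(For reference, one hypothetical way to assert that purity by hand in GCC/Clang;
on its own the attribute only licenses merging of identical calls, not the full
caching rewrite discussed above:)

// Hypothetical annotation: the result depends only on the argument and there
// are no side effects, so the compiler may fold repeated identical calls.
__attribute__((const)) long rfib(long x);

long rfib(long x) { return (x < 2) ? 1 : (rfib(x - 1) + rfib(x - 2)); }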
Post by BGB
Say, because it drops from ~ O(2^n) to ~ O(n)...
It also reduces the constant term in BigO ! since the code path is shorter.

Back in 1982±, S.E.L. got their compiler to recognize the transcendental
loop as constant, and this one transformation improved Whetstone by 3×.
Post by BGB
Granted, There are plenty of other much more efficient ways of
calculating Fibonacci numbers...
Make a file as big as you like, and simply index.

BigO( const )
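(A minimal in-memory sketch of that table idea, in C++17; hypothetical code,
not from the thread. With the thread's fib(0) == fib(1) == 1 convention a
signed 64-bit result overflows a little past n == 90, so the whole table is
under a hundred entries:)

#include <array>
#include <cstdint>

constexpr int kMaxN = 91;                      // assumed safe bound for int64

constexpr std::array<std::int64_t, kMaxN + 1> make_fib_table()
{
    std::array<std::int64_t, kMaxN + 1> t{};
    t[0] = 1;
    t[1] = 1;
    for (int i = 2; i <= kMaxN; ++i)
        t[i] = t[i - 1] + t[i - 2];
    return t;
}

constexpr auto kFibTable = make_fib_table();

std::int64_t fib_lookup(int n) { return kFibTable[n]; }  // caller keeps 0 <= n <= kMaxN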
Pancho
2023-12-31 09:12:12 UTC
Permalink
Post by BGB
Granted, There are plenty of other much more efficient ways of
calculating Fibonacci numbers...
Yes. In order of effort.

1) Memoize the recursive function.
2) Return values for (n-1,n-2) as a tuple, and make it single (tail)
recursive. Which a smart compiler could theoretically convert to an
iterative loop.
3) Write an iterative loop yourself (a short sketch follows below).
4) Remember college maths, and solve the recurrence relation to give a
very simple closed form solution.

In fact, given that fib grows exponentially, so we only ever need to
calculate it for small values of n, the only “bad” solution is the naive
double recursive one.
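(Option 3 spelled out as a short C++ sketch; hypothetical code using the same
fib(0) == fib(1) == 1 convention as rfib() above. O(n) time, O(1) space.)

#include <cstdint>

std::int64_t fib_iter(int n)
{
    std::int64_t a = 1, b = 1;          // fib(i-1), fib(i), starting at i == 1
    for (int i = 2; i <= n; ++i) {
        std::int64_t next = a + b;
        a = b;
        b = next;
    }
    return b;                            // also correct for n == 0, since a == b == 1
}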
Daniel James
2023-12-31 15:45:32 UTC
Permalink
Post by Pancho
In fact, given that fib grows exponentially, so we only ever need to
calculate it for small values of n, the only “bad” solution is the naive
double recursive one.
Well, I might disagree.

Given that we only ever need to calculate it for small values of n (as
you say, and if that's true) its efficiency doesn't matter. If you need
to compute it often enough that efficiency might become relevant just
put the results in a table and use that. Given that fib goes over 10^12
after about 60 terms it needn't be a very big table.

The naive doubly recursive solution has the huge merit of being
human-readable.

Unlike, say:

double fib( int n )
{
    return (pow( (1+sqrt(5.0))/2, n+1 )
            - pow( (1-sqrt(5.0))/2, n+1 ))
        / sqrt(5.0);
}

... which is great for fibs of big values, but not fast for small ones.
--
Cheers,
Daniel.
Pancho
2023-12-31 16:40:09 UTC
Permalink
Post by Daniel James
Post by Pancho
In fact, given that fib grows exponentially, so we only ever need to
calculate it for small values of n, the only “bad” solution is the
naive double recursive one.
Well, I might disagree.
Given that we only ever need to calculate it for small values of n (as
you say, and if that's true) its efficiency doesn't matter. If you need
to compute it often enough that efficiency might become relevant just
put the results in a table and use that. Given that fib goes over 10^12
after about 60 terms it needn't be a very big table.
Memoization is effectively a lookup table, but without the thinking. In
my case, avoiding thinking massively improves code reliability.
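(As a hedged sketch of that, hypothetical code not from the thread: a memoized
wrapper keeps the naive recursive shape but each value is computed once, so the
~2^n calls collapse to ~n. The static map is not thread-safe and never cleared.)

#include <cstdint>
#include <unordered_map>

std::int64_t fib_memo(int n)
{
    static std::unordered_map<int, std::int64_t> seen;   // n -> fib(n)
    if (n < 2) return 1;
    if (auto it = seen.find(n); it != seen.end()) return it->second;
    std::int64_t r = fib_memo(n - 1) + fib_memo(n - 2);
    seen.emplace(n, r);
    return r;
}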
Post by Daniel James
The naive doubly recursive solution has the huge merit of being
human-readable.
Yes, but you'd be surprised how quickly double recursion goes bad. I
haven't tested recently, but off the top of my head, the maximum
function call depth would be something like 300, so fib(10) would be a
stack overflow. fib(60) would need a stack function call depth ~ 1e18.
Post by Daniel James
  double fib( int n )
  {
      return (pow( (1+sqrt(5.0))/2, n+1 )
              - pow( (1-sqrt(5.0))/2, n+1 ))
          / sqrt(5.0);
  }
... which is great for fibs of big values, but not fast for small ones.
Why isn't it fast for small fibs?

I seem to recall, if statements are more costly than floating point
function calls, like exponential.
Terje Mathisen
2023-12-31 17:03:42 UTC
Permalink
Post by Pancho
Post by Daniel James
Post by Pancho
In fact, given that fib grows exponentially, so we only ever need to
calculate it for small values of n, the only “bad” solution is the
naive double recursive one.
Well, I might disagree.
Given that we only ever need to calculate it for small values of n (as
you say, and if that's true) its efficiency doesn't matter. If you
need to compute it often enough that efficiency might become relevant
just put the results in a table and use that. Given that fib goes over
10^12 after about 60 terms it needn't be a very big table.
Memoization, is effectively a lookup table, but without the thinking. In
my case, avoiding thinking massively improves code reliability.
Post by Daniel James
The naive doubly recursive solution has the huge merit of being
human-readable.
Yes, but you'd be surprised how quickly double recursion goes bad. I
haven't tested recently, but off the top of my head, the maximum
function call depth would be something like 300, so fib(10) would be a
stack overflow. fib(60) would need a stack function call depth ~ 1e18.
Post by Daniel James
   double fib( int n )
   {
       return (pow( (1+sqrt(5.0))/2, n+1 )
               - pow( (1-sqrt(5.0))/2, n+1 ))
           / sqrt(5.0);
   }
... which is great for fibs of big values, but not fast for small ones.
Why isn't it fast for small fibs?
I seem to recall, if statements are more costly than floating point
function calls, like exponential.
The recursive fib() function has just a single if () statement, so worst
case you'll get one or two branch mispredicts, while the 2^n recursive
calls (for fib(10 or 11)) will probably each take about as long.

An iterative O(n) version will take 2-4 clock cycles per iteration, so
for n less than about 40, you'd only have ~100 clock cycles to evaluate
the two pow() calls. (I'm assuming the compiler or programmer have
pre-calculated both (1+sqrt(5))*0.5 and (1-sqrt(5))*0.5 as well as
1/sqrt(5), so that only the two pow() calls remain!)
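(Spelled out as C++ with the constants hoisted as assumed above, and the final
divide already turned into a multiply; hypothetical code, names invented:)

#include <cmath>
#include <cstdint>

std::int64_t fib_closed(int n)
{
    static const double kPhi    = (1.0 + std::sqrt(5.0)) * 0.5;
    static const double kPsi    = (1.0 - std::sqrt(5.0)) * 0.5;
    static const double kInvRt5 = 1.0 / std::sqrt(5.0);
    // Two pow() calls, a subtract, a multiply, then round to the nearest integer.
    double f = (std::pow(kPhi, n + 1) - std::pow(kPsi, n + 1)) * kInvRt5;
    return std::llround(f);
}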

You need Mitch's ~20-cycle pow() to win this one at one or two-digit N.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Pancho
2023-12-31 21:19:01 UTC
Permalink
Post by Terje Mathisen
The recursive fib() function has just a single if () statement, so worst
case you'll get one or two branch mispredicts, while the 2^n recursive
calls (for fib(10 or 11)) will probably each take about as long.
How many cycles for a branch mispredict? From real life, I have seen if
statements be more expensive than pow, but it was a long time ago, and
possibly code specific. It surprised me at the time, but ever since I
tend to ignore the cost of functions like pow().
Post by Terje Mathisen
An iterative O(n) version will take 2-4 clock cycles per iteration, so
for n less than about 40, you'd only have ~100 clock cycles to evaluate
the two pow() calls. (I'm assuming the compiler or programmer have
pre-calculated both (1+sqrt(5))*0.5 and (1-sqrt(5))*0.5 as well as
1/sqrt(5), so that only the two pow() calls remain!)
You need Mitch's ~20-cycle pow() to win this one at one or two-digit N.
I generally don't worry about optimizing stuff that is very fast.
Premature optimisation is the root of all evil and what not. If you are
really calling it a huge number of times, a lookup is quickest.
MitchAlsup
2024-01-01 20:26:14 UTC
Permalink
Post by Pancho
Post by Terje Mathisen
The recursive fib() function has just a single if () statement, so worst
case you'll get one or two branch mispredicts, while the 2^n recursive
calls (for fib(10 or 11)) will probably each take about as long.
How many cycles for a branch mispredict? From real life, I have seen if
statements be more expensive than pow, but it was a long time ago, and
possibly code specific. It surprised me at the time, but ever since I
tend to ignore the cost of functions like pow().
Long latency branch mispredicts are often in the 13-cycle range. So if you
are issuing 2 instructions per cycle into the pipeline, you may be throwing away
~40 instructions per mispredict. For example LD-CMP-BR where the LD misses
L1 but hits in L2.
MitchAlsup
2024-01-01 20:22:23 UTC
Permalink
Post by Terje Mathisen
Post by Pancho
Post by Daniel James
   double fib( int n )
   {
       return (pow( (1+sqrt(5.0))/2, n+1 )
               - pow( (1-sqrt(5.0))/2, n+1 ))
           / sqrt(5.0);
   }
... which is great for fibs of big values, but not fast for small ones.
Why isn't it fast for small fibs?
I seem to recall, if statements are more costly than floating point
function calls, like exponential.
The recursive fib() function has just a single if () statement, so worst
case you'll get one or two branch mispredicts, while the 2^n recursive
calls (for fib(10 or 11)) will probably each take about as long.
An iterative O(n) version will take 2-4 clock cycles per iteration, so
for n less than about 40, you'd only have ~100 clock cycles to evaluate
the two pow() calls.
POW is 39 cycles (unpipelined) in My 66000 ISA.
Post by Terje Mathisen
(I'm assuming the compiler or programmer have
pre-calculated both (1+sqrt(5))*0.5 and (1-sqrt(5))*0.5 as well as
1/sqrt(5), so that only the two pow() calls remain!)
Plus a FSUB and FDIV.
Post by Terje Mathisen
You need Mitch's ~20-cycle pow() to win this one at one or two-digit N.
The Ln2() is 14, the exp2() is another 14, then there is the 90-bit×64-bit
multiply 5-cycles and pipeline latency of 5 cycles (and I seem to be missing
a cycle)
Post by Terje Mathisen
Terje
Pancho
2023-12-31 21:04:03 UTC
Permalink
Post by Pancho
I
haven't tested recently, but off the top of my head, the maximum
function call depth would be something like 300, so fib(10) would be a
stack overflow. fib(60) would need a stack function call depth ~ 1e18.
Sorry, this was a mistake: the function call depth is only ~n, not 2^n.
For fib(n), there are ~2^n function calls, but they are not 2^n deep on the stack.
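(A small instrumented sketch, hypothetical code with invented counter names,
makes both numbers visible: the call count grows roughly as 2^n while the
stack depth stays at ~n.)

#include <cstdint>
#include <cstdio>

static std::uint64_t g_calls = 0;
static int g_depth = 0, g_max_depth = 0;

std::int64_t fib_counted(int n)
{
    ++g_calls;
    if (++g_depth > g_max_depth) g_max_depth = g_depth;
    std::int64_t r = (n < 2) ? 1 : fib_counted(n - 1) + fib_counted(n - 2);
    --g_depth;
    return r;
}

int main()
{
    std::printf("fib(30) = %lld, calls = %llu, max depth = %d\n",
                (long long)fib_counted(30),
                (unsigned long long)g_calls, g_max_depth);
    return 0;
}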
Daniel James
2024-01-01 15:00:28 UTC
Permalink
Post by Daniel James
Post by Daniel James
The naive doubly recursive solution has the huge merit of being
human-readable.
Yes, but you'd be surprised how quickly double recursion goes bad. I
haven't tested recently, but off the top of my head, the maximum
function call depth would be something like 300, so fib(10) would be
a stack overflow. fib(60) would need a stack function call depth ~
1e18.
On 64-bit gcc here on Debian (so, 64-bit ints) I can call fib(50)
without blowing the stack, but it takes minutes (on a Ryzen 7).

However, the (signed) 64-bit integer result overflows for fib(n) where
n>45, which is more of a problem.
Post by Daniel James
Post by Daniel James
double fib( int n )
{
return (pow( (1+sqrt(5.0))/2, n+1 )
- pow( (1-sqrt(5.0))/2, n+1 ))
/ sqrt(5.0);
}
... which is great for fibs of big values, but not fast for small ones.
Why isn't it fast for small fibs?
Gut feeling. I haven't timed it, and it may be.

I mean ... obviously it will be as fast for small ones as for large, but
for small ones it won't beat an iterative calculation and for very small
ones I doubt it will beat the naive recursive method.

My point, though, was that clear code is often more important than fast
execution, and while the naive recursive method for fibs gets horribly
slow for large values even it is acceptable for small ones.
Post by Daniel James
I seem to recall, if statements are more costly than floating point
function calls, like exponential.
if statements are costly if they invalidate instruction prefetch and/or
jump to code that isn't in the cache ... but optimizers and CPUs are
pretty good at branch prediction and caching.
--
Cheers,
Daniel.
Pancho
2024-01-02 09:23:39 UTC
Permalink
Post by Daniel James
Post by Daniel James
Post by Daniel James
The naive doubly recursive solution has the huge merit of being
human-readable.
Yes, but you'd be surprised how quickly double recursion goes bad. I
haven't tested recently, but off the top of my head, the maximum
function call depth would be something like 300, so fib(10) would be
a stack overflow. fib(60) would need a stack function call depth ~
1e18.
On 64-bit gcc here on Debian (so, 64-bit ints) I can call fib(50)
without blowing the stack, but it takes minutes (on a Ryzen 7).
However, the (signed) 64-bit integer result overflows for fib(n) where
n>45, which is more of a problem.
Yes, I explained that was a brain fart on my part. I remembered the
recursive stack limit was low and confused the number of recursive calls
with stack depth.

The real place I've had recursive stuff break is tree/graph handling,
where a tree may adopt linear characteristics. There are much better
algorithms than a naive recursive one.
Post by Daniel James
Post by Daniel James
Post by Daniel James
   double fib( int n )
   {
       return (pow( (1+sqrt(5.0))/2, n+1 )
               - pow( (1-sqrt(5.0))/2, n+1 ))
           / sqrt(5.0);
   }
... which is great for fibs of big values, but not fast for small ones.
Why isn't it fast for small fibs?
Gut feeling. I haven't timed it, and it may be.
Yes, what I'm saying is it normally isn't valid. Floating point
functions are fast, and it is rarely something you need to optimise.
Post by Daniel James
I mean ... obviously it will be as fast for small ones as for large, but
for small ones it won't beat an iterative calculation and for very small
ones I doubt it will beat the naive recursive method.
It generally doesn't matter if something very, very fast is slightly
faster than something just very fast. Especially when you also have
very, very slow behaviour elsewhere.
Post by Daniel James
My point, though, was that clear code is often more important than fast
execution, and while the naive recursive method for fibs gets horribly
slow for large values even it is acceptable for small ones.
From a mathematical viewpoint, the closed form solution gives clearer
behaviour. It really depends on what you want; giving it a standard name
is probably the most important thing.
Post by Daniel James
Post by Daniel James
I seem to recall, if statements are more costly than floating point
function calls, like exponential.
if statements are costly if they invalidate instruction prefetch and/or
jump to code that isn't in the cache ... but optimizers and CPUs are
pretty good at branch prediction and caching.
I was remembering software doing real life financial calcs, but I'm old,
so it was a long time ago. Pipelines are longer now, but maybe branch
prediction is better? So I don't know if it is more of a problem or
less of a problem nowadays.

MitchAlsup
2024-01-01 20:16:43 UTC
Permalink
Post by Daniel James
double fib( int n )
{
return (pow( (1+sqrt(5.0))/2, n+1 )
- pow( (1-sqrt(5.0))/2, n+1 ))
/ sqrt(5.0);
}
Looks fast to me::

fib:
ADD R2,R1,#1
POWN R3,#(1+sqrt(5.0))/2,R2
POWN R4,#(1-sqrt(5.0))/2,R2
FADD R2,R3,R4
FDIV R1,R2,#sqrt(5)
RET
Terje Mathisen
2024-01-02 07:16:44 UTC
Permalink
Post by Daniel James
   double fib( int n )
   {
       return (pow( (1+sqrt(5.0))/2, n+1 )
               - pow( (1-sqrt(5.0))/2, n+1 ))
           / sqrt(5.0);
   }
    ADD    R2,R1,#1
    POWN    R3,#(1+sqrt(5.0))/2,R2
    POWN    R4,#(1-sqrt(5.0))/2,R2
    FADD    R2,R3,R4
    FDIV    R1,R2,#sqrt(5)
    RET
The final FDIV should be a FMUL by (1.0/sqrt(5.0))

Probably followed by a round to nearest int.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Terje Mathisen
2023-12-31 15:22:39 UTC
Permalink
   long rfib(long x)
     { return((x<2)?1:(rfib(x-1)+rfib(x-2))); }
     ENTER   R29,R0,#0          // just 3 preserved registers.
     CMP     R2,R1,#2
     PGE     R2,TT
     MOV     R1,#1              // return 1
     EXIT    R29,R0,#0
     SUB     R30,R1,#2
     SUB     R1,R1,#1
     CALL    rfib               // rfib(n-1)
     MOV     R29,R1
     MOV     R1,R30
     CALL    rfib               // rfib(n-2)
     ADD     R1,R1,R29
     EXIT    R29,R0,#0          // return rfib(n-1)+rfib(n-2)
But then I moved save and restore to the rfib() calls making the last rung
on the call tree less expensive by 6 memory references each, and by saving
restoring only 2 registers rather than 3 at the cost of a MOV, so each non-
     SUB     R2,R1,#2
     PNLT    R2,TT              // (n-2) < 0
     MOV     R1,#1              // return 1
     RET
     ENTER   R30,R0,#0          // just 2 preserved registers.
     MOV     R30,R2
     SUB     R1,R1,#1
     CALL    rfib               // rfib(n-1)
     MOV     R2,R1
     MOV     R1,R30
     MOV     R30,R2
     CALL    rfib               // rfib(n-2)
     ADD     R1,R1,R30
     EXIT    R30,R0,#0          // return rfib(n-1)+rfib(n-2)
But then I realized that rfib(n-2) has already been computed by rfib(n-1).
Since the size of the redundant element on the stack is constant, rfib(n-1)
makes a location available to rfib(n-2) so only 1 call is required instead
That is the huge/obvious/sneaky (take your pick!) Fib optimization,
since you are effectively getting rid of the O(2^n) exponential
recursion fanout, making it O(n) instead.

The step from that to tail recursion elimination is the only thing
remaining, right?

What you've done is similar to converting

f_n = fib(n)

into

(f_n,f_n_minus_1) = fib2(n)
{
if (n < 2) return (1,1);
(f1,f2) = fib2(n-1);   // f1 = fib(n-1), f2 = fib(n-2)
return (f1+f2,f1);
}
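(In runnable C++ the same shape might look like this; a hypothetical rendering
of the pseudocode above, not Terje's code:)

#include <cstdint>
#include <utility>

// Returns { fib(n), fib(n-1) }, with fib(0) == fib(1) == 1 as elsewhere in the thread.
std::pair<std::int64_t, std::int64_t> fib2(int n)
{
    if (n < 2) return {1, 1};
    auto [f1, f0] = fib2(n - 1);    // f1 = fib(n-1), f0 = fib(n-2)
    return {f1 + f0, f1};
}

std::int64_t fib(int n) { return fib2(n).first; }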

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Andy Burns
2023-12-29 21:30:45 UTC
Permalink
Post by Vir Campestris
The code I have put in a signature block; there's no point in risking
someone posting it again. I've commented it, but no doubt not in all the
right places! I'd be interested to know what results other people get.
Built and run under Win11 with g++ 13.2.0
on a laptop with
i7-11370H CPU (3.3GHz base, 4.28GHz turbo)
L1 d-cache 48kB per core
L2 cache 1.2MB
L3 cache 12MB
16GB memory

Size 1 gave 5.12047e+09 bytes/second.
Size 2 gave 5.76467e+09 bytes/second.
Size 4 gave 5.78979e+09 bytes/second.
Size 8 gave 5.6642e+09 bytes/second.
Size 16 gave 5.55303e+09 bytes/second.
Size 32 gave 4.4151e+09 bytes/second.
Size 64 gave 4.67783e+09 bytes/second.
Size 128 gave 4.85115e+09 bytes/second.
Size 256 gave 4.88147e+09 bytes/second.
Size 512 gave 4.5214e+09 bytes/second.
Size 1024 gave 4.64794e+09 bytes/second.
Size 2048 gave 4.69149e+09 bytes/second.
Size 4096 gave 4.69416e+09 bytes/second.
Size 8192 gave 4.73425e+09 bytes/second.
Size 16384 gave 4.76353e+09 bytes/second.
Size 32768 gave 4.70864e+09 bytes/second.
Size 65536 gave 4.75244e+09 bytes/second.
Size 131072 gave 4.76452e+09 bytes/second.
Size 262144 gave 4.69839e+09 bytes/second.
Size 524288 gave 4.71094e+09 bytes/second.
Size 1048576 gave 4.62922e+09 bytes/second.
Size 2097152 gave 4.55435e+09 bytes/second.
Size 4194304 gave 4.58315e+09 bytes/second.
Size 8388608 gave 4.55917e+09 bytes/second.
Size 16777216 gave 4.60661e+09 bytes/second.
Size 33554432 gave 4.63509e+09 bytes/second.
Size 67108864 gave 4.62349e+09 bytes/second.
Size 134217728 gave 4.62406e+09 bytes/second.
Andy Burns
2023-12-29 22:04:09 UTC
Permalink
and (perversely?) with -Ofast

Size 1 gave 4.64966e+09 bytes/second.
Size 2 gave 9.39481e+09 bytes/second.
Size 4 gave 1.59656e+10 bytes/second.
Size 8 gave 2.6789e+10 bytes/second.
Size 16 gave 4.42999e+10 bytes/second.
Size 32 gave 4.51076e+10 bytes/second.
Size 64 gave 4.90103e+10 bytes/second.
Size 128 gave 4.08414e+10 bytes/second.
Size 256 gave 4.36978e+10 bytes/second.
Size 512 gave 4.54119e+10 bytes/second.
Size 1024 gave 4.50594e+10 bytes/second.
Size 2048 gave 4.71622e+10 bytes/second.
Size 4096 gave 4.60261e+10 bytes/second.
Size 8192 gave 3.84764e+10 bytes/second.
Size 16384 gave 3.90878e+10 bytes/second.
Size 32768 gave 3.76718e+10 bytes/second.
Size 65536 gave 3.87339e+10 bytes/second.
Size 131072 gave 3.89058e+10 bytes/second.
Size 262144 gave 3.85662e+10 bytes/second.
Size 524288 gave 4.19209e+10 bytes/second.
Size 1048576 gave 3.00962e+10 bytes/second.
Size 2097152 gave 1.37616e+10 bytes/second.
Size 4194304 gave 1.4136e+10 bytes/second.
Size 8388608 gave 1.39413e+10 bytes/second.
Size 16777216 gave 1.35747e+10 bytes/second.
Size 33554432 gave 1.37286e+10 bytes/second.
Size 67108864 gave 1.38163e+10 bytes/second.
Size 134217728 gave 1.29987e+10 bytes/second.
Andy Burns
2023-12-29 22:07:25 UTC
Permalink
Post by Andy Burns
and (perversely?) with -Ofast
I must look at the exponent, not just the mantissa
I must look at the exponent, not just the mantissa
I must look at the exponent, not just the mantissa
MitchAlsup
2023-12-29 23:59:44 UTC
Permalink
Post by Andy Burns
and (perversely?) with -Ofast
{ A
Post by Andy Burns
Size 1 gave 4.64966e+09 bytes/second.
Size 2 gave 9.39481e+09 bytes/second.
Size 4 gave 1.59656e+10 bytes/second.
Size 8 gave 2.6789e+10 bytes/second.
Size 16 gave 4.42999e+10 bytes/second.
} A
{ B
Post by Andy Burns
Size 32 gave 4.51076e+10 bytes/second.
Size 64 gave 4.90103e+10 bytes/second.
Size 128 gave 4.08414e+10 bytes/second.
Size 256 gave 4.36978e+10 bytes/second.
Size 512 gave 4.54119e+10 bytes/second.
Size 1024 gave 4.50594e+10 bytes/second.
Size 2048 gave 4.71622e+10 bytes/second.
Size 4096 gave 4.60261e+10 bytes/second.
} B
{ C
Post by Andy Burns
Size 8192 gave 3.84764e+10 bytes/second.
Size 16384 gave 3.90878e+10 bytes/second.
Size 32768 gave 3.76718e+10 bytes/second.
Size 65536 gave 3.87339e+10 bytes/second.
Size 131072 gave 3.89058e+10 bytes/second.
Size 262144 gave 3.85662e+10 bytes/second.
Size 524288 gave 4.19209e+10 bytes/second.
} C
{ D
Post by Andy Burns
Size 1048576 gave 3.00962e+10 bytes/second.
Size 2097152 gave 1.37616e+10 bytes/second.
Size 4194304 gave 1.4136e+10 bytes/second.
Size 8388608 gave 1.39413e+10 bytes/second.
Size 16777216 gave 1.35747e+10 bytes/second.
Size 33554432 gave 1.37286e+10 bytes/second.
Size 67108864 gave 1.38163e+10 bytes/second.
Size 134217728 gave 1.29987e+10 bytes/second.
} D



A displays BW that follows access size

B displays essentially the same performance throughout

C displays a minor step down in performance due to L2 or TLB effects (a 5-to-4 ratio)

D displays a major step down in performance compared to C (a 3-to-1 ratio)
Vir Campestris
2023-12-31 16:52:58 UTC
Permalink
Post by Andy Burns
and (perversely?) with -Ofast
Thank you for those. It looks as though there are some other effects
going on apart from the cache size - but I can clearly see the drop when
you run out of cache. Within the caches? Too much noise for me to
understand!

Andy
MitchAlsup
2023-12-29 22:02:36 UTC
Permalink
Post by Andy Burns
Post by Vir Campestris
The code I have put in a signature block; there's no point in risking
someone posting it again. I've commented it, but no doubt not in all the
right places! I'd be interested to know what results other people get.
Built and run under Win11 with g++ 13.2.0
on a laptop with
i7-11370H CPU (3.3GHz base, 4.28GHz turbo)
L1 d-cache 48kB per core
L2 cache 1.2MB
L3 cache 12MB
16GB memory
Size 1 gave 5.12047e+09 bytes/second.
Size 2 gave 5.76467e+09 bytes/second.
Size 4 gave 5.78979e+09 bytes/second.
Size 8 gave 5.6642e+09 bytes/second.
Size 16 gave 5.55303e+09 bytes/second.
Size 32 gave 4.4151e+09 bytes/second.
Size 64 gave 4.67783e+09 bytes/second.
Size 128 gave 4.85115e+09 bytes/second.
Size 256 gave 4.88147e+09 bytes/second.
Size 512 gave 4.5214e+09 bytes/second.
Size 1024 gave 4.64794e+09 bytes/second.
Size 2048 gave 4.69149e+09 bytes/second.
Size 4096 gave 4.69416e+09 bytes/second.
Size 8192 gave 4.73425e+09 bytes/second.
Size 16384 gave 4.76353e+09 bytes/second.
Size 32768 gave 4.70864e+09 bytes/second.
Size 65536 gave 4.75244e+09 bytes/second.
Size 131072 gave 4.76452e+09 bytes/second.
Size 262144 gave 4.69839e+09 bytes/second.
Size 524288 gave 4.71094e+09 bytes/second.
Size 1048576 gave 4.62922e+09 bytes/second.
Size 2097152 gave 4.55435e+09 bytes/second.
Size 4194304 gave 4.58315e+09 bytes/second.
Size 8388608 gave 4.55917e+09 bytes/second.
Size 16777216 gave 4.60661e+09 bytes/second.
Size 33554432 gave 4.63509e+09 bytes/second.
Size 67108864 gave 4.62349e+09 bytes/second.
Size 134217728 gave 4.62406e+09 bytes/second.
This data set has a 5/4 ratio best to worst with noise.
Andy Burns
2023-12-22 08:43:52 UTC
Permalink
Post by Vir Campestris
Series length 8B is about the same as the 56 to 128 speed. Series length
16B is a bit less. Series length 32 is a lot less. Not as slow as main
memory, but not much more than half the peak speed. My next step up is
the peak speed. Series length 144 to 448 is slower still - slower in
fact than the main memory speed.
WTF?
Apparently my brain thought it would be "fun" to imagine its way through
all the reads and writes running back and forth between levels of cache
and main memory ... it didn't stop me sleeping, but it didn't reach any
conclusion either!