
arithmetic overflow

Started by
12 comments, last by JoeJ 3 years, 5 months ago

frob said:

JoeJ said:
I am sure 32 bits are enough for my data, so why use 64? On some (or any) platform, could this affect performance?

You are quite likely paying the cost of conversions that you didn't intend. It is faster to keep them all as the same type and do the math than it is to do the conversions.

Sometimes those conversions you pay for happen inside the CPU core itself. Modern processors have moved to 64-bit processing internally, so many of the 32-bit, 16-bit, and 8-bit operations are generally slower. For many operations they do the work as though it were a 64-bit value, sometimes having to extend the value out, do the math, and then chop it back down. Benchmark tools like PassMark PerformanceTest have found notable differences on many CPUs, which they have occasionally discussed in their forums. For example, 64-bit integer multiplies may be 6x faster than the 32-bit integer operations, or 64-bit division might be 4x faster than the same 32-bit integer division. Exact numbers depend on the chips and the operations involved.

None of this is new. The performance difference was noted about 15 years ago when Intel and AMD moved to 64-bit CPU cores. They continue to focus first on 64-bit (and, with modern extensions, also 128-bit, 256-bit, and even 512-bit) optimizations and leave the old, smaller cases behind as historical footnotes. The engineers might have left in faster hardware designs specific to the 32-bit or 16-bit operations, but quite often they don't bother and simply do the work at 64-bit (or higher) precision.

If your data is 32-bit, use a 32-bit type like u32 or i32 or int32_t or whatever you want in your system. If it is 64-bit, do likewise, using the actual size of the data. If you're working with memory sizes then size_t is the proper data type, not int or int64_t or int32_t, or other variations. Don't intentionally do conversions you don't need.
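To make that concrete, here is a minimal sketch (my example, not from the thread; the function names are made up) of the kind of unintended conversion that creeps in when you index a container with int while the container's size type is size_t:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical example: the int index may need widening to a 64-bit
// offset on each use, and the cast of size() adds another conversion.
long long sum_mixed(const std::vector<std::int64_t>& v)
{
    long long s = 0;
    for (int i = 0; i < static_cast<int>(v.size()); ++i)
        s += v[static_cast<std::size_t>(i)]; // sign-extend int -> size_t
    return s;
}

// Keeping the index as size_t matches the container's own size type,
// so no conversions appear anywhere in the loop.
long long sum_sized(const std::vector<std::int64_t>& v)
{
    long long s = 0;
    for (std::size_t i = 0; i < v.size(); ++i)
        s += v[i];
    return s;
}
```

Whether the compiler can optimize the conversions away depends on the target and optimization level, but writing the matching type removes the question entirely.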

I am very sceptical of this claim that 64-bit ops could be (that much) faster than 32-bit ops, so I tried to find the source and I think this is where it comes from: https://forums.passmark.com/performancetest/3278-amd-llano-cpumark-artificially-inflated-scores?postcount=7#post18109

And it just seems to be the usual mix-up of 64-bit architecture and 64-bit data. When they talk about 32-bit and 64-bit in that thread, it's about the architecture, without specifying what data types are used - just “Integer Math” and “Find Prime Numbers”. A bit more reading ( https://forums.passmark.com/performancetest/3383-64bit-vs-32bit-benchmarks-integer-maths-pt8?t=3348 ) shows that both tests mixed 32-bit and 64-bit integers, and the performance numbers were dominated by integer division and square roots respectively. So it's not that 32-bit ops are slow(er) on modern x64 CPUs, it's that 64-bit integer operations are slow when you have to emulate them with 32-bit x86 instructions.
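For illustration (my example, not from the PassMark thread): the same C++ division compiles very differently depending on the target. On a 64-bit build both functions below become a single divide instruction; on a 32-bit x86 build, div64 is typically lowered to a call into the compiler's runtime library (e.g. __udivdi3 with GCC/Clang), which is far slower.

```cpp
#include <cstdint>

// On x86-64: one hardware divide instruction.
// On 32-bit x86: usually a runtime-library call (__udivdi3 or similar),
// since the hardware has no native 64-bit divide.
std::uint64_t div64(std::uint64_t a, std::uint64_t b)
{
    return a / b;
}

// A single hardware divide on both 32-bit and 64-bit targets.
std::uint32_t div32(std::uint32_t a, std::uint32_t b)
{
    return a / b;
}
```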

And then coincidentally the specific chips in that original benchmark also had a hardware bug, with a workaround that completely ruined their integer division performance: https://forums.passmark.com/performancetest/3705-amd-llano-a-series-benchmark-and-cpu-bug?t=3656

And if you want to look up exact performance numbers for individual instructions on specific processor architectures, Agner Fog has you covered: https://www.agner.org/optimize/instruction_tables.pdf


I think we can also take a look at the “int_fast” types defined in <cstdint>. On pretty much all modern compilers/systems that I checked, int_fast8_t is defined as int8_t, int_fast16_t as int32_t, int_fast32_t as int32_t and int_fast64_t as int64_t. That seems to suggest that anything outside of 16-bit integers is going to be pretty efficient for the processor to deal with.
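It's easy to check what your own toolchain picks; a quick sketch (results vary by ABI - e.g. glibc on x86-64 makes int_fast16_t/int_fast32_t 8 bytes, while MSVC keeps them at 4):

```cpp
#include <cstdint>
#include <cstdio>

// Prints the byte widths your standard library chose for the "fast" types.
void printFastWidths()
{
    std::printf("int_fast8_t:  %zu bytes\n", sizeof(std::int_fast8_t));
    std::printf("int_fast16_t: %zu bytes\n", sizeof(std::int_fast16_t));
    std::printf("int_fast32_t: %zu bytes\n", sizeof(std::int_fast32_t));
    std::printf("int_fast64_t: %zu bytes\n", sizeof(std::int_fast64_t));
}
```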

Eternal said:
I am very sceptical of this claim that 64-bit ops could be (that much) faster than 32-bit ops

It's something I'll keep an eye on over time. I never compared 32- vs 64-bit integer type performance before.

Yesterday I made a fake Poisson sample generator, and I thought this would be a nice test, because it does a lot of index math.
Basically, for each sample it iterates 27 * 27 cells, so that's expensive.

The test was very interesting. At first I forgot to also convert the randF function, because it used uint32_t instead of int.
With this, performance was exactly the same when replacing int with either int32_t or int64_t.
Then, after changing this function too, I got a 10% speedup. : )

Seems the 64 → 32u truncation had a noticeable cost, and using 64-bit is free. Both are surprising to me.

namespace FAKE_POISSON
{
	// Note: Vec3 (with a SqL() squared-length method), FP_TINY, FP_EPSILON2,
	// min and max come from elsewhere in my codebase.
	constexpr float radius = 1.f;
	constexpr float clip = 0.498f;

	//using intT = int32_t; // 478 ms
	using intT = int64_t; // also 478 ms at first, 438 ms after making also randF 64 bit

	inline float randF (intT i)
	{
		i *= 1664525;
		i ^= (i << 16);
		i ^= (i >> 16);

		i *= 16807;
		i ^= (i << 8);
		i ^= (i >> 24);

		i *= 100069;
		i ^= (i << 24);
		i ^= (i >> 8);

		i *= 1000099;

		return float(i) / float(0x100000000LL);
	}


	template <class Vec3>
	Vec3 RandomSampleInCell (const intT x, const intT y, const intT z)
	{
		intT h = (x*23879 + y*96823 + z*182177) * 3;
		return Vec3 ( randF(h)-.5f, randF(h+1)-.5f, randF(h+2)-.5f );
	}

	template <class Vec3>
	Vec3 RelaxedSampleInCell (const intT x, const intT y, const intT z)
	{
		Vec3 curP = RandomSampleInCell<Vec3> (x, y, z);

		Vec3 dispSum (0);
		float wSum = FP_TINY;
		for (intT x2=-1; x2<=1; x2++)
		for (intT y2=-1; y2<=1; y2++)
		for (intT z2=-1; z2<=1; z2++)
		{
			Vec3 adjP = RandomSampleInCell<Vec3> (x+x2, y+y2, z+z2) + Vec3(x2,y2,z2);
			Vec3 diff = curP - adjP;
			float l = diff.SqL();
			if (l<radius*radius && l>FP_EPSILON2)
			{
				l = sqrt(l);
				Vec3 disp = diff / l * (radius-l);
				dispSum += disp;
				wSum += 1.f;
			}			
		}			
		curP += dispSum / (wSum*2);
		for (intT i=0; i<3; i++)
		{
			curP[i] = max(-clip, curP[i]);
			curP[i] = min( clip, curP[i]);
		}

		return curP;
	}

	template <class Vec3>
	Vec3 RelaxedTwiceSampleInCell (const intT x, const intT y, const intT z)
	{
		Vec3 curP = RelaxedSampleInCell<Vec3> (x, y, z);

		Vec3 dispSum (0);
		float wSum = FP_TINY;
		for (intT x2=-1; x2<=1; x2++)
		for (intT y2=-1; y2<=1; y2++)
		for (intT z2=-1; z2<=1; z2++)
		{
			Vec3 adjP = RelaxedSampleInCell<Vec3> (x+x2, y+y2, z+z2) + Vec3(x2,y2,z2);
			Vec3 diff = curP - adjP;
			float l = diff.SqL();
			if (l<radius*radius && l>FP_EPSILON2)
			{
				l = sqrt(l);
				Vec3 disp = diff / l * (radius-l);
				dispSum += disp;
				wSum += 1.f;
			}		
		}			
		curP += dispSum / (wSum*2);
		for (intT i=0; i<3; i++)
		{
			curP[i] = max(-clip, curP[i]);
			curP[i] = min( clip, curP[i]);
		}

		return curP;
	}
}
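For anyone wanting to repeat the comparison without the Vec3 dependencies, here is a standalone sketch (mine, not JoeJ's harness) that times just the randF hash, templated on the integer type. Timings are machine- and flag-dependent; this only mirrors the structure of the test above.

```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>

// The hash from the post above, templated so that int32_t and int64_t
// variants can be timed side by side.
template <class IntT>
float randF(IntT i)
{
    i *= 1664525;  i ^= (i << 16); i ^= (i >> 16);
    i *= 16807;    i ^= (i << 8);  i ^= (i >> 24);
    i *= 100069;   i ^= (i << 24); i ^= (i >> 8);
    i *= 1000099;
    return float(i) / float(0x100000000LL);
}

// Returns milliseconds for n hash evaluations; prints a checksum so the
// optimizer cannot remove the loop entirely.
template <class IntT>
double timeHash(int n)
{
    const auto t0 = std::chrono::steady_clock::now();
    float sink = 0.f;
    for (int i = 0; i < n; ++i)
        sink += randF<IntT>(static_cast<IntT>(i));
    const auto t1 = std::chrono::steady_clock::now();
    std::printf("checksum: %f\n", sink);
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

Calling timeHash<std::int32_t>(N) and timeHash<std::int64_t>(N) with a large N gives a rough side-by-side comparison.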

This topic is closed to new replies.
