
C++ swizzle selector like GLSL?

Started by
19 comments, last by Juliean 2 years, 6 months ago

I've defined our math classes like so:

typedef union Vector4
    {
        public:
            struct _Fields
            {
                public:
                    /**
                     This vector's X coordinate
                    */
                    float X;
                    /**
                     This vector's Y coordinate
                    */
                    float Y;
                    /**
                     This vector's Z coordinate
                    */
                    float Z;
                    /**
                     This vector's W coordinate
                    */
                    float W;
            };

            float Value[4];
            _Fields Fields;

            force_inline float X()
            {
                return Fields.X;
            }

            force_inline float X() const
            {
                return Fields.X;
            }

            force_inline void X(float value)
            {
                Fields.X = value;
            }
            ...
            
    } vec4;

You can, however, also add an additional field for the 3-component vector if you like; the union should do the trick.
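
A minimal sketch of what that extra field could look like (assuming a Vector3 defined with the same three leading floats; the member name is just for illustration):

typedef union Vector4
    {
        struct _Fields
        {
            float X;
            float Y;
            float Z;
            float W;
        };

        float Value[4];
        _Fields Fields;
        Vector3 XYZ;   // 3-component view over the same storage (X, Y, Z)

        // ... accessors as above ...
    } vec4;

As with Value and Fields above, reading a different union member than the one last written is formally type punning, so treat this as a compiler-specific convenience rather than strictly portable C++.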


Even if the wrapper struct gets optimized away, a large cost remains.

On graphics cards it is a hardware operation; that's part of the appeal. Swizzle and mask operations have been effectively free since they were introduced.

On the CPU, even when the wrappers are “optimized away” and all of the function overhead is completely inlined, it is still likely holding 4 floats or 4 pointers in registers, which in turn affects register coloring and the space available, and the cost of the indirection is still present in the actual processing. Not optimized away, it's a variable taking at least that much space, plus the overhead of the function prologs and epilogs, plus more due to allocations, along with the time and space to allocate and destroy.

If you absolutely need it then you'll pay the cost anyway, but know that the cost exists when you start processing long arrays of vertices, point clouds, or point arrays. As the structures can often be processed with SIMD operations for the entire array, in a more serious project you'll likely end up rewriting it for bulk processing anyway. Processing items individually, one at a time, is quite inefficient.
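
As a rough illustration of the kind of bulk processing meant here (a sketch only, assuming an SoA layout and a count that is a multiple of 4; none of this is from the code above):

#include <immintrin.h>
#include <cstddef>

// Structure-of-arrays layout: all X together, all Y together, all Z together.
struct Points
{
    float* X;
    float* Y;
    float* Z;
    std::size_t Count;   // assumed to be a multiple of 4 here
};

// Scale every point by a constant, four components at a time with SSE.
void Scale(Points& p, float s)
{
    const __m128 factor = _mm_set1_ps(s);
    for (std::size_t i = 0; i < p.Count; i += 4)
    {
        _mm_storeu_ps(p.X + i, _mm_mul_ps(_mm_loadu_ps(p.X + i), factor));
        _mm_storeu_ps(p.Y + i, _mm_mul_ps(_mm_loadu_ps(p.Y + i), factor));
        _mm_storeu_ps(p.Z + i, _mm_mul_ps(_mm_loadu_ps(p.Z + i), factor));
    }
}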

frob said:
On the CPU, even when the wrappers are “optimized away” and all of the function overhead is completely inlined, it is still likely holding 4 floats or 4 pointers in registers, which in turn affects register coloring and the space available, and the cost of the indirection is still present in the actual processing. Not optimized away, it's a variable taking at least that much space, plus the overhead of the function prologs and epilogs, plus more due to allocations, along with the time and space to allocate and destroy.

That isn't necessarily true. As long as the compiler can inline all of the wrapper's functions, it can track the references back and optimize the storage away entirely, replacing the operation done on the wrapper with one on whatever the reference points to. I made a simplistic example in Godbolt showing that the generated assembly can be absolutely identical, from the instructions down to which registers are used, as if the wrapper didn't exist at all, but I lost it overnight (if you don't believe me I'll recreate it). This does depend on the wrapper being created locally in a function, and on the source of the swizzling being non-conditional, but it is possible for this to be 100% free. That said, I wouldn't have been able to say whether this is 100% true without looking at the generated assembly, so a bit of caution and measuring here is definitely warranted.

#include <iostream>

using namespace std;

// Minimal test: expose a member through a reference, standing in for a
// swizzle-style accessor.
struct vec1
{
    float x;
    float & r;

    vec1(float & a) : r(a) { }
    vec1() : r(x)
    {
        x = 0.0;
    }
};

struct vec2 : public vec1
{
};

int main()
{
    vec1 p;
    p.r = 5;                 // writes x through the reference
    cout << p.x << endl;
    p.x = 10.5;
    cout << p.r << endl;     // reads x back through the reference
}

Speaking of swizzling and C/C++, there is a way of doing it properly through SSE/AVX, although it ends up as SHUFPS instructions (which is kind of expected). A quick mockup (simplified as much as possible) is here - https://github.com/Zgragselus/Swizzling - I took the liberty of stripping it down to the smallest size that still allows swizzling (it could have been a gist, but it'd be too long for one).

The general idea is: use templates, struct members inside a union, and a shuffle instruction. By putting the swizzle members in a union you get rid of the parentheses at the end (so you don't need to call it like vec.wzyx(), but can write vec.wzyx instead). Much like the parentheses version, it compiles to exactly the same code with optimizations enabled (bear in mind that without optimizations - in Debug - this ends up doing stdcalls, which is going to be quite slow), resulting in a few mov (ideally movaps) instructions plus a shuffle instruction (for AVX this can possibly be a permute instruction instead of a shuffle). Done this way, in Release you won't need the stdcall (fastcall, or whatever calling convention you use) machinery; it will be optimized down to movs and shuf/perm, so you avoid the performance hit @frob was mentioning with functions.
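
Condensed to its core, the idea looks roughly like this (a sketch of the technique, not the code from the repository above; member names are just for illustration):

#include <xmmintrin.h>

// A swizzle member: reading it produces a shuffled copy of the register.
template<int A, int B, int C, int D>
struct Swizzle
{
    __m128 data;

    operator __m128() const
    {
        return _mm_shuffle_ps(data, data, _MM_SHUFFLE(D, C, B, A));
    }
};

union vec4
{
    __m128 data;
    Swizzle<0, 1, 2, 3> xyzw;
    Swizzle<3, 2, 1, 0> wzyx;
    Swizzle<2, 1, 0, 3> zyxw;
};

// Usage: given vec4 v, "v.wzyx" converts to __m128 through the shuffle -
// no trailing parentheses, and in an optimized build just a single shufps.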

That being said: you will still end up with shuf/perm instructions even when the compiler perfectly optimizes everything out (which may not happen in some cases!), and as mentioned, heavy computing is most likely going to be done either directly with intrinsics or with batch processing in mind (treating it as SoA vs. AoS depending on the resulting speed).

I personally used a similar implementation in the past, but it really was overkill - I could use swizzling, but I never actually did in real code. It's nice code, but it may end up being dead code as well. Currently I just use a standard SSE-based implementation, without any dark magic like swizzling. In heavyweight applications like software ray tracers or rasterizers I have mostly used intrinsics directly (or tiny wrappers) and avoided unnecessary instructions to increase performance. Every instruction counts, especially when you do it per processed pixel.

I believe there are also C++ libraries mimicking GLSL and HLSL that allow you to do this.

My current blog on programming, linux and stuff - http://gameprogrammerdiary.blogspot.com

a light breeze said:
Only if the optimizer doesn't optimize the entire wrapper struct away.

It can't possibly do this, because it has storage, and affects aliasing.

enum Bool { True, False, FileNotFound };

hplus0603 said:
It can't possibly do this, because it has storage, and affects aliasing.

Yes, it absolutely can: https://godbolt.org/z/39KMKP1MM

The function multiplying the vec3 directly, and the one doing the multiply via the wrapper are identical in the generated assembly under O3. As long as you don't store the struct in another class (which, why would you do that when your main goal is to be able to do vec.yz *= 2), optimizing away references and structs containing those is something that a compiler is very capable of doing.
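
The shape of that example is roughly the following (a sketch of the same idea rather than the exact source behind the link, with made-up names):

struct vec3 { float x, y, z; };

// Proxy holding references to two components, standing in for vec.yz.
struct yz_ref
{
    float& y;
    float& z;
    void operator*=(float s) { y *= s; z *= s; }
};

// Multiplying the components directly...
void scaleDirect(vec3& v)
{
    v.y *= 2.0f;
    v.z *= 2.0f;
}

// ...and via the proxy. With everything inlined, the compiler can see
// through the references and emit the same instructions as scaleDirect.
void scaleViaProxy(vec3& v)
{
    yz_ref proxy{ v.y, v.z };
    proxy *= 2.0f;
}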

That's solving the wrong problem, really. Quoting myself:

frob said:
the cost exists when you start processing long arrays of vertices, point clouds, or point arrays. As the structures can often be processed with SIMD operations for the entire array, in a more serious project you'll likely end up rewriting it for bulk processing anyway. Processing items individually, one at a time, is quite inefficient.

If you're processing serious arrays the way HLSL does, millions of entries each frame, but approaching it in a per-element processing format like this, well … good luck with that.

@Juliean At that point you need to construct the swizzlable instance temporarily, e.g. you can't just take them as reference arguments or store them as members, so the amount of typing goes up. And in your case, you still have to make the function inline, intrusively, so there's really no gain over just making them return the real vec3.

If you don't want the member functions, you can also write free functions that do the same. After trying many different approaches, that's what I found to be the best choice. Free functions have the benefit of not being intrusive, and if you make them templates, you can reuse the same implementation across multiple different math libraries (D3D vs physics vs animation package vs …)

template<typename T>
T zyx(T const &t) {
    return T{t.z, t.y, t.x};
}
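
For example, the same template then works unchanged against any vector type exposing x, y and z members (this vec3 is just a stand-in):

struct vec3 { float x, y, z; };

int main()
{
    vec3 v{ 1.0f, 2.0f, 3.0f };
    vec3 r = zyx(v);   // r is { 3.0f, 2.0f, 1.0f }
}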

enum Bool { True, False, FileNotFound };

