
Concurrent modification of components in ECS


I have little game dev experience but am trying to create an ECS-based game engine for personal use. Components are stored in, and managed by, separate Systems, as I believe this provides good SRP and the best performance, since each System can choose a memory layout optimal for its purpose, e.g. sorted, stored on the GPU, etc. The engineering issue I've been struggling with is how to create/update/delete Components in such a design, preferably allowing good parallelization. I've read every article/post I can find, but none seem to present a full or convincing solution. Conceivable methods:

1. Direct modification, synchronous message passing

A System just calls a function to directly create/update/delete a Component in another System, or in itself. Simple, but it would make parallelization nearly pointless since every Component data structure would require locking. There is no batching; if Components must maintain order, or need some other post-processing, re-sorting after every call is sub-optimal. To avoid locking, Systems could possibly be orchestrated to do their processing in an order that prevents concurrent modification, but as the number of Systems increases I fear it would become a maintenance nightmare, while also limiting parallelization.

2. Single “global” Event queue, asynchronous message passing

A System enqueues Events, such as EnemySpawned or WeaponFired, on a single “global” queue. All Systems then process the queue when they can do so without interference, possibly in parallel, to create/update/delete their Components, e.g. the PositionSystem creates its PositionComponent for the Event, the RenderSystem its MeshComponent, etc. I'd expect better performance since less locking is required, only on the queue itself. Batching would be possible, but every System would have to process lots of irrelevant Events. This also seems illogical to me: a RenderSystem shouldn't have to understand high-level concepts such as Enemy or Weapon, and as the number of Event types increases, so does the number of “listeners” in every interested System.
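
As a minimal sketch of what such a single “global” queue could look like (the EnemySpawned/WeaponFired payloads and the mutex-guarded vector are assumptions for illustration, not any specific engine's API):

#include <mutex>
#include <variant>
#include <vector>

// Hypothetical event payloads; names are illustrative only.
struct EnemySpawned { int entityId; float x, y; };
struct WeaponFired  { int entityId; };

using Event = std::variant<EnemySpawned, WeaponFired>;

// Single "global" queue: every enqueue takes the lock, and every System
// must scan all Events, including ones it doesn't care about.
class GlobalEventQueue {
public:
    void enqueue(Event e) {
        std::lock_guard<std::mutex> lock(mutex_);
        events_.push_back(std::move(e));
    }

    // Each System walks the full list once producers are done and picks
    // out the Event types relevant to it (e.g. via std::get_if).
    template <typename Fn>
    void process(Fn&& handler) {
        std::lock_guard<std::mutex> lock(mutex_);
        for (const Event& e : events_)
            handler(e);
    }

    void clear() {
        std::lock_guard<std::mutex> lock(mutex_);
        events_.clear();
    }

private:
    std::mutex mutex_;
    std::vector<Event> events_;
};

Note how this illustrates the downside described above: each System's handler still iterates every Event only to ignore the types it doesn't understand.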

2.1. Each Event type has a “global” queue

Similar to #2 except every type of Event has its own global queue. This could improve performance, as interested Systems wouldn't have to process irrelevant Events and there would be less queue lock contention. The downside is that there could be an unmanageable number of queues floating around.

2.2. Each Event type has a “global” queue for every worker thread

Same as #2.1 except every worker thread has its own queue, so no locking is needed, possibly improving performance. Downside: even more queues.

3. Single “global” Command queue, asynchronous message passing

Similar to #2, but instead of Events a System enqueues specific Commands, e.g. CreatePositionComponent handled by the PositionSystem, CreateMeshComponent handled by the RenderSystem, etc. Possible improvements over #2 are that the RenderSystem no longer has to understand high-level concepts, and that every System probably only needs three “listeners” (create/update/delete). A possible downside is that spawning an (Enemy) entity could generate a lot of Commands.

3.1. Each Command type has a “global” queue

Similar to #3 except every type of Command has its own global queue, so there's less queue lock contention. Beyond that I don't see any improvement, since a System wouldn't be able to perform create/delete in a single batch.

3.2. Each Command type has a “global” queue for every worker thread

Same as #3.1 except every worker thread has its own queue, so no locking is needed, possibly improving performance. Downside: more queues.

4. Each System has a single Command queue

Similar to #3 except every System has its own Command queue. This could improve performance, since a System wouldn't have to process any irrelevant Commands meant for other Systems. It could perform create/delete in a single batch, and since the System owns, or takes ownership of, the queue, it could also sort the Commands into an order optimal for processing.

4.1. Each System has a Command queue for every worker thread

Same as #4 except every worker thread has its own queue, so no locking is needed, possibly improving performance. This is how Vulkan does it, so I'd expect it's the preferable solution.
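
For illustration, per-System, per-worker-thread Command queues might look like the sketch below; the command type, the thread-id indexing, and the drain point are assumptions, not a known engine's API:

#include <cstddef>
#include <vector>

// Hypothetical Command for one System; a real engine would have
// create/update/delete variants per Component type.
struct CreatePositionCommand { int entityId; float x, y, z; };

// One queue per worker thread: producers append to their own slot with
// no locking, and the owning System drains all slots single-threaded at
// a safe point in the frame, where it can also sort/batch the Commands.
class PositionCommandQueues {
public:
    explicit PositionCommandQueues(std::size_t workerCount)
        : perThread_(workerCount) {}

    // Called from worker thread `tid`; lock-free because each thread
    // only ever touches its own vector.
    void enqueue(std::size_t tid, const CreatePositionCommand& cmd) {
        perThread_[tid].push_back(cmd);
    }

    // Called by the owning System once all workers have finished
    // producing for this frame (enforced by the frame scheduler).
    template <typename Fn>
    void drain(Fn&& apply) {
        for (auto& queue : perThread_) {
            for (const auto& cmd : queue)
                apply(cmd);
            queue.clear();
        }
    }

private:
    std::vector<std::vector<CreatePositionCommand>> perThread_;
};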

5. (Theoretical) task scheduler with System dependency graph

Some AAA engines seem to use such a design. I have no idea of the internals, but at a lower level I'd guess it's similar to #4.1, except that it enqueues a Command handler function, coroutine or fiber instead of just Command data.

EDIT:

6. Double-buffered Component data

Each System has two (or X, one per in-flight frame) copies/sets of its Component data, swapped after each frame: read from the current set, write to the next/future set. This is more of a simulation/logic data requirement, I guess, since I don't see how it would reduce locking when concurrently creating/deleting Components in another System; an Event/Command buffer would still be required for that. Copying all Component data every frame could also be costly.
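
A minimal sketch of double-buffered Component data, assuming a hypothetical Position component (the full copy on swap is exactly the per-frame cost mentioned above):

#include <cstddef>
#include <vector>

struct Position { float x, y, z; };

// Reads come from the current set, writes go to the next set, and the
// sets swap at the frame boundary.
class DoubleBufferedPositions {
public:
    explicit DoubleBufferedPositions(std::size_t count)
        : buffers_{std::vector<Position>(count), std::vector<Position>(count)} {}

    const Position& read(std::size_t i) const { return buffers_[current_][i]; }
    Position&       write(std::size_t i)      { return buffers_[1 - current_][i]; }

    // Called once, single-threaded, at the end of the frame. Copying
    // the new current set forward keeps components that nothing wrote
    // this frame valid for the next one.
    void swap() {
        current_ = 1 - current_;
        buffers_[1 - current_] = buffers_[current_];
    }

private:
    std::vector<Position> buffers_[2];
    std::size_t current_ = 0;
};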

Any other solutions?

What's your preference, and why?


To avoid most of the issues you mentioned, I have deployed multi-threading only in serious bottleneck routines (CPU time > 5%).

Now, with only a few problematic locations left in the code, I can solve them on a case-by-case basis:

  • For updates, I throw the responsibility to the user. The user has to split threads (if desired) and use a mutex himself as necessary:
// I believe my game engine works like OpenMP.
// All of this lives in user code (not engine code).
auto arr = getAllComponent<Com_Ship>();
#pragma omp parallel for
for (int i = 0; i < (int)arr.size(); i++) { // split into e.g. 4 threads
    arr[i]->x += arr[i]->vx * timeStepSecond; // user must lock a std::mutex himself if needed
}
  • The engine always uses a global mutex when creating/deleting an entity/component, but in the smallest scope possible (see the sketch after this list).
  • The engine supports only very simple callbacks (messaging), which again avoids most issues.
  • If I really need to support messaging with a potential race condition (e.g. a physics engine), I make it single-threaded, OR let the user pass a parameter for whether the system should multi-thread it.
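
As a rough sketch of the “global mutex in the smallest scope possible” idea (Registry and allocateId() are hypothetical stand-ins for illustration):

#include <mutex>

struct Registry { int next = 0; int allocateId() { return next++; } };

class Engine {
public:
    int createEntity() {
        int id;
        {
            // Hold the global lock only for the structural mutation...
            std::lock_guard<std::mutex> lock(structuralMutex_);
            id = registry_.allocateId();
        }
        // ...so component initialization afterwards runs unlocked.
        return id;
    }

private:
    std::mutex structuralMutex_; // one global lock for create/delete only
    Registry registry_;
};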

For a simple task scheduler, I use a global one. My code is modified from a Stack Overflow link (C++).

It is probably not the best approach, but it works for me.

You probably want to write the game first, then solve the various real multi-threading issues in your game. You will find that this is easier than trying to solve theoretical, imaginary ones.

Hope it helps. (I am not a native speaker.)

One nice trick is to have all the event queues buffer events till the next frame. If modifications all take place on the subsequent simulation tick (not the current one), then you have to deal with substantially less locking.
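
For illustration, a minimal deferred queue along these lines, assuming a generic Event type (a sketch, not anyone's actual implementation):

#include <mutex>
#include <utility>
#include <vector>

// Producers publish during tick N; the simulation takes the whole
// batch at the start of tick N+1 and processes it without locking.
template <typename Event>
class DeferredEventQueue {
public:
    void publish(Event e) {
        std::lock_guard<std::mutex> lock(mutex_);
        pending_.push_back(std::move(e));
    }

    // Called once at the frame boundary, before any system runs.
    std::vector<Event> takeForThisTick() {
        std::lock_guard<std::mutex> lock(mutex_);
        return std::exchange(pending_, {});
    }

private:
    std::mutex mutex_;
    std::vector<Event> pending_;
};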

If you want to take this concept really far, you can double buffer the entire simulation. Duplicate the data, all reads go to the current frame, all writes go to the future frame, swap the buffers when the frame completes.

Tristam MacDonald. Ex-BigTech Software Engineer. Future farmer. [https://trist.am]

swiftcoder said:
One nice trick is to have all the event queues buffer events till the next frame. If modifications all take place on the subsequent simulation tick (not the current one), then you have to deal with substantially less locking.

Agreed. The Event/Command buffers would have to be processed somewhere, either at the end of every frame or at the start of the subsequent frame.

swiftcoder said:
If you want to take this concept really far, you can double buffer the entire simulation. Duplicate the data, all reads go to the current frame, all writes go to the future frame, swap the buffers when the frame completes.

I forgot to add that solution. I'll still need Event/Command buffers for parallelized System writes. I'd expect duplicating all Component data every frame could be costly, but some Systems, like PhysicsSystem, will probably need it anyway.

ballzak said:
I'd expect duplicating all Component data every frame could be costly, but some Systems, like PhysicsSystem, will probably need it anyway.

Quite a lot of systems turn out to need it if you decide to decouple your frame rate from your simulation rate, or add networking with prediction/interpolation.

Tristam MacDonald. Ex-BigTech Software Engineer. Future farmer. [https://trist.am]

hyyou said:
You probably want to write the game first, then solve the various real multi-threading issues in your game. You will find that this is easier than trying to solve theoretical, imaginary ones.

Agreed, if making a game were the goal. But in that case, using an existing engine would likely be the proper answer. For now the engine is the priority, and this fundamental design decision will affect so many aspects of it. Performance expectations at this point are, of course, theoretical, but the cost of locking is not imaginary. I've already made some test Systems, but when I start to integrate and parallelize them, add async resource loading, etc., the current non-design begins to fall apart.

swiftcoder said:

ballzak said:
I'd expect duplicating all Component data every frame could be costly, but some Systems, like PhysicsSystem, will probably need it anyway.

Quite a lot of systems turn out to need it if you decide to decouple your frame rate from your simulation rate, or add networking with prediction/interpolation.

Thanks for the heads-up; implementing it later probably isn't much of an issue. However, the question of a good solution for creating/deleting Components concurrently remains: Event or Command queues?

So, the first question here, related to “write the game first”, is: what goal do you have? For instance, if you are writing an FPS with a few thousand entities, I would tend to question using an ECS in general, much less trying to parallelize it. But if you are looking to do a huge simulation and be ready for the continued increases in core counts, that's a whole different story, and the one I've been working on myself. So, assuming massive simulations and current/future CPUs, the fun begins.

First, what is the problem with ECS and parallel execution? Quite simply, data ownership needs to be explicit and controlled, both things which are not common in games, since threading really is not something a lot of folks architect for. This is basically just a multiple-reader, single-writer situation, except that you must explicitly control when a system is allowed to run in order to enforce the rules. Otherwise you end up in the situation you point out, where you have to lock each individual component: a totally evil situation, and one which almost guarantees that just running single-threaded would be better.

To solve the read/write ordering, plus a few less common issues, the solution I ended up using after several variations is a mix of your #1 and #5 proposed solutions. Basically, #5 consists of building a directed acyclic graph via constraint resolution, i.e. don't run a writer and a reader of a given component at the same time. One of my performance tests is a rather complicated flocking system, and it works out as a good example case for the various threading issues. You have the following components and systems:

Components: game time, position, velocity, orientation, force, spatial cell, spatial, render mesh
Systems: time update, update spatial, separation force, alignment force, cohesion force, apply force, render dispatch

In my solution, components can be shared (i.e. a single instance) or per-entity, which is the normal style in various ECSs. “game time” and “spatial” are shared components, and as such the “time update” and “update spatial” systems are both single function calls. Since so many things use the “game time” in a real game (i.e. it contains the delta time), we need to make sure the dependencies are defined. As such, here are two lines from my configuration system:

"timer update", "game time", {"w:game time"}, {}
"apply force", "apply forces", {"r:force", "w:position", "w:velocity", "w:orientation", "w:spatial cell"}, {"game time"}

So, the first column of data is the type of the system we are installing. My systems use a two-stage solution: first you register libraries of system types, which does not make them execute, it just makes them available for use. After registration you can install instances of systems, as seen here. The purpose is that I can make reusable libraries of systems. The second column is the name of the system instance to be created, used as a reference name for dependencies. Third is the list of components being accessed, with a permissions modifier: ‘r’ead, ‘w’rite or ‘u’nsafe. Last is the list of dependencies on other systems; in this case “apply forces” depends on “game time” having been run.

Anyway, if you put this into the constraints solver, you simply end up with two nodes, “game time” → “apply forces”, which to my execution pipeline means that “game time” will run and complete prior to “apply forces” starting to run. This behavior is critical to making workable parallel execution, even though this is a very simple example so far. Basically, the rules used to build the DAG enforce thread safety, such that when writing 90+% of systems you don't have to worry about any form of thread safety, because the system has proper memory behavior while running. Now, add the remaining systems and dependencies:

"timer update", "game time", {"w:game time"}, {}
"update spatial", "update spatial", {"u:spatial", "r:spatial cell"}, {}
"separation force", "separation", {"w:force", "r:spatial", "r:position"}, {"update spatial"}
"alignment force", "alignment", {"w:force", "r:spatial", "r:orientation"}, {"update spatial"}
"cohesion force", "cohesion", {"w:force", "r:spatial", "r:position"}, {"update spatial"}
"apply force", "apply forces", {"r:force", "w:position", "w:velocity", "w:orientation", "w:spatial cell"}, {"game time", "separation", "alignment", "cohesion"}
"render dispatch", "render", {"r:position", "r:orientation", "r:render mesh"}, {"apply forces"}

With this more complete configuration, you end up with a DAG solution as follows:

“game time”, “update spatial” → “separation force” → “alignment force” → “cohesion force” → “apply force” → “render dispatch”.

Basically, all this is saying is that we can run “game time” and “update spatial” in parallel, and everything else must be serial. This type of graph result is fairly standard in real game code, which is why I did not end up doing things the way most solutions I've seen handle this. Most ECS implementations which attempt to run in parallel end up using a work-stealing job system (WSJS from now on), and given the above, it's going to have no parallelism, because each system is a job and each job will depend on the prior job. If you attempt to solve this by issuing individual jobs for each entity in each system, well, Mr. Amdahl will show up and kick you in the butt. As minimal as the per-job overhead of a WSJS is, it still adds up, such that in a setup like this you will waste more time managing the WSJS than executing the actual jobs. A WSJS is a trap for anything trying to scale past 8-ish cores; don't get bit by it.
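
For illustration only, a greedy constraint-to-bucket pass along these lines could look like the sketch below; the SystemDecl layout and the bucketing strategy are assumptions (with ‘u’nsafe treated as a write for placement purposes), not All8Up's actual solver:

#include <algorithm>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

struct SystemDecl {
    std::string name;
    std::set<std::string> reads, writes; // component access constraints
    std::set<std::string> deps;          // systems that must finish first
};

// A writer may not share a bucket with any reader or writer of the
// same component; concurrent readers are fine.
static bool conflicts(const SystemDecl& a, const SystemDecl& b) {
    for (const auto& w : a.writes)
        if (b.reads.count(w) || b.writes.count(w)) return true;
    for (const auto& w : b.writes)
        if (a.reads.count(w)) return true;
    return false;
}

// Greedy bucketing; assumes `systems` is already topologically sorted
// so every dependency appears before its dependents. Buckets run in
// order with a barrier between them; systems within a bucket may run
// in parallel. Ordering beyond the explicit deps is out of scope here.
std::vector<std::vector<SystemDecl>> buildBuckets(const std::vector<SystemDecl>& systems) {
    std::vector<std::vector<SystemDecl>> buckets;
    std::map<std::string, std::size_t> bucketOf;
    for (const auto& sys : systems) {
        // Earliest legal bucket: strictly after every dependency.
        std::size_t slot = 0;
        for (const auto& dep : sys.deps)
            slot = std::max(slot, bucketOf.at(dep) + 1);
        // First bucket at or after that with no read/write conflict.
        while (slot < buckets.size()) {
            bool ok = true;
            for (const auto& other : buckets[slot])
                if (conflicts(sys, other)) { ok = false; break; }
            if (ok) break;
            ++slot;
        }
        if (slot == buckets.size()) buckets.emplace_back();
        buckets[slot].push_back(sys);
        bucketOf[sys.name] = slot;
    }
    return buckets;
}

Feeding the seven declarations above through something like this yields bucket 0 = {“game time”, “update spatial”}, then the three force systems in separate buckets (they all write “force”), then “apply forces”, then “render”, matching the chain shown earlier.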

Rather than using a WSJS, I have a thread management system where you can execute multiple different threading models at the same time. For 95+% of all needs, I only need one threading model, which hardly justifies the term “model”: parallel for loops. I.e. I just feed all the entities into each system as a simple parallel for and get my parallel execution in that manner. There is no per-job overhead to speak of, and as such the parallelism of the above is phenomenal. To give you an idea, even with the bandwidth-heavy nature of the spatial searches used to compute the forces (unless you cheat, which is not what I'm doing; this is supposed to be realistic for normal game AI), it distributes out on a 2990WX with 64 threads such that a normal performance test run starts at 50,000 boids or more while maintaining >200 Hz. The total number of execution barriers (i.e. locks) in the above execution pipeline is five, times the number of threads in use minus one, so extremely light locking for massive parallelism.

A final note about this: adding/removing entities is a massive pain in the butt for most engines, no matter what the backing storage systems are. I personally use a staging solution, such that adding an entity goes into a temporary storage buffer until all systems have finished executing, and then I run a single-threaded cleanup once per frame to move the new entities into the component archetypes where they actually belong. This means they don't attempt to insert themselves into things like the spatial awareness table or a running system immediately, and won't show up as entities until the next frame. For most intents and purposes you want this functionality; don't attempt to insert them directly, as that becomes a source of non-determinism in any threaded solution.
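
A sketch of such a staging buffer, with hypothetical Entity/World types and the archetype move reduced to a single call:

#include <mutex>
#include <utility>
#include <vector>

struct Entity { int id = 0; /* component data gathered during init */ };

class EntityStage {
public:
    // Safe to call from any system while the frame is running.
    void stage(Entity e) {
        std::lock_guard<std::mutex> lock(mutex_);
        staged_.push_back(std::move(e));
    }

    // Called once per frame, single-threaded, after all systems have
    // completed; the new entities first become visible next frame.
    template <typename World>
    void commit(World& world) {
        std::vector<Entity> batch;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            batch.swap(staged_);
        }
        for (auto& e : batch)
            world.insertIntoArchetype(std::move(e)); // hypothetical API
    }

private:
    std::mutex mutex_;
    std::vector<Entity> staged_;
};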

Building such an architecture is not for the faint of heart, and I do NOT suggest it for 99% of folks. I actually needed my solution for some work I'm doing, but most folks just don't need this crazy level of performance, and it would be a waste. But given the CPUs available today, specifically AMD's gamer-targeted 3900X series, and a future that looks like more cores across all form factors, at least understanding how this functions is one possible and semi-proven way to solve the problem. Add a bit of component shadowing for decoupled simulation and rendering, networking and/or fixed deterministic execution, and you too can actually fully utilize a 2990WX (or a 3990, or future goodness).

@All8Up Excellent information. Massive simulations (in games) are indeed the most fun. Some clarifications and follow-up questions…

So you directly modify existing Components (method #1), but enqueue Commands for creating and deleting Components (method #4.1)?

Is the DAG of Systems only used for sorting them (the pipeline) into an order which avoids concurrent writes to another System/Component, or is it used for anything else, like deciding where to use locks, or suspending tasks in the job scheduler?

Submitting a job for every Entity surely doesn't make much sense, but a scheduler could implement the parallel_for, and be used for other things like async file I/O (resource loading).

Do you put the barrier after/lock around the whole System, or use some more granular approach?

So you directly modify existing Components (method #1), but enqueue Commands for creating and deleting Components (method #4.1)?

Not exactly. I make the actual entities and their data in a staged storage area, so you can initialize their components and do other things as required. Since I use an archetype-like organization of component storage, at the end of the frame the cleanup is just a matter of finding which components exist on the new entity and then moving it to the appropriate archetype. In this way, entity creation uses the same APIs and there are no special behaviors involved. The only sneaky catch-22 is that you can't look up the entity from other threads until the next frame, but since it should not participate in the current world update anyway, there is no logical way another thread would know it exists until later.

Is the DAG of Systems only used for sorting them (the pipeline) into an order which avoids concurrent writes to another System/Component, or is it used for anything else, like deciding where to use locks, or suspending tasks in the job scheduler?

The DAG, which is computed as just a list of buckets rather than an actual graph, is somewhat like a list of instructions given to the scheduler system. So, for instance, the resulting organization as outlined above (“game time”, “update spatial” → “separation force” → “alignment force” → “cohesion force” → “apply force” → “render dispatch”) becomes:

parallel("game time")
parallel("update spatial")
barrier()
parallel("separation force")
barrier()
parallel("alignment force")
barrier()
etc

The barriers are the only places with any form of blocking; they just wait until all threads arrive before allowing the next parallel instruction to begin execution. I tend to use an exponential backoff spin in such cases rather than actual locks, because typically the delay at the barriers is extremely low, down in the 10 µs or lower range. As you add more systems to your game, the number of barriers doesn't tend to grow very fast, and you end up with many more cases of multiple parallel calls between barriers. With my custom scheduler there is no delay between parallel calls: when a thread finishes its work, it immediately starts on the next parallel without waiting, so between barriers there are no other locks involved unless the systems themselves introduce them.
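
For reference, a reusable spin barrier with exponential backoff along the lines described might look like this sketch (not the poster's actual implementation):

#include <atomic>
#include <chrono>
#include <cstddef>
#include <thread>

class SpinBarrier {
public:
    explicit SpinBarrier(std::size_t threadCount)
        : threshold_(threadCount), count_(threadCount) {}

    void arriveAndWait() {
        const std::size_t phase = phase_.load(std::memory_order_relaxed);
        if (count_.fetch_sub(1, std::memory_order_acq_rel) == 1) {
            // Last thread to arrive: reset the count, release everyone.
            count_.store(threshold_, std::memory_order_relaxed);
            phase_.fetch_add(1, std::memory_order_release);
        } else {
            // Spin with exponential backoff; barrier waits are expected
            // to be very short (~10us), so avoid kernel-level blocking.
            unsigned delayUs = 1;
            while (phase_.load(std::memory_order_acquire) == phase) {
                std::this_thread::sleep_for(std::chrono::microseconds(delayUs));
                if (delayUs < 64) delayUs *= 2;
            }
        }
    }

private:
    const std::size_t threshold_;
    std::atomic<std::size_t> count_;
    std::atomic<std::size_t> phase_{0};
};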

Submitting a job for every Entity surely doesn't make much sense, but a scheduler could implement the parallel_for, and be used for other things like async file I/O (resource loading).

I don't use the same scheduler for async file IO and other long-running tasks; everything in the world-update scheduler must be tightly bounded. During engine initialization, there are two different schedulers created: one is basically a work-stealing job system for any work without predictable completion times; the other is specific to the per-frame update, specifically running the world update. To integrate IO results into the world, you make a request against a shared IO component from within a system and it returns a handle. That handle will simply return null until the WSJS completes the work involved in, say, loading a mesh and making it ready for the renderer. In this way the world update is always consistent and without blocking; if you absolutely must block the world update for some reason, that's done external to the world, i.e. just block the next update call.
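
A sketch of such a handle, using std::shared_future as a stand-in for the job system's result (Mesh and tryGet() are illustrative, not the actual API):

#include <chrono>
#include <future>
#include <memory>
#include <string>

struct Mesh { std::string name; };

class MeshHandle {
public:
    explicit MeshHandle(std::shared_future<std::shared_ptr<Mesh>> f)
        : future_(std::move(f)) {}

    // Non-blocking: systems poll each frame and simply skip work until
    // the resource is ready, keeping the world update consistent.
    std::shared_ptr<Mesh> tryGet() const {
        if (future_.wait_for(std::chrono::seconds(0)) == std::future_status::ready)
            return future_.get();
        return nullptr;
    }

private:
    std::shared_future<std::shared_ptr<Mesh>> future_;
};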

There is a lot involved in this dual-scheduler system. For instance, the schedulers auto-balance the number of threads given to each side. So, if you are writing an RTS and only have 5k entities, but they are pounding out pathfinding calls constantly (pathfinding is run on the WSJS side), the balancing may choose to give the WSJS 10 worker threads while the world update only gets 2 or 3. Or, if you are in a menu screen, there may be only 1 thread in each; it's completely dynamic and adjusts as needed. In general, I don't want to use all cores just because they exist; it doesn't make sense to run 16 threads to update 100 entities, so it runs with one thread until thread utilization hits about 75%, or the update time hits a threshold point, and then it adds another thread. On a 32-core 2990WX it takes over 30,000 flocking entities before it fully maxes out the thread counts.

Do you put the barrier after/lock around the whole System, or use some more granular approach?

I think that was answered above in the DAG explanation. There are more rules in the DAG than I have outlined, and as part of turning it into the series of instructions fed to the scheduler, I actually run a bit of an optimizer pass. The scheduler is a different sort of beast than most folks expect, because even “barrier” is a higher-level concept than the underlying control system. I won't go into details, but generally speaking there are more instructions involved, and you can organize them to avoid even more blocking with a little intelligence. The primary goal is that if you have a series of parallel calls without barriers in between, there are absolutely no blocking bits, and a thread that finishes one parallel call moves immediately into the next without any blocking or overhead. Threads only slow down if there is an intentional barrier, if they run out of work, or if the work in the tasks does something blocking (bad). This was the result of a lot of work on how to avoid Amdahl's law while executing a game loop with massive parallel execution; it pays off, but it was several years of research work.

