DirectX 11 video player - Performance considerations


Hi all, this is my first post here.

After months of studying and googling, I've managed to put together a desktop video player:

  • WPF/D3DImage for the frontend
  • DirectX 11 for the rendering backend

The two worlds communicate through surface sharing: DirectX 11 renders to a shared texture so that D3DImage (D3D9) on the WPF side can present the frame.

DirectX 11 is fed by a GStreamer decoder, which produces I420 planar buffers.

During my research I've understood that the preferred approach is to split the buffer into 3 textures (separating the Y, U and V planes) and then let a pixel shader reconstruct the ARGB view, so that's what I did.
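For reference, in I420 the Y plane is full resolution while U and V are half resolution in each dimension. Below is a minimal native D3D11 (C++) sketch of describing those three plane textures; `device`, `width`, `height` and the function name are placeholders, not my actual code, and the pixel shader is then expected to sample the three planes and apply a YUV→RGB matrix (e.g. BT.601).

```cpp
#include <d3d11.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Minimal sketch: one R8_UNORM texture per I420 plane (Y full size, U/V half size
// in each dimension). 'device', 'width' and 'height' are placeholders.
void CreatePlaneTextures(ID3D11Device* device, UINT width, UINT height,
                         ComPtr<ID3D11Texture2D> (&tex)[3],
                         ComPtr<ID3D11ShaderResourceView> (&srv)[3])
{
    const UINT planeWidth[3]  = { width,  width / 2,  width / 2 };
    const UINT planeHeight[3] = { height, height / 2, height / 2 };

    for (int p = 0; p < 3; ++p)
    {
        D3D11_TEXTURE2D_DESC desc = {};
        desc.Width            = planeWidth[p];
        desc.Height           = planeHeight[p];
        desc.MipLevels        = 1;
        desc.ArraySize        = 1;
        desc.Format           = DXGI_FORMAT_R8_UNORM;      // one byte per sample
        desc.SampleDesc.Count = 1;
        desc.Usage            = D3D11_USAGE_DYNAMIC;       // CPU-writable via Map()
        desc.BindFlags        = D3D11_BIND_SHADER_RESOURCE;
        desc.CPUAccessFlags   = D3D11_CPU_ACCESS_WRITE;

        device->CreateTexture2D(&desc, nullptr, tex[p].GetAddressOf());
        device->CreateShaderResourceView(tex[p].Get(), nullptr, srv[p].GetAddressOf());
    }
}
```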

My previous implementation was entirely GStreamer-based, including video output through the D3DVideoSink plugin. Comparing the two solutions brought a bitter surprise: the new one costs +30% CPU usage and +45% GPU usage.

The CPU overhead is mainly due to continuous texture/shader-resource-view creation; as for the GPU overhead… I really don't know.

I've tried creating dynamic textures once and then mapping/memcpy-ing/unmapping them, but the result is worse.

As far as I understand, my DirectX 11 stage is really simple: vertex shader + pixel shader (linear sampling), 3 textures (Y, U, V), 3 shader resource views, and a context draw/flush on each frame.

Is there a more efficient way to manage YUV frames?

Am I missing something basic?


Sorry for not answering your question directly, but maybe the performance issue is in how the D3DImage is being used in your WPF app. I believe there are some performance gotchas if you're not careful.

Instead of using a D3DImage, have you tried rendering in DirectX as normal but presenting to a WindowsFormsHost element in your WPF window?

e.g. set WindowsFormsHost.Child to a Windows Forms control, which then lets you get the control's handle (hwnd) to pass to DirectX.
If performance improves, you know it is something to do with the D3DImage usage.

Also make sure your back buffer is D3DFMT_A8R8G8B8 or D3DFMT_X8R8G8B8, etc.

I've written a WPF video player in the past with DirectX 9 and used the method above.
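For what it's worth, on the Direct3D 11 side the hwnd route looks roughly like this (a sketch only, not tested against your setup; `hwnd`, `width` and `height` come from the hosted Windows Forms control):

```cpp
#include <d3d11.h>
#include <dxgi.h>

// Minimal sketch: device + hwnd swap chain for a control hosted in a WindowsFormsHost.
// 'hwnd', 'width' and 'height' come from that hosted control.
bool CreateDeviceAndSwapChain(HWND hwnd, UINT width, UINT height,
                              ID3D11Device** device, ID3D11DeviceContext** ctx,
                              IDXGISwapChain** swapChain)
{
    DXGI_SWAP_CHAIN_DESC sd = {};
    sd.BufferDesc.Width  = width;
    sd.BufferDesc.Height = height;
    sd.BufferDesc.Format = DXGI_FORMAT_B8G8R8A8_UNORM;   // DXGI counterpart of D3DFMT_A8R8G8B8
    sd.SampleDesc.Count  = 1;
    sd.BufferUsage       = DXGI_USAGE_RENDER_TARGET_OUTPUT;
    sd.BufferCount       = 2;
    sd.OutputWindow      = hwnd;                         // handle of the hosted control
    sd.Windowed          = TRUE;
    sd.SwapEffect        = DXGI_SWAP_EFFECT_DISCARD;

    return SUCCEEDED(D3D11CreateDeviceAndSwapChain(
        nullptr, D3D_DRIVER_TYPE_HARDWARE, nullptr, 0,
        nullptr, 0, D3D11_SDK_VERSION,
        &sd, swapChain, device, nullptr, ctx));
}
```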

@desiado thanks for your reply.

The whole idea is to get past the airspace issue between WPF and WinForms, so I can't consider the “mixed” approach.

And I know that D3DImage is not that efficient, but I don't think that's the issue here.

Indeed, even if I comment out the presentation to the D3DImage (Lock…AddDirtyRect…Unlock), the DirectX 11 stage consumes about 40% of my i7-8750H to render 40 instances of a 704×576@15fps stream.

The same video configuration takes 30% CPU with the older D3DVideoSink implementation.

According to the VS performance profiler, this overhead is due to the continuous Texture2D creation, but then again I can't figure out how to feed the stage more efficiently.

Well, for starters, you should be running a round robin of textures rather than reusing the same ones every frame (or reallocating them every frame, which is crazy). So that's nine textures and nine SRVs, looping back around every third frame. That might be the core problem, or it might not.


@Promit thank you, but I'm not sure I understand your suggestion.

I'm processing live streams; how can I prefetch textures?

@fcetrini Remember that the GPU is generally operating a frame or two behind what you're currently doing.

  • Frame 1: Map texture, set data, unmap texture, render.
  • Frame 2: Map texture - but frame 1 is still being processed. Now there's a stall until the resource is clear.

There are two ways to deal with this. One is to pass the WRITE_DISCARD flag to your Map. This isn't perfect, but it works fine in a lot of these cases and might be good enough here. The second way, which has become the preferred approach for streaming resources, is to make a round robin of textures (sketched after the list below). Now the flow looks like this:

  • Frame 1: Map texture 1, set data, unmap, render.
  • Frame 2: Map texture 2, set data, unmap, render. Both texture 1 and 2 are now in use.
  • Frame 3: Map texture 1 again, frame 1 is hopefully done processing at this point so we're good.

I've written it as a flip-flop between two textures here for brevity, but 3 is considered ideal for most uses.
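In rough native D3D11 (C++) terms, the per-frame upload then looks something like the sketch below. All of the names here (the `planes`/`srvs` ring, the source stride and size arrays) are hypothetical, just to show the shape of the pattern, not your actual code:

```cpp
#include <d3d11.h>
#include <cstdint>
#include <cstring>

static const int kSlots = 3;    // ring size: 3 is usually enough to avoid stalls

// Sketch: upload one decoded I420 frame into slot (frameIndex % kSlots) of a
// pre-created ring of dynamic R8_UNORM textures, then draw with that slot's SRVs.
void UploadAndDraw(ID3D11DeviceContext* ctx,
                   ID3D11Texture2D* planes[kSlots][3],
                   ID3D11ShaderResourceView* srvs[kSlots][3],
                   const uint8_t* srcPlane[3], const UINT srcStride[3],
                   const UINT planeWidth[3], const UINT planeHeight[3],
                   UINT64 frameIndex)
{
    const int slot = static_cast<int>(frameIndex % kSlots);

    for (int p = 0; p < 3; ++p)
    {
        D3D11_MAPPED_SUBRESOURCE mapped = {};
        // WRITE_DISCARD: we don't need the old contents, so the driver can hand
        // back fresh memory instead of stalling until the GPU is done with it.
        ctx->Map(planes[slot][p], 0, D3D11_MAP_WRITE_DISCARD, 0, &mapped);

        // Row-by-row copy: mapped.RowPitch is often larger than the source stride.
        uint8_t* dst = static_cast<uint8_t*>(mapped.pData);
        for (UINT y = 0; y < planeHeight[p]; ++y)
            memcpy(dst + y * mapped.RowPitch,
                   srcPlane[p] + y * srcStride[p],
                   planeWidth[p]);                 // R8: one byte per texel

        ctx->Unmap(planes[slot][p], 0);
    }

    ctx->PSSetShaderResources(0, 3, srvs[slot]);   // bind this slot's Y/U/V views
    ctx->Draw(4, 0);                               // full-screen quad
}
```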

The GPU usage side of this might be more complex, given the WPF interaction. It's possibly worth prototyping this out in a raw WinForms dedicated window setting first to confirm that the base implementation is right.


@Promit thank you again.

I've tried implementing a pool of textures/SRVs, but it isn't improving CPU usage; maybe there is something fundamental I'm not understanding.

Let's say we have a 15-frames-per-second live stream: in my implementation I'm creating 3 brand-new textures and shader resource views for each frame, for a total of 45 textures and 45 shader resource views every second.

With a pool of pre-existing textures I've prepared 9 textures and 9 SRVs, but I still have to map/memcpy/unmap 45 textures every second: that is, the number of operations is the same.

To explain better, this is more or less my current implementation:

  • vertex shader/vertex buffer
  • pixel shader / sampler state
  • a single B8G8R8A8_UNorm Texture2D with D3D11_RESOURCE_MISC_SHARED (interop with the WPF D3DImage; see the sketch after this list)
  • render target view/output merger set on that texture
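In native D3D11 (C++) terms, that shared target and its render target view are created roughly like this (a sketch only; `device`, `width`, `height` and the function name are placeholders):

```cpp
#include <d3d11.h>
#include <dxgi.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Sketch: the single shared BGRA target that the D3D9Ex side (behind D3DImage)
// opens through its shared handle. 'device', 'width' and 'height' are placeholders.
HANDLE CreateSharedTarget(ID3D11Device* device, UINT width, UINT height,
                          ComPtr<ID3D11Texture2D>& target,
                          ComPtr<ID3D11RenderTargetView>& rtv)
{
    D3D11_TEXTURE2D_DESC desc = {};
    desc.Width            = width;
    desc.Height           = height;
    desc.MipLevels        = 1;
    desc.ArraySize        = 1;
    desc.Format           = DXGI_FORMAT_B8G8R8A8_UNORM;
    desc.SampleDesc.Count = 1;
    desc.Usage            = D3D11_USAGE_DEFAULT;
    desc.BindFlags        = D3D11_BIND_RENDER_TARGET | D3D11_BIND_SHADER_RESOURCE;
    desc.MiscFlags        = D3D11_RESOURCE_MISC_SHARED;     // share with the D3D9Ex device

    device->CreateTexture2D(&desc, nullptr, target.GetAddressOf());
    device->CreateRenderTargetView(target.Get(), nullptr, rtv.GetAddressOf());

    // The handle is what the D3D9Ex device opens to wrap the same surface for D3DImage.
    ComPtr<IDXGIResource> dxgiRes;
    HANDLE shared = nullptr;
    target.As(&dxgiRes);
    dxgiRes->GetSharedHandle(&shared);
    return shared;
}
```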

Then, on each decoded frame that arrives (the tail of this flow is sketched after the list):

  • split frame on 3 planes
  • create 3 Texture2D (one for each plane)
  • set 3 shader resource views on pixel shader
  • context draw/flush
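And the per-frame tail, again as a native sketch with placeholder names (`ctx`, the three SRVs, `rtv`):

```cpp
#include <d3d11.h>

// Sketch of the per-frame tail of the flow above; all names are placeholders.
void DrawFrame(ID3D11DeviceContext* ctx, ID3D11RenderTargetView* rtv,
               ID3D11ShaderResourceView* srvY,
               ID3D11ShaderResourceView* srvU,
               ID3D11ShaderResourceView* srvV)
{
    ID3D11ShaderResourceView* views[3] = { srvY, srvU, srvV };
    ctx->OMSetRenderTargets(1, &rtv, nullptr);   // the shared B8G8R8A8 texture's RTV
    ctx->PSSetShaderResources(0, 3, views);      // t0 = Y, t1 = U, t2 = V
    ctx->Draw(4, 0);                             // full-screen quad (shaders/buffers set at init)
    ctx->Flush();                                // so the D3D9/D3DImage side sees the finished frame
}
```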

I'm pretty sure there is something wrong in the basic architecture, also because of the high GPU usage.

fcetrini said:
Then, on each decoded frame arrived: split frame on 3 planes; create 3 Texture2D (one for each plane); set 3 shader resource views on pixel shader; context draw/flush.

Don't create new textures on each arrival of a decoded video frame. Instead, reuse textures from the pool you created at the outset, in a round-robin manner.

@desiado I've tried it that way, but it's consuming 45% CPU vs. 39–40% for the “create always” approach.

I think the bottleneck is the memcpy after the Map, used to update the textures.

I don't know which version of D3DVideoSink you were using, but my understanding is that at its core it uses this dynamic-texture method: i.e. create the textures at initialisation and then, after each decoded video frame arrives, map (lock), copy the video frame, unmap (unlock) and render.
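Roughly this kind of pattern, sketched here in C++ against the D3D9 API (not d3dvideosink's actual code; all names are placeholders):

```cpp
#include <d3d9.h>
#include <cstring>

// Rough sketch of the D3D9-era dynamic-texture update: the texture is created once
// (with D3DUSAGE_DYNAMIC), then locked/filled/unlocked on every decoded frame.
void UpdateD3D9Texture(IDirect3DTexture9* tex, const BYTE* src,
                       UINT srcStride, UINT rowBytes, UINT height)
{
    D3DLOCKED_RECT lr = {};
    if (SUCCEEDED(tex->LockRect(0, &lr, nullptr, D3DLOCK_DISCARD)))
    {
        BYTE* dst = static_cast<BYTE*>(lr.pBits);
        for (UINT y = 0; y < height; ++y)          // row by row: lr.Pitch may exceed srcStride
            memcpy(dst + y * lr.Pitch, src + y * srcStride, rowBytes);
        tex->UnlockRect(0);
    }
}
```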

As D3DVideoSink is open source, perhaps step through your old method with a debugger and try to work out how it differs (architecturally) from your new implementation. That should help you narrow down the problem.

