First (truly shocking!) optimization benchmarks

Post by Guest »

I've just played the tutorial.
The camera movement showed some serious lag, though it wasn't too annoying.
That scenario only had a few units on the ground, though.
Overall, I enjoyed it.

From the talks on this forum over the previous days, I thought this game was CPU-bound (AI, LOS, animation updates, etc.) rather than GPU-bound.
Hence profiling it with MS DirectX PIX sounded like a formality. I did it anyway...
As a benchmark I picked "WL01 - The Emperor's Plan", looking at the left of the scene from the French starting position.

These were by far the most shocking results I have ever run across with such a tool:

- 5000 DPUPs (DrawPrimitiveUP) per frame, meaning one of the most deprecated DX9 draw calls ("data specified by a user memory pointer") per sprite instance (plus all the terrain stuff)!
This can very easily be converted to 1 x DIP (DrawIndexedPrimitive) per sprite type (not instance!!!) with Hardware Instancing (SetStreamSourceFreq).
From 5000 DPUPs to 50 DIPs, I suppose?
- 5500 SetRenderState calls per frame, same as above (one per sprite instance, plus terrain).
This can very easily be converted to 5 x SetRenderState for the whole sprite renderer.
5000 SetRenderState vs 5 SetRenderState?
- 5000 SetVertexShader + 5000 SetPixelShader per frame, again the same problem.
This can very easily be converted to 1 x SetVertexShader + 1 x SetPixelShader for the whole sprite renderer.
5000+5000 vs 1+1?
- 22000 SetTexture calls, which probably come from the fact that one sprite instance needs more than one texture switch (4 on average?) because of composition (trousers, horses, hats, etc.).
Some terrain stuff is also included in the estimate.
Still, incredible. You should batch draw calls that share a material.
From 20000 to 250 SetTexture calls?
- 6600 x SetVertexShaderConstant per frame;
- 6600 x SetPixelShaderConstant per frame;

- The total number of DX9 API calls is steadily > 200,000 per frame.
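
To illustrate the "one draw call per sprite type, not per instance" idea from the list above, here is a minimal sketch in plain C++. It only does the CPU-side grouping and counts draw calls; it does not talk to Direct3D. The names `SpriteInstance`, `Batch` and `buildBatches` are hypothetical and not taken from the game or from PowerRender. In a real DX9 renderer each batch would upload its per-instance data into a second vertex stream, set the stream frequencies with `SetStreamSourceFreq`, and issue a single `DrawIndexedPrimitive` (hardware instancing in DX9 needs Shader Model 3.0).

```cpp
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Hypothetical per-instance data; the real engine's layout is unknown.
struct SpriteInstance {
    int   spriteType;  // which sprite sheet / quad setup this instance uses
    float x, y, z;     // world position
};

// One batch = all instances of one sprite type, drawable with a single call.
struct Batch {
    int spriteType;
    std::vector<SpriteInstance> instances;
};

// Group instances by type so draw calls scale with types, not instances.
// In DX9, drawing one batch would roughly mean:
//   SetStreamSourceFreq(0, D3DSTREAMSOURCE_INDEXEDDATA | instanceCount);
//   SetStreamSourceFreq(1, D3DSTREAMSOURCE_INSTANCEDATA | 1u);
//   DrawIndexedPrimitive(...);  // once per batch
std::vector<Batch> buildBatches(const std::vector<SpriteInstance>& sprites) {
    std::map<int, std::vector<SpriteInstance>> byType;
    for (const SpriteInstance& s : sprites) {
        byType[s.spriteType].push_back(s);
    }
    std::vector<Batch> batches;
    batches.reserve(byType.size());
    for (auto& entry : byType) {
        batches.push_back(Batch{entry.first, std::move(entry.second)});
    }
    return batches;
}

// Draw calls when every instance is submitted on its own (the DPUP pattern).
std::size_t drawCallsPerInstance(const std::vector<SpriteInstance>& sprites) {
    return sprites.size();
}

// Draw calls when each batch is submitted once with instancing.
std::size_t drawCallsInstanced(const std::vector<Batch>& batches) {
    return batches.size();
}
```

With 5000 instances spread over 50 sprite types, `drawCallsPerInstance` returns 5000 and `drawCallsInstanced` returns 50, which matches the estimate in the first bullet. The 50 types are the poster's guess, not a measured figure.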

The DirectX 9 API is quite inefficient on its own; abusing it in such a way is a performance killer to the nth power.
There is no way around this, so please fix it as soon as possible.
It may really make all the difference in the world with limited effort.
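
Adding up the rough "before vs after" figures quoted in the list above gives a feeling for the scale of the reduction. The sketch below is just that arithmetic in C++; the numbers are the poster's own estimates (including the guessed targets of 50, 5, 1+1 and 250), not new measurements, and the shader-constant calls are left out because no target was given for them.

```cpp
#include <string>
#include <vector>

// One row per API call family, using the rough figures quoted in the post.
struct Estimate {
    std::string call;
    long before;  // calls per frame as observed (rounded by the poster)
    long after;   // calls per frame the poster expects after batching
};

std::vector<Estimate> quotedEstimates() {
    return {
        {"DrawPrimitiveUP -> DrawIndexedPrimitive", 5000, 50},
        {"SetRenderState", 5000, 5},
        {"SetVertexShader + SetPixelShader", 5000 + 5000, 1 + 1},
        {"SetTexture", 20000, 250},
    };
}

// Sum of the "before" column.
long totalBefore(const std::vector<Estimate>& rows) {
    long sum = 0;
    for (const Estimate& r : rows) sum += r.before;
    return sum;
}

// Sum of the "after" column.
long totalAfter(const std::vector<Estimate>& rows) {
    long sum = 0;
    for (const Estimate& r : rows) sum += r.after;
    return sum;
}
```

For these four call families the totals are 40000 calls per frame before and 307 after, i.e. a reduction by a factor of roughly 130.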

Back to fun now.
Last edited by Guest on Wed Jun 10, 2015 12:55 pm, edited 1 time in total.

Post by con20or »

Moving this to the technical section.

Post by norb »

It would be awesome if we had a larger coding staff. I know our engine is older and does not take advantage of many of today's upgrades. It's one of those areas where we can't do as well as we would if we had more resources. I would love to have the time to upgrade the entire graphics engine... wishing.

Post by Gunfreak »

Are these things that can be fixed by you with mods, or by the devs now that you have told them? Or is this innate in the engine?

Post by Guest »

From a rather quick look at the DLL, I don't think the PowerRender engine is that bad after all, Norb. I might be wrong, however.
What we'd really need, at the very least, are some structures aimed at optimal rendering on top of its resources.
To name just a few: a display list (back-to-front sorting for entities with transparency) with parallel submission, minimization of render and shader state changes, stacks of render states and draw calls (DIP instancing is essential for your sprites), cbuffer emulation to update shader variables, and so on.
In the recent past I coded these kinds of utilities for my own (so far unfinished) projects on top of good old Ogre 1.9 (DX9 render system only), with success and remarkable benefits.
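
As an illustration of "minimization of changes in render states", here is a minimal sketch of a redundant-state filter in C++. It is not the poster's Ogre code and not PowerRender's API: `FakeDevice` is a hypothetical stand-in that only counts calls, and `RenderStateCache` is a made-up name. The sketch assumes that nothing else changes device state behind the cache's back.

```cpp
#include <cstddef>
#include <unordered_map>

// Hypothetical stand-in for a device; counts calls instead of using a GPU.
struct FakeDevice {
    std::size_t renderStateCalls = 0;
    void SetRenderState(int /*state*/, unsigned /*value*/) { ++renderStateCalls; }
};

// Remembers the last value sent for each state and drops redundant sets.
class RenderStateCache {
public:
    explicit RenderStateCache(FakeDevice& device) : device_(device) {}

    // Forwards to the device only if the value differs from the last one set.
    void set(int state, unsigned value) {
        auto it = current_.find(state);
        if (it != current_.end() && it->second == value) {
            return;  // redundant: the device already holds this value
        }
        current_[state] = value;
        device_.SetRenderState(state, value);
    }

private:
    FakeDevice& device_;
    std::unordered_map<int, unsigned> current_;  // state id -> last value sent
};
```

If a sprite renderer sets the same 5 states to the same values for each of 5000 sprites through such a cache, the device sees 5 calls instead of 25000. The same pattern can be applied to shaders and textures.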
I'd even offer my humble services, although my CV is fairly ridiculous. But let's talk about this tomorrow after some sleep. 3.25 hours of bike racing at an average speed of 31.5 km/h may have had some impact on my clarity of mind. :)
Ciao.

Post by Jim »

Norb is 100% of the entire coding staff for the main game as well as for the graphics engine. Mitra does the DLL AI coding, which is a major job in itself. Everyone else multitasks in other areas, as shown in the game credits. All of this is done (mostly) on a nights-and-weekends schedule. So when we say we have limited resources, we are not just blowing (black powder) smoke.

-Jim
"My God, if we've not got a cool brain and a big one too, to manage this affair, the nation is ruined forever." Unknown private, 14th Vermont, 2 July 1863

Post by Guest »

Nobody is going to call that into question, Jim. Never. :)

Nevertheless, taking my statistics as proof (and I was conservative for sure, probably missing additional thousands of DX9 API calls on a per-sprite-instance basis), it should be clear enough that a very limited and careful investment of the available resources/earnings would be well worth the benefits.

Can you imagine, after reducing those calls by the order of magnitude I pointed out above, how many new features you could potentially add to the game in the coming years without spending most of your time on anxious measurements and considerations (or even running into impossibility)?

Just take a look at the final appendix here and multiply accordingly.

Moreover, with higher frame rates you could enable vsync (some users are already asking for it), which would result in far fewer customer complaints about burned GPUs.
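
To show why the frame rate has to come up before vsync pays off, here is a small sketch of the usual simplified model: with vsync and plain double buffering, a finished frame waits for the next vertical blank, so the frame interval is rounded up to a whole number of refresh periods. The function name and the 60 Hz refresh period (about 16667 microseconds) are assumptions for illustration; triple buffering and driver behaviour are ignored.

```cpp
#include <cstdint>

// Effective frame interval in microseconds when presentation waits for the
// next vertical blank (simple double-buffered model, no triple buffering).
// Assumes renderMicros > 0 and refreshMicros > 0.
constexpr std::int64_t vsyncFrameIntervalMicros(std::int64_t renderMicros,
                                                std::int64_t refreshMicros) {
    // Round the render time up to a whole number of refresh periods.
    return ((renderMicros + refreshMicros - 1) / refreshMicros) * refreshMicros;
}
```

Under this model a frame that takes 100000 microseconds to render (10 FPS) is shown every 100002 microseconds, so vsync changes almost nothing. A frame that takes 10000 microseconds is held to 16667 microseconds, so the GPU idles for the rest of the period instead of rendering frames the display never shows.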

If Norb wants, he knows how to contact me. ;)

Nicolò
Last edited by Guest on Wed Jun 10, 2015 12:23 pm, edited 1 time in total.

Post by Guest »

A few more stats on the same scene to complete the picture:

- 6600 x SetVertexShaderConstant per frame;
- 6600 x SetPixelShaderConstant per frame;

- The total number of DX9 API calls is steadily > 200,000 per frame.

It's running at 10 FPS on my 4-year-old PC. But this is a subjective reference...
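
A quick back-of-the-envelope calculation from the two figures above, written as a C++ sketch. It uses 200,000 calls per frame as a lower bound and assumes, unrealistically, that the whole frame time is spent in API calls, so the per-call figure is only an upper bound on the average cost. The function names are made up for illustration.

```cpp
#include <cstdint>

// Figures quoted in the post.
constexpr std::int64_t kCallsPerFrame   = 200000;  // "> 200,000", used as a lower bound
constexpr std::int64_t kFramesPerSecond = 10;

// API calls issued per second at the quoted frame rate.
constexpr std::int64_t callsPerSecond() {
    return kCallsPerFrame * kFramesPerSecond;
}

// Time available per frame, in microseconds.
constexpr std::int64_t frameTimeMicros() {
    return 1000000 / kFramesPerSecond;
}

// Upper bound on the average cost of one call, in nanoseconds, if the
// whole frame were spent inside API calls (an assumption, not a measurement).
constexpr std::int64_t maxAvgNanosPerCall() {
    return frameTimeMicros() * 1000 / kCallsPerFrame;
}
```

That is at least 2,000,000 API calls per second, with at most 500 nanoseconds available per call on average within a 100 ms frame.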
Last edited by Guest on Wed Jun 10, 2015 1:00 pm, edited 1 time in total.