Boy was I ever surprised by the results. I thought for certain that my rasterizer was the bottleneck of the whole pipeline, since it was doing the most computational work - perspective interpolation being the largest consumer. Here is the graph showing the percentage of execution time actually used by each function:
![](http://members.gamedev.net/JasonZ/Flexline/Flexline_020.jpg)
Let me translate for you from bottom to top of the graph:
1. CAdapter2D::setElement -> Setting a single element in a 2D buffer
2. CAdapter2D::setElement -> Same as 1 but with a different template type
3. CDepthTestUnit::operate -> Running the z-depth test compare (!!!)
4. WinMain -> no explanation needed here
5. CSampler2D::sample_point -> Point sampling a texture - no filtering used(!!!)
6. CRasterizer::operate -> the rasterizer comes in at number 6?
So overall what this thing is telling me is this: memory access is killing my performance. Three different memory access functions are each taking more execution time than the rasterizer. Of course they are called many more times than the rasterizer, but that is exactly why a small improvement in one of those functions will give a much bigger return than an improvement in the rasterizer.
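To put rough numbers on that argument, here is a minimal sketch based on Amdahl's law. The function name `overall_speedup` is made up for illustration, and any percentages plugged into it are hypothetical rather than read off the profile above.

```cpp
#include <cassert>

// Overall program speedup when a function that accounts for `fraction`
// of total run time (0..1) is made `local_speedup` times faster.
// This is Amdahl's law; the name and signature are illustrative only.
double overall_speedup(double fraction, double local_speedup) {
    assert(fraction >= 0.0 && fraction <= 1.0);
    assert(local_speedup > 0.0);
    // The untouched part keeps its cost; the optimized part shrinks.
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup);
}
```

For example, if a function hypothetically held 30% of the run time, making it 10% faster (`overall_speedup(0.30, 1.1)`) would speed up the whole frame by roughly 2.8%, while the same 10% gain on a function holding only 5% of the run time would yield about 0.5%.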
So from this point on, I will start to check into how to improve the memory access pattern. This shouldn't be too hard, since I have abstracted the memory interface from the processors - if I want to use a different storage mechanism or a different access order, I should have the flexibility to do so. I'll be working on this for the next while, so I'll post the results when I get somewhere...
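As a rough idea of what swapping the storage mechanism behind an abstracted interface could look like, here is a minimal sketch. `Buffer2D`, `RowMajorLayout` and `TiledLayout` are hypothetical names, not the actual Flexline classes, and the 4x4 tile size is an arbitrary assumption; only the `setElement` name is borrowed from the profile. Whether tiling actually helps would have to be measured.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical layout policy: plain row-major storage.
struct RowMajorLayout {
    static std::size_t storage_size(std::size_t w, std::size_t h) { return w * h; }
    static std::size_t index(std::size_t x, std::size_t y, std::size_t w) {
        return y * w + x;
    }
};

// Hypothetical layout policy: 4x4 tiles, so pixels that are close in
// both x and y also end up close together in memory.
struct TiledLayout {
    static constexpr std::size_t T = 4;  // tile edge length (assumed)
    static std::size_t tiles_per_row(std::size_t w) { return (w + T - 1) / T; }
    static std::size_t storage_size(std::size_t w, std::size_t h) {
        // Round both dimensions up to whole tiles.
        return tiles_per_row(w) * ((h + T - 1) / T) * T * T;
    }
    static std::size_t index(std::size_t x, std::size_t y, std::size_t w) {
        const std::size_t tile = (y / T) * tiles_per_row(w) + (x / T);
        return tile * T * T + (y % T) * T + (x % T);
    }
};

// The callers only see setElement/getElement; the layout is a template
// parameter, so changing the access pattern does not touch caller code.
template <typename Elem, typename Layout = RowMajorLayout>
class Buffer2D {
public:
    Buffer2D(std::size_t w, std::size_t h)
        : w_(w), h_(h), data_(Layout::storage_size(w, h)) {}

    void setElement(std::size_t x, std::size_t y, const Elem& value) {
        assert(x < w_ && y < h_);
        data_[Layout::index(x, y, w_)] = value;
    }

    Elem getElement(std::size_t x, std::size_t y) const {
        assert(x < w_ && y < h_);
        return data_[Layout::index(x, y, w_)];
    }

private:
    std::size_t w_, h_;
    std::vector<Elem> data_;
};
```

A depth buffer could then be declared as `Buffer2D<float, TiledLayout> depth(640, 480);` and the depth test or sampler code calling `setElement`/`getElement` would stay the same while different layouts are profiled against each other.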