After wondering why it was taking so many samples to get good precision, I decided to dig in and figure out what was actually going on. First I forced the texture samplers to use the top mip level by passing zero derivatives to the tex2Dgrad HLSL function. The normal/heightmap I was using is a 256x256 texture, so I forced the algorithm to take 300 samples (yes, per texel - it took a while to render a frame...), which should produce a perfect image since the heightmap is being oversampled.
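For reference, the mip-forcing trick is just a matter of what you pass for the derivatives - a rough sketch (the sampler name and the height-in-alpha channel here are my own stand-ins, not the actual shader):

```hlsl
// Zero screen-space derivatives make the hardware pick mip level 0,
// so every tap reads the full-resolution 256x256 heightmap.
float2 dx = float2(0.0f, 0.0f);
float2 dy = float2(0.0f, 0.0f);
float height = tex2Dgrad(heightSampler, texCoord, dx, dy).a; // height in alpha (assumed)
```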
I then saved a screenshot of the output and reduced the sampling rate to 256, then 200, then 128, and so on - and finally decided that something was wrong. Even the oversampled version wasn't perfect: there was still some aliasing on edges perpendicular to the view direction. And I know it doesn't take that many samples to get a good image with other people's implementations.
That was when I started modifying the interpolation code. I am not sure how I came up with what I was using before (has that ever happened to you???), but the new version is very simple and fast, and it actually reduced the instruction count by two [grins]! Using 16 texture samples now produces the exact same image as 50 samples did in my test scene, so I am very happy with the results.
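I haven't posted the actual shader, but the usual way to do this kind of interpolation is a single linear (secant) step between the last sample above the surface and the first sample below it - something like this sketch, where every variable name is a stand-in for whatever your ray-march loop tracks:

```hlsl
// After the linear search crosses the surface:
//   currSampledDepth < currLayerDepth             (ray is now below the heightfield)
//   prevSampledDepth > currLayerDepth - layerStep (previous step was still above)
float afterDepth  = currSampledDepth - currLayerDepth;               // negative
float beforeDepth = prevSampledDepth - (currLayerDepth - layerStep); // positive
float t = afterDepth / (afterDepth - beforeDepth); // where the two segments intersect
float2 hitUV = lerp(currUV, prevUV, t);            // refined texture coordinate
```

One cheap step like this is usually enough to kill the stair-stepping that would otherwise take dozens of extra samples to hide.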
The moral of the story is that you should always be open to re-examining old code - even code you think has proven itself stable!
Since my laptop only has a shader model 2.0 card in it, I have been generating a few demo movies offline by saving screenshots one at a time. I'll post some of them once I get them all built and compressed!