These SSE optimizations will initially be used in my vector/matrix classes, but will eventually be used to speed up the rasterizer. I think SSE offers enough of a speed increase to make fully bilinear-filtered textured rendering practical. I have seen several tutorials that compute perspective-correct texture coordinates only at a few points along a scanline and use an affine (linear) mapping between them, and the quality looks just fine. However, I want to use that as a last resort - I would rather perform every optimization I can first, and only later drop down to a coarser interpolation method.
Of course, who knows how long it will be until I hit that barrier...