This was actually much easier than I thought it would be. There is a sample implementation in the DXSDK, and modifying it to do what I wanted was pretty easy. However, I realized that there is another problem. I use the D3DXFont system for text rendering, and it internally creates and sets its own texture whenever it renders text. This means that any texture state caching I try to do will not work, since each GUI button draws itself and then its text. The texture changes on every draw, so the state changes end up interleaved instead of batched...
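To make the caching problem concrete, here is a minimal sketch in plain C++ with no Direct3D calls. `TextureId`, `TextureCache`, and `count_binds` are hypothetical names for illustration, and the sketch assumes that all button backgrounds share one texture while the font glyphs live in another.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a texture handle; a real renderer would hold
// a texture pointer or similar instead of an int.
using TextureId = int;

// Minimal texture state cache: it only counts a bind when the requested
// texture differs from the one it believes is currently set.
class TextureCache {
public:
    void bind(TextureId tex) {
        assert(tex >= 0);  // ids are non-negative in this sketch
        if (tex != current_) {
            current_ = tex;
            ++bind_count_;  // a real cache would issue the device call here
        }
    }
    std::size_t bind_count() const { return bind_count_; }

private:
    TextureId current_ = -1;  // -1 means nothing has been bound yet
    std::size_t bind_count_ = 0;
};

// Replays a draw order through the cache and returns how many binds
// the cache could not skip.
std::size_t count_binds(const std::vector<TextureId>& draw_order) {
    TextureCache cache;
    for (TextureId tex : draw_order) {
        cache.bind(tex);
    }
    return cache.bind_count();
}
```

With the button texture as 1 and the font texture as 2, three buttons drawn button-then-text give the order `{1, 2, 1, 2, 1, 2}`, and `count_binds` reports 6: the cache never gets to skip anything. The grouped order `{1, 1, 1, 2, 2, 2}` needs only 2 binds.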
So my options are to turn the text into scene graph objects that will be attached to the buttons, or to roll my own text generation technique. With the former, the text objects would get sorted and rendered in a batch like all of the other geometry. This would let me render the buttons' backgrounds first, then render the text later on. I don't really lose any functionality by working this way, so I think it will be sufficient. Hopefully there aren't any other nasty problems waiting around the corner for me after I implement this change.
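The first option could look roughly like the sketch below, which assumes the render queue is sorted by a layer value and then by texture. `RenderItem`, `sort_for_batching`, and `count_texture_switches` are hypothetical names, not part of D3DX or of the engine described here, and the layer scheme (0 for backgrounds, 1 for text) is only one way to keep text on top.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical queue entry: one drawable piece of a GUI element.
struct RenderItem {
    int layer;         // 0 = button background, 1 = text drawn on top
    int texture;       // hypothetical texture id
    std::string name;  // label for debugging only
};

// Orders items by layer, then by texture. The stable sort keeps the
// submission order among items that share both keys.
void sort_for_batching(std::vector<RenderItem>& queue) {
    auto by_layer_then_texture = [](const RenderItem& a, const RenderItem& b) {
        return std::tie(a.layer, a.texture) < std::tie(b.layer, b.texture);
    };
    std::stable_sort(queue.begin(), queue.end(), by_layer_then_texture);
    assert(std::is_sorted(queue.begin(), queue.end(), by_layer_then_texture));
}

// Counts texture changes while walking the queue in order,
// including the very first bind.
std::size_t count_texture_switches(const std::vector<RenderItem>& queue) {
    std::size_t switches = 0;
    for (std::size_t i = 0; i < queue.size(); ++i) {
        if (i == 0 || queue[i].texture != queue[i - 1].texture) {
            ++switches;
        }
    }
    return switches;
}
```

For two buttons submitted as background, text, background, text (textures 1, 2, 1, 2), the unsorted queue has 4 texture switches. After `sort_for_batching` both backgrounds come first and both text items follow, which leaves 2 switches.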