As you can see, I'm in dire need of a better way to do final gather, haha. This was just a hack job over a weekend, of course. Right now I'm rendering each photon contact as a point light, and there are just under 100,000 of them. As a result, it only runs at ~10fps. Lots of room for improvement.
The OpenCL code for doing the bounces
Of course, this being a new technology, there are a lot of caveats. First, OpenGL interop is tricky to get working. Don't forget to pass the OpenGL context and OS device handle to OpenCL when you create the OpenCL context! Second, when I shared multiple buffer objects, OpenCL refused to function correctly, saying it was out of resources. Instead, I lumped all the data to be shared into one giant VBO/cl_mem object, and that was mostly OK. Still, with large buffers (e.g. 500,000 particles) the OpenCL kernel seems to give up and stop running after a while. I have no idea why. Finally, uploading image data directly when creating the image seems to be a problem, but if you create the image first and then copy the data in separately, it seems to be fine.
This is on Nvidia, mind you. I hear ATI is much better.
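For what it's worth, the "one giant VBO" workaround mostly comes down to keeping track of offsets. Here is a minimal sketch in C of that bookkeeping. It is not the code from this project: the three per-particle arrays (position, velocity, color), the `SharedLayout` struct, the helper names, and the 256-byte alignment between arrays are all assumptions made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout: three per-particle float4 arrays packed back to back
   into a single buffer, so only one object has to be shared with OpenCL. */
typedef struct {
    size_t position_offset; /* byte offset of the position array */
    size_t velocity_offset; /* byte offset of the velocity array */
    size_t color_offset;    /* byte offset of the color array */
    size_t total_bytes;     /* size to allocate for the whole buffer */
} SharedLayout;

/* Round n up to the next multiple of align (align must be a power of two). */
static size_t align_up(size_t n, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (n + align - 1) & ~(align - 1);
}

/* Compute where each array starts inside the single shared buffer. */
static SharedLayout make_shared_layout(size_t particle_count)
{
    const size_t align = 256;                    /* assumed alignment, not from the post */
    const size_t array_bytes = particle_count * 4 * sizeof(float); /* one float4 each */
    SharedLayout layout;
    size_t cursor = 0;

    layout.position_offset = cursor;
    cursor = align_up(cursor + array_bytes, align);

    layout.velocity_offset = cursor;
    cursor = align_up(cursor + array_bytes, align);

    layout.color_offset = cursor;
    cursor = align_up(cursor + array_bytes, align);

    layout.total_bytes = cursor;
    return layout;
}
```

The idea would be to allocate the VBO with `total_bytes`, share that one buffer with OpenCL, and pass the offsets to the kernel so it can find each array. The same offsets work on the OpenGL side as attribute offsets into the VBO.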