Do you have a source where you got the equations? Do you know what the limitations are, if any, for accuracy?
This is my own implementation of quaternion differentials. I didn't get it from any web source or anything. I'm a mathematics undergrad, but I was suprised at how long it took me to understand this and get it right. I'm not sure I still completely understand it.

If you notice, I wasted some RAM to maintain consistency between the array position and quaternion dimensions.
As for accuracy, I haven't done any mathematical estimates of the error. Instead, I optimized the output with some Python scripts using LOTS of real flight data. I have a another degree in Ops Research with a specialization in numerical optimization. Its intended to be a 32-bit floating point implementation for maximal accuracy. Its the best I could get it when compared to a 64-bit floating point implementation of the same algorithm updated at every cycle. Honestly, the gyro offset is probably an order of magnitude bigger source of error than this code.
It seems like you could just do the quaternion update for every gyro sample (e.g. 1000 Hz), and then compute the rotation matrix and convert to flight angle and roll at a lower frequency if desired.
I found that updating the quaternions 100 times per second gave the best accuracy. I think this is the case because at 1000Hz the 32-bit pieces get really small and the error builds. Accuracy is preserved by updating the partial rotations at every cycle, but only updating the quaternion at 100 times per second. I haven't done any mathematical error estimates (yet) so I could be wrong, but when compared to my other method I became convinced it was "close enough" and focused attention on other parts of the code.
When you said that an Arduino Uno could keep up with doing everything each cycle, was that with the 8-bit Atmega running at 16 MHz and no floating point unit?
Yes on the Uno, but it used another rotation routine called getDCM2DRotn(), which is still in the rotation.ino file. This is the one I wrote the NAR research paper on. I originally ran my code on an 8-bit Uno at 16MHz and quaternions were too slow. I came up that algorithm after studying the Direct Cosine Method over a TON of beers one night. I was surprised it worked, and it took me another month to write the mathematical proof of why it worked. I'm still not totally sure that proof is correct, but its in the NAR paper. The algorithm is very fast on an Uno due to the use of trigonometric small angle approximations and only having a few floating point operations.
Below are a few plots comparing the two methods from one of my flights this past weekend. The error plot compares the two methods against a 64-bit implementation of quaternion differentials updated at every sample, which is assumed to be the most precise as possible. DCM2D tends to lose accuracy the closer it gets to 90 degrees, which is expected since it uses Euler angles which have the gimbal-lock problem at 90 degree rotations. Since I didn't care about any rotations past apogee, the sacrifice was worth the performance, especially on an Uno.
