// now, to optimize the s-inside loop:
   //    1. replace all the index calculations with pointer arithmetic,
   //         using two separate pointers each for d & e
   //    2. peel the last 'l' iteration off into its own copy of the loop
   //    3. in the main copy, rlim is now always odd; unroll the r loop once
   //    4. move the first update of the r values to after the second copy
   //           (updating all references accordingly)
   //    5. rename all variables in the second copy
   //    6. move all variables up into first loop (the point of this strategy is
   //           that they're actually all small deltas away)
   //    7. condense down to the small deltas (now we're updating 8 items per iteration)
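   //
   // illustration of step 1 only, as a generic sketch (the names x, base, stride,
   // n and p are hypothetical and do not appear in this code):
   //    before:  for (i=0; i < n; ++i) x[base + i*stride] += 1;
   //    after:   p = x + base;
   //             for (i=0; i < n; ++i) { *p += 1; p += stride; }
   //    both forms touch the same elements; the second drops the per-iteration
   //    multiply and add used to form the index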