// now, to optimize the s-inside loop:
//   1. optimize away all the index calculations, using
//      two separate pointers each for d & e
//   2. move off the last 'l' iteration to its own copy
//   3. in the main one, rlim is now always odd; unroll r loop once
//   4. move the first update of r values after second copy
//      (by updating all references to it)
//   5. rename all variables from second copy
//   6. move all variables up into first loop (the point of this strategy is
//      that they're actually all small deltas away)
//   7. condense down to the small deltas (now we're updating 8 items
//      per iteration)
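
// illustration of step 1 only: a minimal sketch, not the s-inside loop itself.
// the names sketch_indexed, sketch_pointers and half are hypothetical, and the
// add/subtract body is just a stand-in for the real per-element work. both
// functions write the same values; the second one sets up two pointers each
// for d & e once, so the loop body has no index calculations left in it.
static void sketch_indexed(float *d, const float *e, int half)
{
   int i;
   for (i = 0; i < half; ++i) {
      // every access recomputes an offset from i
      d[i       ] = e[i] + e[i + half];
      d[i + half] = e[i] - e[i + half];
   }
}

static void sketch_pointers(float *d, const float *e, int half)
{
   // two pointers per array: one walks the low half, one the high half
   float *d0 = d, *d1 = d + half;
   const float *e0 = e, *e1 = e + half;
   int i;
   for (i = 0; i < half; ++i) {
      *d0 = *e0 + *e1;
      *d1 = *e0 - *e1;
      // advance all four by a constant step instead of re-indexing
      ++d0; ++d1;
      ++e0; ++e1;
   }
}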