Instead of using eta=-delta, use zeta=-(delta+1/2) to represent
delta. This variant only needs at most 590 iterations for 256-bit
inputs rather than 724 (by convex hull bounds analysis).
This adds a long comment explaining the algorithm and implementation choices by building
it up step by step in Python.
Comments in the code are also reworked/added, with references to the long explanation.