The generated assembly shows that the optimizer is not smart enough to
get rid of the loop - use `copyMem` instead.
At least the compiler is smart enough to constant-propagate the runtime
endian direction, which resolves the review comment.
Also clarify why a minimum length is enforced - it could perhaps be
revisited, but that would leave a slightly odd API.
The `array` overloads are actually unnecessary with an optimizing
compiler - as long as it can prove the length, any extra checks will go
away on their own.
Also add `initCopyFrom`.
* document optimizations
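
For illustration only, here is a minimal sketch of the loop-versus-`copyMem` difference described above. The names `loadLoop` and `loadCopy` are hypothetical and are not the procs touched by this change; the sketch assumes unsigned integer targets and reads in native byte order in the `copyMem` variant, so the two only agree on a little-endian host.

```nim
# Hypothetical sketch, not the code from this change.

proc loadLoop(T: typedesc[SomeUnsignedInt], x: openArray[byte]): T =
  # Byte-by-byte little-endian load; per the message above, the optimizer
  # did not turn a loop of this kind into a single load.
  doAssert x.len >= sizeof(T), "Not enough bytes"
  for i in 0 ..< sizeof(T):
    result = result or (T(x[i]) shl (i * 8))

proc loadCopy(T: typedesc[SomeUnsignedInt], x: openArray[byte]): T =
  # Single copy of sizeof(T) bytes in native byte order. The minimum-length
  # check stays, since copyMem itself does no bounds checking.
  doAssert x.len >= sizeof(T), "Not enough bytes"
  copyMem(addr result, unsafeAddr x[0], sizeof(T))

when isMainModule:
  let data = [byte 0x01, 0x02, 0x03, 0x04]
  when cpuEndian == littleEndian:
    # On a little-endian host both variants read the same value.
    doAssert loadLoop(uint32, data) == loadCopy(uint32, data)
```

When the argument is an `array` whose length is known at compile time, an optimizing backend can usually prove the `doAssert` condition and drop the check, which is the reasoning behind calling the `array` overloads unnecessary.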