I keep running into situations where I want to take an array of 32 32-bit elements and copy it so that all the bit 0s end up in element 0, all the bit 1s in element 1, and so on for all 32 elements (in effect, transposing a 32x32 bit matrix). This is always the processing bottleneck for
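To make the operation concrete, here is a minimal, unoptimized sketch of what I mean. It is only a reference for the expected result, not a fast solution. The function name `transpose32_naive` is made up for illustration, and it assumes that bit `b` of input element `i` should land in bit `i` of output element `b`.

```c
#include <stdint.h>

/*
 * Naive reference (hypothetical helper, for illustration only).
 * Assumption: bit b of in[i] is copied to bit i of out[b],
 * so out[0] collects every bit 0, out[1] every bit 1, and so on.
 */
void transpose32_naive(const uint32_t in[32], uint32_t out[32])
{
    for (int b = 0; b < 32; b++) {
        uint32_t word = 0;
        for (int i = 0; i < 32; i++) {
            /* Extract bit b of element i and place it at bit i. */
            word |= ((in[i] >> b) & 1u) << i;
        }
        out[b] = word;
    }
}
```

This does 1024 single-bit extractions per call, which is the kind of cost that makes the step a bottleneck.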