Upcoming processor generations increasingly provide instructions for sub-word parallelism. These instructions execute 2, 4 or 8 simple operations (add, sub) in parallel, or a single complex operation (sum of differences) on 2, 4 or 8 operand pairs. Current compilers still support the exploitation of sub-word parallelism only weakly. To remedy this, we have adapted methods from the design of parallel regular processor arrays. The causality constraints that influence the design flow of processor arrays can be relaxed for processors with sub-word parallelism. An algorithm calculating the Mahalanobis distance is used to illustrate the effect of this relaxation.
Based on this extended approach, we have obtained significant speed-ups of our test vehicle, of up to a factor of 3 on an Intel P4. With the conventional approach, assembly-level coding would have been required to achieve this.
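As a rough illustration of what sub-word parallelism means, the following sketch emulates a packed 8-bit addition in portable C: four 8-bit lanes held in one 32-bit word are added in a single pass, with each lane wrapping modulo 256. It is not the method described here and does not use the processor's sub-word instructions; the function names `packed_add_u8` and `scalar_add_u8` are hypothetical and chosen only for this example.

```c
#include <stdint.h>

/* Hypothetical helper: add four unsigned 8-bit lanes packed into
 * 32-bit words. Each lane wraps modulo 256, and no carry crosses
 * a lane boundary. */
uint32_t packed_add_u8(uint32_t a, uint32_t b)
{
    const uint32_t high = 0x80808080u;            /* top bit of every lane */
    uint32_t low_sum = (a & ~high) + (b & ~high); /* add the low 7 bits; the carry stays inside the lane */
    return low_sum ^ ((a ^ b) & high);            /* fold in the top bits without carrying out */
}

/* Scalar reference: the same result computed one lane at a time. */
uint32_t scalar_add_u8(uint32_t a, uint32_t b)
{
    uint32_t result = 0;
    for (int lane = 0; lane < 4; ++lane) {
        uint32_t x = (a >> (8 * lane)) & 0xFFu;
        uint32_t y = (b >> (8 * lane)) & 0xFFu;
        result |= ((x + y) & 0xFFu) << (8 * lane);
    }
    return result;
}
```

For example, `packed_add_u8(0x01FF7F10u, 0x0101017Fu)` yields `0x0200808Fu`, the same value the lane-by-lane loop produces. A hardware sub-word instruction performs this lane-wise operation directly, without the masking needed in the emulation.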