<p><b>Abstract</b>—We present a hardware-algorithm for sorting <tmath>$N$</tmath> elements using either a <it>p</it>-sorter or a sorting network of fixed I/O size <tmath>$p$</tmath> while strictly enforcing conflict-free memory accesses. To the best of our knowledge, this is the first realistic design that achieves optimal time performance, running in <tmath>$\Theta ( {\frac{N \log N}{p \log p}})$</tmath> time for all ranges of <tmath>$N$</tmath>. Our result completely resolves the problem of designing an implementable, time-optimal algorithm for sorting <tmath>$N$</tmath> elements using a <it>p</it>-sorter. More importantly, our result shows that a sorting network of depth <tmath>$O(\log^2 p)$</tmath>, such as Batcher's classic bitonic sorting network, is all that is needed to achieve optimal time performance.</p>