<p>Presents new principles for online monitoring in multiprocessors (especially massively parallel processors) and then focuses on the effect of the aliasing probability on the error detection process. In the proposed test architecture, concurrent testing (online monitoring) at the system level is accomplished by testing, at run time, the data and control dependences of the algorithm currently executing on the parallel computer. To support this, each message carries both its source and its destination address. At each message source, the sequence of destination addresses of the outgoing messages is compressed on a block basis. At the same time, at each destination, the sequence of source addresses of all incoming messages is also compressed on a block basis. The instructions executed by the processing elements (PEs) can be compressed concurrently as well. This procedure creates an image of the data dependences and of the control flow of the running algorithm. At the end of each computational block, this image is compared with a reference image created at compilation time. The main contributions of this work are new principles for the online system-level testing of multiprocessor systems, based on signaturing and monitoring the data dependences together with the control dependences, and an analytical model and analysis of the address compression process used to monitor data routing.</p>
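<p>The block-based address compression described above can be illustrated with a minimal sketch. It is not the architecture from the paper: the 16-bit width, the polynomial, and every name in it (<code>compress</code>, <code>block_signature</code>, <code>check_block</code>) are assumptions chosen only for illustration. A message source folds the destination addresses of one computational block into a signature and compares it with a reference signature that would have been produced at compilation time.</p>

```python
# Hypothetical sketch: width, polynomial, and all names are assumptions,
# not details taken from the paper.
WIDTH = 16
POLY = 0x1021               # CRC-16-CCITT polynomial, illustrative choice
MASK = (1 << WIDTH) - 1


def compress(signature, address):
    """Fold one address into the running signature (LFSR-style division)."""
    signature ^= address & MASK
    for _ in range(WIDTH):
        msb = (signature >> (WIDTH - 1)) & 1
        signature = (signature << 1) & MASK
        if msb:
            signature ^= POLY
    return signature


def block_signature(addresses):
    """Compress the address sequence of one computational block."""
    signature = 0
    for address in addresses:
        signature = compress(signature, address)
    return signature


def check_block(observed_addresses, reference_signature):
    """Compare the run-time signature with the compile-time reference."""
    return block_signature(observed_addresses) == reference_signature


# Reference image for one block, as a compiler might precompute it.
expected_destinations = [3, 7, 7, 12]
reference = block_signature(expected_destinations)

print(check_block([3, 7, 7, 12], reference))  # correct routing
print(check_block([3, 7, 8, 12], reference))  # one misrouted message
```

<p>Because the signature is much shorter than the address sequence it summarizes, some erroneous sequences can map to the same signature as the correct one. This is the aliasing whose probability the paper analyzes.</p>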
Index Terms: parallel machines; computer testing; error detection; probability; network routing; concurrent test architecture; massively parallel computers; error detection; online monitoring; multiprocessors; aliasing probability; system level monitoring; run-time testing; data dependences; control dependences; message source address; message destination address; block compressed sequence; concurrent instruction compression; control flow checking; computational block; reference image; compilation; online system-level testing; signature analysis; data routing process; packet-switched routing

M. Sugie, Y. Sato, K. Iwasaki and M. Hancu, "A Concurrent Test Architecture for Massively Parallel Computers and Its Error Detection Capability," in IEEE Transactions on Parallel & Distributed Systems, vol. 5, no. , pp. 1169-1184, 1994.
