Software Distributed Shared Memory (S-DSM) provides a shared address space at run time and supports a wide range of applications on parallel computer systems built from commodity hardware. S-DSM caches remote data in local memory to reduce remote-memory-access latency.
This paper proposes methods for further reducing remote-memory-access latency in S-DSM by using an optimizing compiler that directly analyzes explicitly parallel shared-memory source programs. Specifically, it presents compilation techniques for issuing prefetches for remote-memory accesses and introduces a framework that supports the prefetch mechanism.
I have implemented these compilation techniques in an optimizing compiler, the Remote Communication Optimizer (RCOP). I have also implemented lightweight run-time systems on a PC cluster connected by Gigabit Ethernet (1000BASE-T). Experimental results using the SPLASH-2 benchmark suite show that the prefetch technique is effective for applications with coarse-grained synchronization.
To obtain high performance, it is necessary to choose an appropriate framework according to the characteristics of the applications and platforms.
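
The prefetch idea summarized above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the RCOP implementation: the class `ToyDsmNode`, the helpers `pages_touched` and `run_phase`, the 4096-byte page size, and the rule that every synchronization point discards all cached pages are hypothetical simplifications introduced here. The model only counts how many accesses would block on a remote fetch; it does not measure latency or overlap with computation.

```python
from dataclasses import dataclass, field

PAGE_SIZE = 4096  # bytes per shared page (illustrative assumption)


@dataclass
class ToyDsmNode:
    """Hypothetical model of one node's local cache of remote pages."""
    cached: set = field(default_factory=set)
    blocking_misses: int = 0  # accesses that had to wait for a remote fetch
    prefetches: int = 0       # pages requested ahead of their first use

    def acquire(self):
        # Simplifying assumption: a synchronization point discards
        # every locally cached copy of remote data.
        self.cached.clear()

    def prefetch(self, pages):
        # Request pages before they are needed; only pages that are
        # not already cached count as prefetches.
        for page in pages:
            if page not in self.cached:
                self.cached.add(page)
                self.prefetches += 1

    def read(self, addr):
        # A read of an uncached page is counted as a blocking miss.
        page = addr // PAGE_SIZE
        if page not in self.cached:
            self.blocking_misses += 1
            self.cached.add(page)


def pages_touched(base, count, elem_size):
    """Pages covered by a contiguous array region, as a compiler
    might derive from known loop bounds (hypothetical helper)."""
    if count <= 0:
        return range(0)
    first = base // PAGE_SIZE
    last = (base + count * elem_size - 1) // PAGE_SIZE
    return range(first, last + 1)


def run_phase(node, base, count, elem_size, use_prefetch):
    """One computation phase: synchronize, optionally prefetch,
    then read every element of the region in order."""
    node.acquire()
    if use_prefetch:
        node.prefetch(pages_touched(base, count, elem_size))
    for i in range(count):
        node.read(base + i * elem_size)
```

For example, calling `run_phase(node, 0, 2048, 8, False)` on a fresh `ToyDsmNode` touches four pages and records four blocking misses, while the same call with `use_prefetch=True` records four prefetches and no blocking misses. In this toy setting, the more pages a phase reads between two synchronization points, the more blocking misses a single batch of prefetches can replace.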