Issue No.07 - July (2004 vol.15)
Li Xiao , IEEE Computer Society
Songqing Chen , IEEE Computer Society
Xiaodong Zhang , IEEE
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TPDS.2004.15
<p><b>Abstract</b>—In a cluster system with dynamic load sharing support, a job submission or migration to a workstation is determined by the availability of CPU and memory resources on that workstation at the time. In such a system, a small number of running jobs with unexpectedly large memory allocation requirements may significantly increase the queuing delays of the remaining jobs with normal memory requirements, slowing down the execution of each individual job and decreasing system throughput. We call this phenomenon the <it>job blocking</it> problem because the big jobs block the execution pace of the majority of jobs in the cluster. Since the memory demand of jobs may not be known in advance and may change dynamically, unsuitable job submissions/migrations are highly likely to cause the blocking problem, and existing load sharing schemes are unable to handle it effectively. We propose two schemes to address this problem. The first scheme, <it>Network RAM supported load sharing</it>, combines job migrations with network RAM: it uses remote execution to initially allocate a job to the most lightly loaded workstation and, if necessary, uses network RAM to provide the job with a global memory space larger than would otherwise be available. This scheme has the merits of both job migrations and network RAM. Our experiments show its effectiveness and scalability. However, this scheme requires a network RAM facility in the cluster, which may cause additional overhead and increase cluster network traffic. To address this limitation, we propose a second scheme, <it>memory reservation</it>, incorporated with dynamic load sharing, which adaptively reserves a small set of workstations to provide special services to the jobs demanding large memory allocations. As soon as the blocking problem is resolved by the memory reservation scheme, the system adaptively switches back to the normal load sharing state.
Both schemes target large data-intensive jobs in clusters and are mutually complementary. The network RAM supported load sharing scheme can fully utilize the cluster's global memory space, while the memory reservation scheme has the advantages of simple implementation and low overhead. Thus, both can be effective alternatives and can be practically deployed in cluster computing under different system conditions.</p>
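The two schemes summarized in the abstract can be illustrated with a small placement sketch. This is not the authors' implementation: the `Workstation` class, the `place_job`, `reserve`, and `release` functions, the use of the running-job count as a load measure, and the megabyte units are all hypothetical choices made for illustration. The sketch assumes a job is placed on the most lightly loaded workstation that can hold it, that network RAM lets a job borrow idle memory from other workstations, and that reservation sets aside the workstations with the most free memory.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Workstation:
    """Hypothetical model of one cluster node (names and units are assumptions)."""
    name: str
    mem_total: int          # physical memory in MB
    mem_used: int = 0       # memory currently allocated in MB
    jobs: int = 0           # running jobs, used here as a simple load measure
    reserved: bool = False  # True while set aside for large-memory jobs

    @property
    def mem_free(self) -> int:
        return self.mem_total - self.mem_used


def place_job(nodes: List[Workstation], demand_mb: int,
              network_ram: bool = False) -> Optional[Tuple[str, int, int]]:
    """Return (node name, local MB, remote MB), or None if the job must wait."""
    # Plain load sharing: least loaded node whose free memory covers the demand.
    fits = [n for n in nodes if n.mem_free >= demand_mb]
    if fits:
        target = min(fits, key=lambda n: n.jobs)
        target.mem_used += demand_mb
        target.jobs += 1
        return target.name, demand_mb, 0

    # Network RAM variant (assumed behaviour): run on the least loaded node
    # and borrow the shortfall from idle memory on the other nodes.
    if network_ram and sum(n.mem_free for n in nodes) >= demand_mb:
        target = min(nodes, key=lambda n: n.jobs)
        local = target.mem_free
        remote = demand_mb - local
        target.mem_used += local
        target.jobs += 1
        need = remote
        for donor in sorted(nodes, key=lambda n: n.mem_free, reverse=True):
            if donor is target or need == 0:
                continue
            take = min(donor.mem_free, need)
            donor.mem_used += take
            need -= take
        return target.name, local, remote

    # No single node can hold the job: it waits in the queue.
    return None


def reserve(nodes: List[Workstation], count: int) -> List[str]:
    """Set aside the `count` nodes with the most free memory for large jobs."""
    chosen = sorted(nodes, key=lambda n: n.mem_free, reverse=True)[:count]
    for n in chosen:
        n.reserved = True
    return [n.name for n in chosen]


def release(nodes: List[Workstation]) -> None:
    """Return to the normal load sharing state by clearing all reservations."""
    for n in nodes:
        n.reserved = False
```

In this sketch a caller would route normal jobs to the nodes with `reserved == False` and large jobs to the reserved ones, then call `release` once the queue of large jobs has drained. How the system detects blocking and decides when to switch back is not specified in the abstract and is left out here.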
Cluster computing, distributed systems, load sharing, job blocking, memory-intensive workloads, trace-driven simulations.
Li Xiao, Songqing Chen, Xiaodong Zhang, "Adaptive Memory Allocations in Clusters to Handle Unexpectedly Large Data-Intensive Jobs", IEEE Transactions on Parallel & Distributed Systems, vol.15, no. 7, pp. 577-592, July 2004, doi:10.1109/TPDS.2004.15