Issue No.03 - March (2006 vol.7)
Published by the IEEE Computer Society
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/MDSO.2006.18
As cluster computing's popularity increases, so does the need to prioritize workflow. Researchers at the Technology Research, Education and Commercialization Center are developing system software for just that purpose: scheduling computing cycles on demand.
As more and more researchers turn to cluster-based systems to handle intensive computing tasks, the need to prioritize those tasks also increases. Researchers at the University of Illinois at Urbana-Champaign's Technology Research, Education and Commercialization Center (http://www.trecc.org) are developing system software for just that purpose: scheduling computing cycles on demand.
Technical project lead Greg Koenig says TRECC's approach to secure cluster computing on demand is a matter of managing workflow. "If a lower-priority job is consuming resources when a higher-priority job comes along, the scheduler must have enough information to determine whether it can fulfill the higher priority job's requirements, such as a deadline, with the remaining available resources," Koenig says. "If this cannot happen, then some mechanism must exist for clearing the resources for the higher-priority job."
The mechanism is usually some kind of checkpoint-restart component. In fact, the group's checkpoint-restart component was the first to demonstrate functionality. However, Koenig says, using more sophisticated mechanisms might make it possible to schedule resources in a way that fulfills a higher-priority job's requirements without disrupting those of a lower-priority job.
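The decision Koenig describes can be sketched in a few lines of Python. This is an illustrative sketch only, not TRECC's scheduler: the `Job` class, the `decide` function, and the reduction of a job's "requirements" to a simple node count are all assumptions made for the example. The scheduler first tries to fit the new job into free nodes and selects lower-priority jobs for checkpointing only when that fails.

```python
from dataclasses import dataclass


@dataclass
class Job:
    # Hypothetical job record; field names are assumptions for this sketch.
    name: str
    priority: int      # larger number = higher priority (assumed convention)
    nodes_needed: int


def decide(new_job, running, total_nodes):
    """Return ("run", []), ("preempt", victims), or ("wait", [])."""
    free = total_nodes - sum(j.nodes_needed for j in running)

    # Case 1: the remaining resources already satisfy the new job.
    if new_job.nodes_needed <= free:
        return ("run", [])

    # Case 2: clear resources by checkpointing strictly lower-priority
    # jobs, starting with the lowest priority.
    victims = []
    for job in sorted(running, key=lambda j: j.priority):
        if job.priority >= new_job.priority:
            break
        victims.append(job)
        free += job.nodes_needed
        if free >= new_job.nodes_needed:
            return ("preempt", victims)

    # Case 3: not enough lower-priority work to clear; the job must wait.
    return ("wait", [])


running = [Job("low", 1, 4), Job("medium", 2, 3)]
action, victims = decide(Job("urgent", 5, 4), running, total_nodes=8)
print(action, [j.name for j in victims])
```

A real scheduler would also weigh deadlines and the cost of the checkpoint itself, which this sketch leaves out.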
"We leverage a lot of third-party software components (including virtualization, fault monitoring, and co-allocation of resources) in order to take advantage of external work that we do not want to reinvent." Of these, Koenig believes TRECC's work in virtualization is in step with one of the hottest trends in cluster computing.
Virtualization works by effectively splitting one or more cluster nodes into a greater number of virtual nodes. To do this, the scheduler needs additional information that lets it decide whether a given job can even run effectively under a virtual environment. For example, can two jobs running under virtualization still meet their respective deadlines if they each receive 50 percent of the cycles from the processors they share?
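The 50 percent question can be made concrete with a small feasibility check. The sketch below is a back-of-the-envelope model, not anything described by the TRECC team: it assumes each job's remaining work is known in CPU-seconds at full speed, that a job's run time scales inversely with its share of the processor, and that virtualization overhead can be expressed as a single fraction. The function names and the `overhead` parameter are hypothetical.

```python
def meets_deadline(work_cpu_seconds, deadline_seconds, cpu_share, overhead=0.0):
    """True if a job finishes in time when given cpu_share of a processor.

    overhead is the assumed fraction of cycles lost to virtualization.
    """
    effective_share = cpu_share * (1.0 - overhead)
    if effective_share <= 0:
        return False
    return work_cpu_seconds / effective_share <= deadline_seconds


def can_share(jobs, cpu_share=0.5, overhead=0.0):
    """jobs is a list of (work_cpu_seconds, deadline_seconds) pairs."""
    return all(meets_deadline(w, d, cpu_share, overhead) for w, d in jobs)


# Two jobs, each receiving half of the shared processors' cycles.
jobs = [(100, 250), (120, 240)]
print(can_share(jobs))                 # no virtualization overhead
print(can_share(jobs, overhead=0.1))   # 10 percent of cycles lost
```

Under these assumptions, a pair of jobs that just fits at a 50 percent share can miss a deadline once even modest overhead is included, which is why hardware support that reduces that overhead matters to the scheduler.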
"Virtualization is going to be big, and in a couple years the computing community will start seeing it used many places in many interesting ways," says Koenig. He says Intel's Vanderpool and AMD's Pacifica represent hardware-based technologies that will help minimize virtualization's overhead.
The virtualization component currently lets researchers start and stop virtual machines on demand, and launch and successfully run MPI (Message Passing Interface) jobs inside a virtual machine. However, the TRECC team is still fine-tuning it. The group predicts that the fault-monitoring component, which identifies failing processors and can migrate pieces of the computation away from them, will be completed by midsummer 2006. The cluster co-allocation component, along with the intelligent middleware extensions for latency masking and heterogeneity management, should be completed by late summer.
Still, these researchers have their work cut out for them. According to Phil Papadopoulos—program director for grid and cluster computing at the San Diego Supercomputer Center at the University of California, San Diego—on-demand computing is as much a policy problem as a technology problem.
"The primary technology problem is not just checkpointing, but checkpointing an intercommunicating parallel program and then restarting it on different sets of resources," he says. "The policy part of on-demand is how do you prioritize on-demand requests, especially if you have multiple, conflicting requests."
Papadopoulos says the underlying binding of message-passing ports to physical hardware remains an open engineering-performance issue. However, he notes that companies such as Sun and IBM are building utility computing environments where available flops (floating-point operations per second) are negotiated in a service-level agreement.
TRECC program manager EJ Grabert says that as more protracted on-demand strategies for running jobs are developed, the chance of component failure increases. "To that end, a fourth area of our on-demand work focuses on detecting Byzantine failures in computations and then allowing graceful recovery from these failures without having to restart a job from the beginning," he says.
"[TRECC is] attacking a difficult problem by looking at process-to-process isolation, fault-resilience, and on-demand with same set of techniques," Papadopoulos concludes. "This sort of applied research is essential as our domain-specific scientific research increasingly relies on information technology to accomplish its goals."