High Performance Computing (HPC) in the Cloud
Guest Editor's Introduction • Dejan Milojicic, HP Labs • September 2012
High-performance computing is no longer limited to those who own supercomputers. HPC's democratization has been driven particularly by cloud computing, which has given scientists access to supercomputing-like features at a cost of a few dollars per hour. The four articles I've selected for this month's Computing Now theme highlight some current work in academia and industry, examining key benefits, challenges, and enablers of HPC in the cloud.
Benefits of HPC in the Cloud
Interest in HPC in the cloud has been growing over the past few years. The cloud offers applications a range of benefits, including elasticity, small startup and maintenance costs, and economies of scale. Yet, compared to traditional HPC systems such as supercomputers, some of the cloud's primary benefits for HPC arise from its virtualization flexibility. In contrast to supercomputers' strictly preserved system software, the cloud lets scientists build their own virtual machines and configure them to suit their needs and preferences. In general, the cloud is still considered an addition to traditional supercomputers: a bursting solution for cases in which internal resources are overused, especially for small-scale experiments, testing, and initial research. Clouds are convenient for embarrassingly parallel applications (those whose partitions communicate very little with one another), which can scale even on the commodity interconnects common to contemporary clouds.
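To make the notion of an embarrassingly parallel application concrete, here is a minimal Python sketch. It is an illustration only and does not come from any of the articles discussed here: the function names (`estimate_pi`, `run_sweep`, `combine`), the Monte Carlo workload, and the worker count are all assumptions chosen for the example. Each task depends only on its own seed, so no data moves between tasks while they run.

```python
import random
from concurrent.futures import ProcessPoolExecutor


def estimate_pi(seed, samples=100_000):
    """One independent task: a Monte Carlo estimate of pi from its own seed."""
    rng = random.Random(seed)  # private generator, no shared state
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / samples


def run_sweep(seeds, workers=4):
    """Run each task in a worker process; tasks never exchange data."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(estimate_pi, seeds))


def combine(estimates):
    """The only step that needs all results: one average at the very end."""
    return sum(estimates) / len(estimates)
```

On a single machine, calling `combine(run_sweep(range(8)))` under an `if __name__ == "__main__":` guard would spread the eight tasks over local processes. The same structure maps onto separate cloud instances, because the only communication is the final collection of results, which is why a slow interconnect matters little for this class of application.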
Challenges for HPC in the Cloud
Broader use of HPC in the cloud also presents some key challenges. Primary among them is the lack of high-speed interconnects and noise-free operating systems that would enable tightly coupled HPC applications to scale. Today, at best, the cloud offers 10-Gigabit Ethernet, whereas supercomputers commonly use InfiniBand and proprietary interconnects. New optimized virtualization models (thin VMs, containers, and so on) are reducing virtualization costs, but they're still not noise-free. Other challenges include
- the costing/pricing model, which is still evolving from the traditional supercomputing approach of grants and quotas toward the pay-as-you-go model typical of cloud-based services;
- the submission model, which is evolving from job queuing and reservations toward VM deployment;
- the movement of data into and out of the cloud, which is costly and results in data lock-in; and
- security, regulatory compliance, and various "-ilities" (performance, availability, business continuity, service-level agreements, and so on).
Enabling the Future of HPC in the Cloud
The adoption rate of HPC in the cloud remains unclear, but in the long term, several technology developments can directly address the challenges. The first opportunity is in networking: optical networking will improve hardware manageability and interconnect performance while also reducing power consumption, and software-defined networks will further improve the manageability of cloud networking. The second is in nonvolatile memory, which will improve checkpointing performance and, in the long term, help address the data deluge and enable new programming models that leverage nonvolatility. Both technologies will create disruptive improvements and increase HPC's adoption in the cloud. As the scientific community moves toward exascale computing, the evolution of new asynchronous programming models and an emphasis on redesigning algorithms can make the cloud even more suitable for HPC.
Numerous IEEE Computer Society magazines, transactions, and conferences are publishing articles on cloud computing. For Computing Now's September 2012 theme, I selected four of these articles and provided links to seven others in the IEEE CS Digital Library.
John Rehr and his colleagues conducted some early experiments on the viability of using commercial clouds for scientific computing. In "Scientific Computing in the Cloud," they found that Amazon's cloud is suitable for applications that don't require tightly coupled execution (and hence higher-quality interconnects than typically offered in clouds), both for serial execution and scalable parallel execution. Among the major roadblocks they discovered were a lack of convenience in deploying and using the cloud for HPC (including libraries, 64-bit operating systems, and HPC compilers) and the inability to easily deploy different-sized clusters. In general, they call this an "alien" environment for HPC users. The authors go on to demonstrate a set of tools they developed to make the cloud easier to use, including scripts, special-control Amazon Machine Images (AMIs) from which the user can launch clusters, security support, front-end support for managing clusters (Elasticfox), and so on. Despite some of the limitations, Rehr and his colleagues found cloud computing to be suitable for executing certain smaller applications with acceptable performance at a limited price.
In "Building a Cloud Computing Platform for New Possibilities," Yousef Khalidi describes the design principles and application model of Microsoft's Azure platform as a service (PaaS). HPC applications represent one of the five emerging adoption scenarios for Azure, particularly for use in the financial services, life sciences, and manufacturing industries. In support of HPC, Azure abstracts away the hardware notions of infrastructure as a service (IaaS) and exposes the platform to those who care about the science rather than the computation. In addition, support for different roles, regulatory compliance, availability, scalability, and maintenance (automatic upgrades) helps scientists develop, execute, and maintain their HPC applications as services.
In "Massively Parallel Fluid Simulations on Amazon's HPC Cloud," Peter Zaspel and Michael Griebel analyze the scaling of a computational fluid dynamics (CFD) simulation on the Amazon cloud. The authors identify acceptable CFD scaling in the range of typical industrial engineering applications, thus showing that clouds are already a competitive, cost-effective alternative to mid-size supercomputers. More specifically, the authors demonstrated good-to-acceptable speed-up, with more than 70 percent parallel efficiency for up to 8 CPU instances and more than 50 percent for up to 32 instances (256 CPU cores). They also observed good-to-acceptable strong scaling for up to 4 instances (8 GPUs) and good weak-scaling efficiency (75 percent) on 16 GPUs.
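The efficiency figures quoted above rest on the standard definitions of speed-up and scaling efficiency, which the following Python sketch spells out. It is an illustration under stated assumptions, not the authors' measurement code: the function names are hypothetical, and any runtimes passed in would be the reader's own, since the article's raw timings are not reproduced here.

```python
def speedup(t_base, t_n):
    """How many times faster the larger run is than the baseline run."""
    return t_base / t_n


def strong_scaling_efficiency(t_base, t_n, n_base, n):
    """Fixed total problem size: the ideal runtime shrinks by n / n_base.

    Returns the fraction of that ideal speed-up actually achieved.
    """
    return (t_base * n_base) / (t_n * n)


def weak_scaling_efficiency(t_base, t_n):
    """Problem size grows with the resources: the ideal runtime stays constant."""
    return t_base / t_n
```

For example, a hypothetical run that takes 100 seconds on 1 instance and 6.25 seconds on 32 instances has a speed-up of 16 and a strong-scaling efficiency of 0.5, which is the 50 percent threshold mentioned above. A weak-scaling efficiency of 0.75 means the enlarged problem on the enlarged machine takes one-third longer than the baseline.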
Finally, Oliver Niehörster and his colleagues' "Enforcing SLAs in Scientific Clouds" presents a simple, easy-to-use tool designed to abstract away the cloud environment. The tool is focused on research problems rather than the computational resources and management details of the application or the underlying cloud infrastructure. In addition, it automates VM size configuration, with the goal of optimizing the mapping of VMs to hosts. The authors leverage an autonomous agent-based approach for SLAs, evaluating initial application deployment, estimated cost, and subsequent SLA fulfillment. In addition, they account for the noise resulting from third-party VMs in the cloud. Their tool seeks to enable efficient execution of mid-size parallel scientific applications.
We've learned three things from these articles: that the cloud is a viable platform for medium-scale HPC applications that aren't tightly coupled; that interconnects remain the major obstacle to cloud computing's broad adoption for larger-scale, more tightly coupled HPC applications; and that the ease of use of HPC applications in the cloud needs to be addressed at all layers of the cloud (infrastructure, platform, and software as a service). The cloud is unlikely to ever replace supercomputers, but it is an increasingly intriguing platform for HPC applications, and the adoption rate is improving in favor of the cloud. The situation is very similar to that in finance, where large organizations that can afford mainframes continue to use them, while smaller institutions increasingly use clouds as a cost-effective means of achieving similar goals.
Dejan Milojicic is a senior research manager and scientist at HP Labs, and an IEEE Fellow. He is editor in chief of Computing Now and chair of the IEEE CS Special Technical Communities. Contact him at firstname.lastname@example.org