Building a Cloud Computing Platform for New Possibilities
Yousef Khalidi, Microsoft

SEP 11, 2012 18:17 PM
March 2011 (Vol. 44, No. 3) pp. 29-34
0018-9162/11/$26.00 © 2011 IEEE

Published by the IEEE Computer Society

The cloud represents a drastic shift, not only in terms of the new economics it brings to bear, but also in terms of the new possibilities it enables. Realizing the full benefits of this new paradigm requires rethinking the way we build applications.

When cars emerged in the early 20th century, they were initially called horseless carriages, an indication that people defined this new concept in the context of existing technology. Incredibly, automotive engineers even designed whip holders into the first models before realizing they were no longer necessary.
Early designers failed, at first, to understand the possibilities of the new paradigm, such as building for higher speeds or greater safety. When Karl Benz, arguably the inventor of the automobile, attempted to estimate the long-term market opportunity, he concluded there could never be more than 1 million cars because of their high cost and the shortage of capable chauffeurs. 1
But by the 1920s, the number of cars had already reached 8 million, and today more than 600 million are on the road, proving Benz wrong hundreds of times over. What he and other early pioneers failed to realize was that profound reductions in both cost and complexity, plus a dramatic increase in the car's importance to daily life, would overwhelm prior constraints and bring automobiles to the masses.
Today, computing is going through a similar change with the shift from client/server architectures to cloud computing. The cloud promises computing that is not just cheaper but also faster, easier, more flexible, and more effective.
Many formerly impractical things suddenly become very practical. Before cars were widely adopted, for example, it was possible, but impractical, for people to live miles away from the city where they worked. Now employees routinely commute over what would have been unrealistic distances, creating new possibilities for business and economic development outside the confines of the urban landscape.
Similarly, the emergence of cloud computing should have a profound effect on the way people think about the possibilities of computing. When it's possible to consume a platform-as-a-service (PaaS) solution from any datacenter in any location, how will that redefine the computing landscape?
It's helpful, of course, to understand what such a platform would look like.
Design Principles
Building a platform that delivers on cloud computing's full promise requires defining what success looks like. Most people in the industry agree on the cloud's two primary benefits: increasing agility and reducing costs. But as in most engineering challenges, the devil is in the details.
To increase developer agility, a cloud platform should provide a rich set of services, such as storage, queuing, and identity services that simplify and accelerate the programmer's job. A rich set of highly available platform services not only increases programmer productivity but can also result in more efficient sharing of services among different applications.
Another important requirement is to free the programmer from dealing with mundane, low-level issues such as hardware configuration, network management, or operating system (OS) patches. Application deployment should be simple and quick. For example, changing the application structure and the communication pattern between various application components shouldn't require tinkering with the network—the programmer should be free from such details.
Similarly, application managers must be able to respond to load changes, frequent software changes, and configuration updates quickly and reliably. They shouldn't spend their time allocating hardware, booting machines, configuring OS images, or managing low-level network equipment. In traditional IT organizations, these activities are time-consuming and distract from the main goal of managing the application.
As for reducing costs, an essential element is automation, which should include everything from server allocation to OS deployment to application life-cycle management (deploying, restarting from failures, and so on). Human intervention should be kept to a minimum. Anything that can be automated should be automated—including the anticipation and mitigation of software and hardware component failures.
Another cost reduction technique is to use standard server hardware with standard networking and to avoid specialized setups, such as separate storage area networks or expensive high-end systems. Attaining the goal of resource elasticity requires sharing large pools of virtually uniform hardware. Automating the management of these large hardware pools further reduces costs.
Finally, to maximize both agility and cost reduction, the application must be written to scale out horizontally, with a message-passing style of communication between its various components. Capacity is added or removed as remote compute units, instead of adding memory and CPUs to each server.
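As a rough illustration of the scale-out, message-passing style described above, the following Python sketch uses an in-process queue as a stand-in for a platform queuing service and threads as stand-ins for remote compute units. The function name `process_with_workers` and the squaring "job" are assumptions made for the example, not part of any real platform.

```python
import queue
import threading


def process_with_workers(jobs, num_workers):
    """Process jobs with interchangeable workers that share one inbox queue.

    Capacity is changed by changing num_workers, not by making any
    single worker bigger.
    """
    inbox = queue.Queue()     # stand-in for a platform queuing service
    results = queue.Queue()   # workers report back by message, not shared state
    for job in jobs:
        inbox.put(job)

    def worker():
        while True:
            try:
                job = inbox.get_nowait()
            except queue.Empty:
                return  # no more messages, so this worker exits
            results.put((job, job * job))

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    collected = {}
    while not results.empty():
        job, value = results.get()
        collected[job] = value
    return collected
```

Because the workers only exchange messages through the queues, the same jobs produce the same results whether one worker or many are running; only the throughput changes.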
Given these requirements for realizing the cloud's benefits, the fully optimized cloud platform should have the following architectural features:

• a virtualized compute fabric, in which each compute server runs a hypervisor that can allocate resources efficiently and dynamically and isolate them into virtual machines;

• a virtualized network that securely enables collections of virtual machines to be grouped together and isolates unwanted traffic from the rest of the shared hardware pool;

• data, software, and other state at each server node viewable as a cache;

• a scale-out model, in which failures are expected and state can be reconstructed from elsewhere in the network, such as from other nodes running a storage service;

• automation for virtualized compute fabrics, virtualized networks, OS deployment and updates, application deployments, failure handling and application recovery, and application updates;

• a metamodel that lets the platform automate application life-cycle management; and

• rich platform services, accessible through network protocols.

At a minimum, the cloud platform should provide reliable and scalable storage services, including transactional databases, monitoring and logging services, identity services, and metering and billing services.
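The idea that state on a server node is only a cache, and can be reconstructed from a storage service after a failure, can be sketched as follows. This is a minimal sketch under stated assumptions: `DurableStore` is a hypothetical in-memory stand-in for a replicated storage service, and `Node` is a hypothetical application instance; neither corresponds to a real platform API.

```python
class DurableStore:
    """Hypothetical stand-in for a replicated storage service on the network."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data[key]


class Node:
    """Hypothetical instance whose local state is treated purely as a cache."""

    def __init__(self, store):
        self.store = store
        self.cache = {}
        self.store_reads = 0  # counts how often the cache had to be refilled

    def write(self, key, value):
        self.store.put(key, value)  # durable copy first
        self.cache[key] = value     # local copy is only an optimization

    def read(self, key):
        if key not in self.cache:   # cache miss, e.g. on a freshly started node
            self.cache[key] = self.store.get(key)
            self.store_reads += 1
        return self.cache[key]


store = DurableStore()
first = Node(store)
first.write("cart:42", ["book"])

# Simulate losing the first node: a replacement starts with an empty cache
# and rebuilds what it needs from the storage service.
replacement = Node(store)
recovered = replacement.read("cart:42")
```

The replacement node pays one extra read against the store, which matches the expectation that a failure may cost some performance but not correctness.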
Hybrid Solutions
Many organizations have data and applications that are impractical to move to the cloud today for myriad reasons: compliance, legal issues, internal controls, economics, and so on. In some cases, the costs of moving—financial and otherwise—simply overwhelm the benefits. Naturally, over time, the number of things that are technically impractical to move will diminish, but practice always lags behind technology.
At this inflection point, it's important to recognize the necessity of hybrid models, which focus on aspects of applications that can move to the cloud sooner than others. It wouldn't be easy to move an entire ERP system, for example, because it shares sensitive data with other onsite applications. However, it's possible to split an online shopping site so that its Web-facing front end goes to the cloud, but its checkout process triggers a secure connection at the datacenter.
The criteria for deciding which workloads or applications to move to the cloud today fall into the following general categories:

Business case. For a given workload and application, is it more economical to move to the cloud? Will the increased agility and elasticity overcome the cost of moving the application? Will moving to the cloud enable new business scenarios?

Technical readiness. Is the application "cloud ready"? The application must keep its data in a durable replicated store or replicate the data itself. State can be stored locally, but the state held by any one instance is only a cache. The application shouldn't require lengthy installation or configuration, or any human intervention to start. It must scale horizontally by sharing load across multiple instances (scale out), not vertically within an ever-bigger single instance (scale up).

Dependencies. The application must run without special hardware needs; any dependencies on other applications or data must be addressed before moving to the cloud. Communication latency between the application and dependent onsite resources must also be addressed.

Regulation, compliance, and legal considerations, particularly with respect to data. A public cloud might not be able to host some data due to regulatory issues, for example.

Considering and resolving the core technical issues of making the application "cloud ready" are necessary but not sufficient steps. All dependencies and regulatory requirements must also be addressed before moving an application. Many cases might require careful decomposition, keeping some part of the application onsite and the remainder in the cloud.
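One way to make the criteria above concrete is to record them as a simple checklist per application. The sketch below is illustrative only: the `AppProfile` fields and the `readiness_gaps` function are hypothetical names chosen for this example, and the business-case questions are left out because they do not reduce to yes/no flags.

```python
from dataclasses import dataclass


@dataclass
class AppProfile:
    """Hypothetical summary of one application's answers to the criteria."""
    durable_replicated_data: bool
    starts_without_human_help: bool
    scales_out: bool
    needs_special_hardware: bool
    unresolved_dependencies: int
    data_allowed_in_public_cloud: bool


def readiness_gaps(app):
    """Return the criteria the application does not yet meet."""
    checks = [
        (app.durable_replicated_data, "keep data in a durable replicated store"),
        (app.starts_without_human_help, "start without human intervention"),
        (app.scales_out, "scale out across multiple instances"),
        (not app.needs_special_hardware, "remove special hardware needs"),
        (app.unresolved_dependencies == 0, "address onsite dependencies"),
        (app.data_allowed_in_public_cloud, "resolve regulatory limits on data"),
    ]
    return [message for ok, message in checks if not ok]


shop_front_end = AppProfile(True, True, True, False, 0, True)
erp_system = AppProfile(True, False, False, False, 3, False)
```

An application with an empty gap list is a candidate to move as a whole; one with gaps is a candidate for decomposition, with the blocked parts staying onsite.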
Embracing a New Application Model
PaaS solutions provide a complete application development and hosting site delivered as a cloud service. 2
In addition to managing the underlying infrastructure and offering a metered-by-use cost model, PaaS also facilitates application development, testing, deployment, and ongoing maintenance, liberating the customer to focus on managing the application instead of the underlying infrastructure. Customers can write their applications directly to the PaaS platform and then elastically scale out as and when needed. To make this possible, the PaaS programming model imposes some rules on applications:

• A PaaS application must comprise one or more roles. Every application can be broken down into logical divisions; a role is a formalization of these divisions. It includes a specific set of code and defines the environment in which the code runs. The different types of roles can be optimized for specific functions: front end, back end, or otherwise.

• A PaaS application should run two or more instances of each role. These instances are interchangeable, and every role instance runs the exact same code. The expectation is that an instance shouldn't save any data locally—such as session state—that would make it unique.

• A PaaS application provides a description of its topology to the platform, along with its resource requirements. This topology includes the application's roles, the connectivity needs among these roles, and each role's required number of instances. In turn, the platform uses this description to automatically allocate the necessary hardware resources, configure the hardware and software components, monitor the application, and, when necessary, recover failed components.

• A PaaS application behaves correctly when any role instance fails. In case of failure, users aren't affected as they continue to be served through the online instances. Any instance-maintained data or state must be treated only as a cache, and after a failure or a restart, the application must be able to reconstruct the state from elsewhere. Although application performance might be affected, the application will continue to run.

The degree to which these rules are enforced might vary, but PaaS offerings do assume that applications obey all these rules. Adopting the PaaS programming model as well as these rules provides developers with many benefits in the areas of management, availability, and scalability.
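A topology description of the kind these rules call for might look roughly like the following, together with a check of the first two rules. The `Role` and `Topology` classes and the `violations` function are hypothetical and invented for illustration; real PaaS offerings define their own description formats.

```python
from dataclasses import dataclass, field


@dataclass
class Role:
    """Hypothetical role entry: a name, an instance count, and connectivity."""
    name: str
    instances: int
    connects_to: list = field(default_factory=list)


@dataclass
class Topology:
    roles: list


def violations(topology):
    """Return human-readable problems with a topology description."""
    problems = []
    if not topology.roles:
        problems.append("application must comprise one or more roles")
    names = {role.name for role in topology.roles}
    for role in topology.roles:
        if role.instances < 2:
            problems.append(f"role {role.name!r} should run two or more instances")
        for target in role.connects_to:
            if target not in names:
                problems.append(
                    f"role {role.name!r} connects to unknown role {target!r}"
                )
    return problems


shop = Topology(roles=[
    Role("frontend", instances=3, connects_to=["backend"]),
    Role("backend", instances=1),
])
```

Here the single backend instance is flagged, since losing it would take the role offline until the platform restarts it.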
In PaaS offerings, the platform itself automatically handles most administrative tasks, such as applying OS patches, installing new versions of system software, and configuring the network. The goal here is to reduce the effort involved in administering the application environment.
PaaS programming models help improve application availability in several ways:

Protect against hardware failures. Because every application comprises multiple instances of each role, hardware failures—a disk crash, network fault, or server death—won't take down the application. To help, different instances of the same application are placed not randomly but in different fault domains. A fault domain is a set of hardware components—computers, switches, and more—that share a single point of failure. Since the platform understands the application's topology, it can effectively allocate the application across fault domains. The application might temporarily lose some instances, but it will continue to behave correctly.

Protect against software failures. The system monitoring and management provided in PaaS also detects failures caused by software. The application can cooperate by monitoring and reporting internal system health status. If the code in an instance crashes or the virtual machine in which it's running goes down, the system will restart the role. The system will also automatically reconfigure the network appropriately and optionally inform the rest of the application of the change. Although any work the instance was doing when it failed will be lost, the new instance will become part of the application as soon as it starts running.

Update applications with no application downtime. Whether for routine maintenance, to install a new version, or even to revert to an older version, every application must be updated. An application built with a PaaS programming model can be updated while it's running. To allow this, different instances for each of an application's roles are placed in different update domains. When a new version of the application needs to be deployed, the platform can shut down the instances in one update domain, update the code, and then create new instances from that new code.

Update the OS and other supporting software with no application downtime. The platform assumes that every application follows the four application rules, so it knows that it can shut down some of an application's instances when required, update the underlying system software, and then start new instances. By doing this in chunks, never shutting down all of a role's instances at the same time, the OS and other software can be updated beneath a continuously running application.
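The fault-domain and update-domain ideas above can be sketched in a few lines. This is a simplified model that assumes round-robin placement and identifies instances by integer index; the function names `spread`, `survivors`, and `rolling_update` are hypothetical, and a real platform's placement policy may differ.

```python
def spread(num_instances, num_domains):
    """Place instance i in domain i % num_domains (round-robin)."""
    domains = [[] for _ in range(num_domains)]
    for i in range(num_instances):
        domains[i % num_domains].append(i)
    return domains


def survivors(fault_domains, failed_domain):
    """Instances still running after one whole fault domain is lost."""
    return [
        instance
        for index, members in enumerate(fault_domains)
        if index != failed_domain
        for instance in members
    ]


def rolling_update(update_domains, versions, new_version):
    """Update one update domain at a time, modifying versions in place.

    Returns the smallest number of instances left serving at any step.
    """
    min_serving = len(versions)
    for members in update_domains:
        # Instances in this domain are shut down while they are updated.
        min_serving = min(min_serving, len(versions) - len(members))
        for instance in members:
            versions[instance] = new_version
    return min_serving


domains = spread(4, 2)                 # four instances over two domains
still_up = survivors(domains, 0)       # domain 0 suffers a hardware failure
versions = ["v1"] * 4
lowest = rolling_update(domains, versions, "v2")
```

With four instances over two domains, a single domain failure or a single update step leaves two instances serving, which is why the rule asks for two or more instances per role.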

Whether it's planned or not, today's applications usually have downtime for OS patching, application upgrades, hardware failures, and other reasons. PaaS programming models are designed to keep applications continuously available, even in the face of software upgrades and hardware failures.
Applications designed to handle multiple users concurrently are a natural fit for the cloud. A limitation of traditional programming models is that they weren't explicitly designed to support Internet-scale applications. Developers can use PaaS programming models to build the scalable applications that massive cloud datacenters can support. Just as important, these programming models also let applications scale down when necessary, using just the resources they need.
PaaS programming models help developers build more scalable applications in three main ways:

• Automatically creating and maintaining a specified number of role instances. This makes application scalability quite straightforward, especially since cloud platforms run in very large, uniform datacenters where getting the level of scalability an application needs isn't generally a problem.

• Providing a way to modify the number of executing role instances for a running application by offering a Web portal or an API to allow changing the desired number of instances for each role while an application is running.

• Providing highly available and scalable platform services, such as storage, relational database management systems (RDBMSs), caching, and networking services for applications.

These services free the developer from having to implement or manage complicated platform services. For example, to use a database in a traditional system, a database administrator must install RDBMS instances, configure the instances for high availability and backup, and monitor the database for any errors. In a PaaS platform, the RDBMS is provided as a service: the developer needs only to declare the intention of using a database table, and the platform takes care of the rest.
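The first two mechanisms in the list above, maintaining a specified number of role instances and changing that number while the application runs, amount to a reconciliation loop between a desired count and the instances actually running. The sketch below assumes a hypothetical `RoleController` class with made-up instance names; it is not any platform's real API.

```python
import itertools


class RoleController:
    """Hypothetical controller that keeps a role at its desired instance count."""

    def __init__(self, role, desired):
        self.role = role
        self.desired = desired          # changed via a portal or API call
        self.running = []
        self._ids = itertools.count(1)  # source of fresh instance names

    def reconcile(self):
        """Start or stop instances until running matches desired."""
        while len(self.running) < self.desired:
            self.running.append(f"{self.role}-{next(self._ids)}")
        while len(self.running) > self.desired:
            self.running.pop()
        return list(self.running)


web = RoleController("web", desired=2)
initial = web.reconcile()

web.desired = 4                  # operator scales out while the app runs
scaled_out = web.reconcile()

web.running.remove("web-1")      # simulate an instance failure
healed = web.reconcile()         # platform replaces the lost instance

web.desired = 1                  # scale down to use only what is needed
scaled_down = web.reconcile()
```

The same loop covers scale-out, scale-down, and recovery, because each case is only a difference between the desired and actual counts.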
In addition to the many benefits developers enjoy when PaaS abstracts away the infrastructure layer, the new model also delivers many benefits at the platform layer—namely, in the form of a rich set of services that cloud applications can simply call up. With these capabilities built into the platform, developers need not implement these services themselves, and operators need not run instances of them: the platform can share resources underneath these services, maximizing hardware utilization and driving down costs.
We're in the cloud's early days. Realizing the full benefits of this new paradigm will require rethinking the way we build applications. Because it will take time for practice to catch up with technology, in the meantime we must have a disciplined and thoughtful approach to embracing hybrid models. But ultimately, this shift opens up a new dimension of possibilities in much the same way as the wide adoption of automobiles fundamentally altered 20th-century life.
If we refuse to define this new paradigm in the context of existing technology—when we break free of the whip-holder mentality—businesses and IT departments can shift their focus from managing the technology to innovating, seizing new opportunities, and imagining new scenarios that will move their organizations forward in unprecedented ways.


Yousef A. Khalidi is a Distinguished Engineer on the Windows Azure team at Microsoft. Khalidi received a PhD in computer science from Georgia Tech and is the holder of 32 patents. Contact him at

