by Pete Mastin
So goes the old joke: “The great thing about standards is everybody’s got one.” There is some truth to this: one man’s best practices are another man’s legacy architecture that needs to be deprecated.
IT seems to go in large cycles. Mainframes gave way to client-server, which gave way to the Web, which has now given way to the cloud. With the current rush into the cloud, we are seeing some new (or old) issues arise, and some new emerging standards (or best practices) to resolve these new issues.
Guest article by Pete Mastin, Product Evangelist, Cedexis
First, we need to understand some of the emerging best practices for cloud deployments. These come from personal experience with hundreds of clients improving site performance by multi-homing cloud infrastructure. What are the five irrefutable truths of public cloud infrastructure adoption? While these principles continue to evolve, they have solidified to the point where they are not negotiable:
- Public infrastructure fails and underperforms at times, just like private infrastructure does. The only way to provide 100% availability for the enterprise is to multi-home in an active-active configuration. That is the responsibility of the enterprise.
- Cloud-based technology must be deployed across multiple geo-locations (regions) to maximize uptime in case of “acts of God” such as hurricanes or earthquakes, or “acts of man” such as a backhoe cutting network lines. So, for example, if your user base is primarily North America, you should have a cloud instance on both coasts at a minimum. More is typically better.
- A multiple vendor approach to public infrastructure dramatically reduces the chances of global outages. Unsurprisingly, vendor-specific outages are more common than ‘acts of God.’ Vendor diversity also helps when negotiating contracts with Cloud providers.
- Use of Content Delivery Networks can dramatically improve performance of web/mobile apps. Principle 1 above applies to this piece of infrastructure as well.
- Monitoring every element of a multi-vendor, multi-homed, active-active Web app is critical to maintaining its availability and performance goals.
These five best practices, if followed, will provide Web and mobile applications with far better uptime. This has been demonstrated repeatedly over a number of years. When Hurricane Sandy hit the Northeast United States a few years back, many sites that were singly homed in a data center or cloud just stopped. Many others did not, and those were multi-homed.
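To put rough numbers on the uptime argument, here is a minimal sketch of the arithmetic. It assumes each site fails independently and has the same availability; the 99.9% figure and the function name `combined_availability` are illustrative assumptions, not measurements from any provider. Independence is optimistic for instances of the same vendor, which is part of why vendor diversity matters.

```python
HOURS_PER_YEAR = 24 * 365


def combined_availability(per_site: float, sites: int) -> float:
    """Probability that at least one of `sites` independent sites is up.

    Assumes failures are independent and every site has the same
    availability `per_site` (a fraction between 0.0 and 1.0).
    """
    return 1.0 - (1.0 - per_site) ** sites


# Hypothetical per-site availability of 99.9%, for illustration only.
for n in (1, 2, 3):
    a = combined_availability(0.999, n)
    downtime_hours = (1.0 - a) * HOURS_PER_YEAR
    print(f"{n} site(s): availability {a:.7f}, about {downtime_hours:.5f} hours down per year")
```

Under these assumptions, each additional active site multiplies the expected downtime by the per-site failure probability. Real outages are often correlated, so treat the result as an upper bound on the benefit.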
Qualitative analysis of enterprise cloud deployment
It behooves us to review types of cloud deployments to determine the patterns seen in the marketplace. There are as many varieties of cloud deployments as there are cloud architects. No two are exactly alike.
However, there are some commonalities or broad categorizations that can be made about different types of deployments. Generally, the use of public clouds breaks into four groups:
- Single Vendor / Single Instance
- Single Vendor / Multi-Instance
- Multi-Vendor / Multi-Instance
- Hybrid Cloud
First, let’s explain what we mean by these four categories, so that it is clear what we are comparing and contrasting:
- Single Vendor / Single Instance – Where most enterprises start in their move into the cloud. This refers to selecting a single cloud instance (e.g. AWS East Coast, SoftLayer Houston, or Rackspace Chicago) and deploying your services there alone. These services exist in a single data center on virtualized servers.
- Single Vendor / Multi-Instance – A typical next step for many enterprises, who start to realize the performance penalty that their users pay from other parts of the world, and take steps to rectify that by pushing a portion of their services closer to their user base. Or they have an outage (micro or major) and mitigate risk by deploying a second or third cloud instance of the same provider they already use. A simple example here is the enterprise that is deployed on AWS East Coast, and then decides (based on performance complaints) to deploy similar/same services at AWS Oregon, Frankfurt, and Tokyo.
- Multi-Vendor / Multi-Instance – Enterprises with a more mature set of practices around vendor management and cost control often are found using this category of service. This is basically the same model as the previous one, but rather than deploy similar/same services in other geographies, you’d deploy these services using alternative vendors. An example is the enterprise that starts on AWS East Coast, but then for its west coast services, chooses SoftLayer San Jose, and for its APAC cluster, perhaps Azure Asia East. The main advantage to this model (besides vendor management and cost control) is avoiding vendor-related outages.
- Hybrid Cloud – Commonly seen in companies that have invested heavily in private data centers and want to get the most out of that sunk capital expenditure. In this case, portions of the traffic continue to flow to private data centers, and the cloud takes the rest. Some services also demand bare-metal servers, and to accommodate this while also taking advantage of the scalable capacity of the cloud, companies will adopt this model.
Clearly, the models toward the end of this list are more conducive to maintaining 100% uptime with the best performance characteristics.
Digging into the performance data: A day in the life of four clouds
Clearly, having an active-active, multi-vendor, multi-cloud architecture improves availability. When cloud vendors have outages in one or more of their regional locations, traffic can be diverted to clouds that are still available. But what about performance? Can performance issues be mitigated by using performance-based global traffic management to route traffic to the best-performing clouds?
The Real User Measurements (RUM) data says yes. RUM allows companies to measure every cloud instance from every ISP and network in the world. But it is helpful to use real examples. For what we will discuss below, I took data from four regional clouds in the US:
- AWS EC2 in Virginia
- CenturyLink WA1 in Seattle
- Rackspace Cloud ORD in Chicago
The data was a randomly selected 24-hour period on Aug 3rd and Aug 4th 2015. These various clouds were selected as good examples of cloud instances from different parts of the US. I narrowed my dataset to the US, and six of the larger ISPs:
I also restricted the measurements to six non-covering regions within the US. These are Northeast, Southeast, Midwest, South Central, Southwest, and Pacific Northwest. I did this to be able to broadly see how the various clouds perform from both different regions of the country, and from different networks that exist in those regions.
What does a map look like that shows performance of these four clouds from these six regions across these six ISPs? Just looking at the regions and clouds, we get this (the numbers are latency, so lower is better):
So you can see that from a regional perspective, there is a pretty wide range of latency for the various clouds.
How can we view this data while adding in the ISPs? Let’s focus on the Northeast region to see what this looks like:
There are a couple of things to point out here. First, this is over a specific 24-hour period and if you took these measurements 24 hours later, you would see something different. The numbers do not stay constant, but rather change with the ebb and flow of Internet congestion.
Second, this is taking median latency measurements at the 75th percentile. The coloration of the charts is relative, meaning that given any set of latency numbers, the lowest numbers get a green color and the highest numbers get a red one.
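For readers who want to reproduce this kind of chart from their own measurements, here is a minimal sketch. It assumes the reported value is the 75th percentile of the raw latency samples and that colors are assigned relative to the lowest and highest values in the set. The sample values, the function names, and the intermediate "yellow" band are assumptions for illustration; they are not taken from the charts discussed here.

```python
import statistics


def p75(samples):
    """75th percentile of latency samples (needs at least two samples).

    statistics.quantiles with n=4 returns the three quartile cut points;
    index 2 is the 75th percentile.
    """
    return statistics.quantiles(samples, n=4, method="inclusive")[2]


def relative_colors(latency_by_cloud):
    """Color each cloud relative to the lowest and highest latency in the set."""
    lo = min(latency_by_cloud.values())
    hi = max(latency_by_cloud.values())
    span = hi - lo
    colors = {}
    for cloud, ms in latency_by_cloud.items():
        # Position of this value between the best (0.0) and worst (1.0).
        t = 0.0 if span == 0 else (ms - lo) / span
        colors[cloud] = "green" if t < 1 / 3 else "yellow" if t < 2 / 3 else "red"
    return colors


# Hypothetical raw samples in milliseconds, for illustration only.
raw = {
    "cloud-a": [40, 45, 50, 55, 60],
    "cloud-b": [60, 70, 75, 80, 90],
    "cloud-c": [90, 95, 100, 105, 120],
}
summary = {cloud: p75(samples) for cloud, samples in raw.items()}
print(summary)
print(relative_colors(summary))
```

Because the colors are relative, the same latency can be green in one chart and red in another, so the numbers matter more than the shading when comparing across regions.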
Generally, you can see that from the Northeast region, the clouds in Virginia and Chicago have lower latency measurements across these networks than the clouds in Seattle and Houston. This can partially be explained by the geography (speed of light), but you can also see that there are notable differences within the set of ISPs. For instance, AT&T has latency from the Northeast of 103ms to Chicago (Rackspace), while Verizon comes in at almost half that number to the same cloud from the same region.
Let’s look at another region to see some of the differences, using the example of the Pacific Northwest:
So here, you can see that from the Pacific Northwest region, generally CenturyLink’s Seattle cloud has the lowest latency from most of the ISPs, with the notable exception of Cox. For some reason on this day, Cox had poor connectivity to CenturyLink’s cloud from within the region. These types of peering problems are common for all the ISPs, and can pop in and out of existence. Verizon likewise had fairly significant latency from this region to this cloud. In fact, let’s drill in further on Verizon’s connectivity to all four clouds from this region to make a point:
What you see here are measurements taken from Verizon in the Pacific Northwest to the four clouds under discussion. As you can see, latency to the CenturyLink cloud for a user in the Pacific Northwest was 99 milliseconds, while that same user could have reached any of the other three clouds faster. Optimally, the user could have been directed to the Rackspace cloud in Chicago and saved almost 30 milliseconds. This is counter-intuitive in some ways, but we see it play out every day. Peering and network congestion override geography on a daily basis on the Internet.
Let’s look at one final example of this. Take the Southeast region, same time period, and same clouds and ISPs:
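The routing decision described above can be sketched in a few lines. In this sketch only the 99 ms CenturyLink figure comes from the measurements discussed here. The other three values are hypothetical placeholders, chosen only to be consistent with the text (all faster than Seattle, with Chicago almost 30 ms faster), and the function name `best_cloud` is an assumption.

```python
# Latency in milliseconds from Verizon users in the Pacific Northwest.
latency_ms = {
    "CenturyLink WA1 (Seattle)": 99,  # from the article
    "Rackspace ORD (Chicago)": 70,    # hypothetical: "almost 30 ms" faster
    "AWS EC2 (Virginia)": 85,         # hypothetical
    "Houston cloud": 90,              # hypothetical
}


def best_cloud(latencies):
    """Return the name of the cloud with the lowest measured latency."""
    return min(latencies, key=latencies.get)


nearest = "CenturyLink WA1 (Seattle)"  # what a geography-based rule would pick
chosen = best_cloud(latency_ms)
saving = latency_ms[nearest] - latency_ms[chosen]
print(f"Geography picks {nearest}; measurements pick {chosen}, saving {saving} ms")
```

The point of the sketch is that the decision is keyed on measured latency for a specific region and network, not on distance.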
So broadly, you can see that from the Southeast, the CenturyLink cloud is not a great performer (it’s across the country) and that the clouds in Virginia and Houston are generally the two best. But you can also see that the networks have a huge amount of variability.
So looking at just AT&T from this region and to the four clouds, we see:
We can see that AT&T in the Southeast has the expected poor performance to the cloud in the Pacific Northwest. But the best performer is not the expected Virginia cloud, but rather Houston, which is much farther from the southeastern population centers. This is another example of geography not being a good proxy for performance.
Users on the AT&T network who got routed to Virginia would experience significantly longer page load times, or more buffering if they were trying to watch a video from that cloud. The result is even worse if they were routed to Chicago or Seattle. It is critical for optimal Web performance to take networks and performance into consideration when deciding what cloud to route your users to.
The Value of Real User Measurements
As you can see, doing cloud right involves distributing your compute across geographies and across vendors. However, ensuring performance across distributed infrastructure takes something more: having billions of measurements to all the public providers, and being able to route in real-time around network, provider, or peering issues. This is the value of RUM.
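As a closing illustration, here is a minimal sketch of how availability and latency might be combined in one routing decision. It is not a description of any particular traffic management product. The `CloudStats` type, the `choose_cloud` function, the 0.9 availability threshold, and all sample values are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class CloudStats:
    name: str
    latency_ms: float    # recent latency for one (region, ISP) pair
    availability: float  # fraction of recent probes that succeeded, 0.0 to 1.0


def choose_cloud(candidates, min_availability=0.9):
    """Pick the fastest healthy cloud; fall back to the most available one."""
    if not candidates:
        raise ValueError("no candidate clouds to route to")
    healthy = [c for c in candidates if c.availability >= min_availability]
    if healthy:
        # Normal case: route on performance among clouds that are up.
        return min(healthy, key=lambda c: c.latency_ms)
    # Degraded case: nothing meets the threshold, so prefer reachability.
    return max(candidates, key=lambda c: c.availability)


# Hypothetical measurements: the nearest cloud is having an outage.
candidates = [
    CloudStats("nearest", latency_ms=40, availability=0.50),
    CloudStats("second", latency_ms=60, availability=0.95),
    CloudStats("far", latency_ms=80, availability=0.99),
]
print(choose_cloud(candidates).name)
```

Filtering on availability first and then sorting on latency keeps an outage from being masked by a low latency number collected before the failure.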