Everything Fails Sometime
Control Group designs cloud-based solutions with the philosophy that every system fails at some point. Embrace this chaos and build for the rainy day. Today we are seeing some major outages in Amazon’s us-east-1 region. Reddit and Quora are two of the high-profile victims, but the outage is affecting everyone in a very popular data center.
You can design around regional performance degradation though. Years ago, having global traffic management in place was an expensive pipe dream. Today you can easily turn up another EC2 region and use a service like Dynect or Akamai GTM to provide failover and/or load balancing. Even better, consider making your systems portable so you can have multiple cloud providers and maintain your machines and applications with Puppet.
Three to five years ago, implementing two data centers would have taken a year of planning, purchasing, and hands-on labor. Earlier this year we were able to create two data centers with complex infrastructure on EC2, with active/active load balancing, in under two months and for a fraction of the cost.
The Public/Private Debate
I thought Phil Wainewright’s most recent article on private clouds (as well as the first in the series) offered an interesting perspective. I share many of the sentiments, but can’t go quite so far as to say the idea of the private cloud is discredited. In the end it depends on the business and on the applications you are hosting. Even Adrian Cockcroft, the writer of the blog that Phil cites as final proof, has updated his post to say:
“…to clarify, that doesn’t mean that I’m against private clouds or don’t think they exist, because $, FUD and internal politics are a fact of life that constrain what can be done.”
Private clouds, whether hosted or self-hosted, can be useful as stepping stones for organizations whose existing applications may not fit into the public cloud architecture. Some of these applications may require better performance SLAs. Also, private cloud providers are more amenable to custom arrangements. Try hosting a specialized device like an IPS or IDS in a public cloud, where all traffic is guaranteed to be delivered only to the target device. Many enterprise organizations have decades of IT security policies that won’t, and perhaps shouldn’t, simply go away in favor of adopting a public cloud. Public clouds are secure solutions, but some organizations will have additional requirements, like the ability to discern rogue traffic patterns from typical spikes in demand. You can build this into the individual instances and applications, but that isn’t what most companies have done.
We tend to work with a client to find out what their requirements are and stay away from radical statements. The cloud, public or private, is just one more tool and can’t be seen as a solution in and of itself.
Adventures with Enterprise Firewalls, Elastic IPs and Auto Scaling
One distinction between our startup and enterprise clients is that enterprise typically brings the baggage of legacy systems. While a startup is designing for a cloud architecture, a company that has a technology history sometimes needs to integrate new systems with existing services.
In a recent engagement, Control Group needed to work with a client to have application instances on EC2 communicate with a secured web service in a traditional data center. Typically we would work with a client to move this service to EC2. In this case, because the service is considered shared infrastructure that is used and funded by existing applications, we needed to design the infrastructure and application to make a call back to a traditional data center.
On a side note, mixed infrastructure approaches are not ideal, but they are common when migrating complex organizations to IaaS solutions. Most mature IT organizations will shy away from forklifting a company’s technology platforms wholesale into the cloud. The larger the migration, the bigger the bang when something is overlooked. Change too much in an environment and you won’t know where the problems are coming from, so a major part of moving an enterprise customer to the cloud is planning the roadmap of the migration carefully and not being greedy.
One of the technical challenges in this particular project was that the service we were integrating with requires traffic to originate from a known and registered IP address. Although EC2 will provide an instance with a public IP address, there is no way to know what that address will be ahead of time. We decided to use Elastic IP (EIP) addresses to solve this problem. An EIP functions like a NAT on a traditional firewall. You can allocate the EIP and then associate it with an instance as needed.
EIPs worked well until we implemented auto-scaling. Auto-scaling groups have no support for associating a pre-allocated EIP with an instance. To implement this we created some scripts that would make the API calls to determine a free EIP and associate it with the instance. (This means that the instance will have temporary access to execute API commands. We’ve designed a fairly secure take on temporarily providing AWS API tools to an instance, but that is a different blog post. Coming soon.)
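The lookup-and-associate step can be sketched as follows. This is a minimal illustration, not the scripts we actually used: it assumes `ec2` behaves like a boto3 EC2 client (`describe_addresses` and `associate_address`), that addresses are identified by `AllocationId`, and that an address in use carries an `AssociationId` or `InstanceId` in the response. The function name `associate_free_eip` is hypothetical.

```python
def associate_free_eip(ec2, instance_id):
    """Associate the first unassociated Elastic IP with instance_id.

    `ec2` is assumed to behave like a boto3 EC2 client. Returns the public
    IP that was associated, or None if every allocated address is in use.
    """
    addresses = ec2.describe_addresses()["Addresses"]
    for address in addresses:
        # Assumption: an address that is in use reports an association
        # or an instance id; a free address reports neither.
        if address.get("AssociationId") or address.get("InstanceId"):
            continue
        ec2.associate_address(
            AllocationId=address["AllocationId"],
            InstanceId=instance_id,
        )
        return address["PublicIp"]
    return None
```

Note that the lookup and the association are two separate calls, so nothing stops another instance from picking the same address in between.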
Here is the real problem with the approach. The script to associate the EIP worked perfectly, so long as multiple machines weren’t executing it at once. The problem is that the Elastic IP API commands do not support a transactional assignment. Worse yet, at least in our use case, it is the last instance to request the EIP, not the first, that ends up associated with it. This is a major problem if you want to associate EIPs with members of an auto-scaling group that needs to scale up by more than one instance at a time. It can leave you with members of the group that have no Elastic IP at all.
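The failure mode can be reproduced without touching AWS. The sketch below uses a hypothetical in-memory `FakeEipApi` (an illustration, not the real AWS client) whose `associate` call silently reassigns an address, which matches the last-caller-wins behavior described above. Both instances look up a free address before either has associated one.

```python
class FakeEipApi:
    """Hypothetical in-memory stand-in for the Elastic IP API."""

    def __init__(self, addresses):
        # Maps each address to the instance that currently holds it.
        self.owner = {ip: None for ip in addresses}

    def free_addresses(self):
        return [ip for ip, inst in self.owner.items() if inst is None]

    def associate(self, ip, instance_id):
        # Non-transactional: no check that the address is still free,
        # so the last caller silently takes it over.
        self.owner[ip] = instance_id


def simulate_simultaneous_launch(api, instance_ids):
    """Run the check-then-associate script on several instances at once."""
    # Step 1: every instance picks a free address before any has associated.
    choices = {inst: api.free_addresses()[0] for inst in instance_ids}
    # Step 2: each instance associates the address it picked.
    for inst, ip in choices.items():
        api.associate(ip, inst)
    # Report which addresses each instance actually ended up holding.
    return {
        inst: [ip for ip, holder in api.owner.items() if holder == inst]
        for inst in instance_ids
    }
```

With two addresses and two instances launching together, both pick the first address, the second instance wins it, the first instance is left with nothing, and the second address stays unused.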
There are myriad ways to tackle this issue. We considered options for programmatically brokering the IPs by building an application that would manage the EIP resources. The application would provide an IP on request and then return IPs that were no longer in use back into the system through a background recovery process. Such a service is pretty easy to write, but it wasn’t in scope for the current project. Also, there are longer-term solutions that we can consider with the launch of the new and improved VPC with NATing.
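The core of such a broker might look like the sketch below. It is an illustration under stated assumptions rather than something we built: the class name `EipBroker` and its methods are hypothetical, the pool lives in memory in a single process, and the actual AWS association call is left out. The point is that handing out an address happens under one lock, so two simultaneous requests can never receive the same IP.

```python
import threading


class EipBroker:
    """Hypothetical in-process broker that leases EIPs to instances."""

    def __init__(self, addresses):
        self._lock = threading.Lock()
        self._free = list(addresses)
        self._leases = {}  # instance id -> leased address

    def acquire(self, instance_id):
        """Lease an address to instance_id, or return None if none is free."""
        with self._lock:
            if instance_id in self._leases:
                # Repeated requests from the same instance get the same IP.
                return self._leases[instance_id]
            if not self._free:
                return None
            ip = self._free.pop(0)
            self._leases[instance_id] = ip
            return ip

    def recover(self, live_instance_ids):
        """Return addresses held by terminated instances to the pool.

        Meant to be called periodically by a background process. Returns
        the number of addresses recovered.
        """
        live = set(live_instance_ids)
        with self._lock:
            stale = [inst for inst in self._leases if inst not in live]
            for inst in stale:
                self._free.append(self._leases.pop(inst))
            return len(stale)
```

A real service would also need to persist its leases and expose `acquire` over the network to the instances; both are omitted here.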
The currently favored approach is to use a proxy server like Squid to limit the number of servers that require IP addresses. Two or more instances running Squid as a forward proxy, distributed across multiple availability zones and fronted by an Elastic Load Balancer for HA, would provide a redundant and fairly high-performance solution. For now, as a workaround, we have staggered the auto-scaling policies to reduce the chance of multiple instances spinning up at the same time. Staggering is a serviceable solution for testing, but not for production, where multiple auto-scaled farms of servers will need access to the client’s data center tier. Eventually, we will move forward with the proxy or VPC solution.
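On the application side, the proxy approach only requires that outbound calls to the data center go through the load-balanced proxy endpoint. A minimal sketch using Python’s standard `urllib`, assuming a hypothetical proxy address such as `http://squid.internal.example:3128` (the helper name `build_proxied_opener` is also ours, for illustration):

```python
import urllib.request


def build_proxied_opener(proxy_url):
    """Return a URL opener that sends HTTP and HTTPS requests via proxy_url."""
    proxy = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    # Passing our own ProxyHandler replaces the default one that would
    # otherwise read proxy settings from the environment.
    return urllib.request.build_opener(proxy)
```

An application instance would then call `build_proxied_opener(...).open(url)` for requests to the secured web service, and only the proxy instances would need known, registered addresses.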
In summary, enterprises with complex interdependent applications can face interesting challenges when migrating to the cloud. Resources as simple as IP addresses can function in a fundamentally different way than a typical IT organization is used to. Oftentimes this can lead to fear, uncertainty, and doubt, but the benefits of Infrastructure as a Service are clear: ease of provisioning, demand-based resource allocation rather than overprovisioning, etc. As long as proper planning, system architecture, implementation, and testing are performed, a complex enterprise can begin making its way to the cloud and begin to eliminate the FUD on the ground.
Enterprise Clients Continue To Warm To The Cloud
Lately we’ve been working with clients that haven’t been the typical EC2 infrastructure consumer. Historically, it has been the startup companies we work with that have been interested in AWS for all the expected reasons: flexibility, pay-for-what-you-need, access to higher-end services like load balancing and HA database deployments, etc. Recently we have been noticing that our more established enterprise clients have taken an interest in these capabilities, and for largely the same reasons.
Large enterprises looking at cloud infrastructure bring their own requirements and challenges. We plan to write a series of blog posts about Control Group’s experiences with these types of clients and what we learned. Some of the posts will be about the projects and their politics, and some will be about the technology approach. There are some interesting technology and organizational challenges that we will discuss, so stay tuned.
