14 Common Mistakes That Will Derail Your Application's Growth on AWS

You’ve built and launched a great product. Customers are liking it and you’re getting some nice usage growth. Your revenue targets are looking good. What can go wrong?

Well, if you make some of the mistakes below, there are a LOT of things that can go wrong as you grow your application.

OK, but what does it mean to grow an application?

An application that is growing successfully has the following characteristics:

Obviously, it has an increasing number of users.
It can handle this increasing number of users without degrading their experience. The application responds fast to more transactions and it doesn’t crash.
The cost of running the application’s infrastructure is kept under control. Cost will increase with usage, but at a sustainable rate for the business. There might be inefficiencies at the beginning, but in the long term it should achieve an economy of scale.
There are no security vulnerabilities introduced due to new features or the overall chaos that comes with fast growth.
New features and fixes can be delivered quickly to your users.
The team behind the application doesn’t grow proportionally with usage growth.
There’s a point where the application MAKES MONEY. Without this ingredient, nothing else matters.

Let’s get started…

#1 Biggest Mistake: thinking that growing an application is ONLY about transaction volume

“This is AWS, I have an elastic architecture, don’t you see?”, “I can add bigger servers when I need them”, “My app can handle 1,000,000 customers”, “I already have low response times”, “Of course, I’m using Auto Scaling!”, “DynamoDB will handle it!”

These are comments that typically come up in conversations about growing applications on AWS. Of course you want AWS components that will scale and you want your applications to handle a lot of transactions, fast. You want your customers to get a great experience when they use your product, at any volume.

Volume and performance are essential when it comes to growing an application, but they’re not the only factors. Thinking that growing an application is only about volume and ignoring everything else is possibly the most common and dangerous mistake a team could make.

There are many other factors, besides transaction volumes, that can turn into big headaches when your applications grow. If you ignore them, your applications could be at risk.

Runner-up: Ignoring business goals and the systems that support them

One of the first steps for growing an application anywhere is to be clear about your business flows and goals. Equally important is to know how your systems support each one of those goals. What are the main business flows that take place in your application? How important are they to your business goals? How much will these transactions grow in the next 3, 6, 12, 36 months? Which system resources are consumed by each business transaction? What are their usage patterns?

Ignoring these questions will easily lead to choosing the wrong AWS services or the wrong configurations, which invariably results in a lot of pain for you, your team and your customers.

Over-engineering or over-thinking things too early

I’m a technical person. I’ve gotten caught many times in the trap of over-engineering things too early: “I need to handle every single edge case for every single component in my application”, “I should have a backup for the backup”, “I need to build this automatic process to recover in case X system fails”, “I need my system to handle 1 million transactions per hour”, etc.

At the beginning, your priorities should be to build a great product, launch it, find customers and keep them happy. Chances are you’re just getting started and unless you’re Amazon or some other tech giant, you don’t need to have every single edge case and failure scenario solved from day-1. That being said, you also need to make sure there are no structural, high impact inefficiencies that will give your customers a bad experience or cripple your growth early on.

Not designing for failure

You don’t want to over-engineer things at the beginning. But you also need to be prepared for failure. Here are two critical areas to consider:

Minimize likelihood of failure. This is achieved by doing proper testing of your applications as well as deployment steps. Having a simple checklist with all your deployment steps can go a long way towards reducing human error. After that, you can always build deployment automation processes. You should also consider using multiple Availability Zones or DB read replicas from day-1. They’re not complicated to configure at all when you use services such as RDS, Auto Scaling or ELB.
Reduce time to recovery. Failures will happen, either internally or externally. That’s just a fact of life. You need to think about which tools and processes to put in place so your applications recover as quickly as possible. These include: escalation, up-to-date documentation, automated recovery, updated runbooks, good logging, appropriate monitoring and alarming, ticketing systems. You don’t need to have the most sophisticated tools and procedures in place from day-1. At the very least you need good logging, monitoring and alarming, and then you can build from there.

Not properly managing Identity, Access and Security from day-1

Your team will grow, you’ll have more integrations with other systems, you’ll use more API keys, you’ll create more IAM Policies, Roles and Users, you’ll set up more EC2 Security Groups, Network ACLs, you’ll need to manage more encryption keys and certificates.

If not managed properly, all these things add up and turn very quickly into a mess. A dangerous one. Given that security is the factor with potentially the most devastating consequences, I really recommend that you don’t lose sight of security best practices from day-1 and as you grow.

Poor management of “technical debt”

I’ll use an analogy from the financial world: when you borrow money, do you prefer to get a loan at a 5% annual interest rate or at a 50% rate?

You see the point? The same principle applies to technical debt. It’s OK to cut some corners here and there and leave some issues unresolved when you launch, if you only have 10 customers. That is borrowing technical debt at a 5% rate: you can always fix those issues later. It’s not OK to leave structural problems unresolved if they will be extremely painful to fix later. Launching with structural problems that will require a re-design or intrusive actions in the future is what I call borrowing technical debt at a 50% interest rate.

Borrow technical debt at 5%, as much as possible. But be aware that you can max out your credit limit in the technical debt bank, even at a 5% interest rate. Know how much your team will be able to fix in the future and manage your “technical debt” accordingly.

Not monitoring enough metrics

It may sound cliche, but the quote “You can’t manage what you can’t measure” is extremely important here. Not having visibility on important metrics is a recipe for disaster. Thanks to the cloud, it’s easier than ever to monitor metrics.

I divide metrics into the following categories:

Business Metrics. They tell you if you’re still making money (or not). They measure in real time your performance against business goals and objectives. Examples: orders submitted, product catalog views, etc.
Customer Experience Metrics. They tell you what your customers see, if they’re having a good day using your application (or not). They reflect “symptoms” when things are not going well. Examples: response times, page load times, HTTP 500 errors, etc.
System Metrics. They tell you the root causes behind Customer Experience metrics. They also tell you if your systems are healthy, at risk, or already creating a bad experience for your customers. Examples: CPU Utilization, memory utilization, disk I/O, queue length, etc.

Create dashboards and make sure you monitor these metrics all the time. Services like CloudWatch make collecting metrics easy, so you should have as many metrics as possible from day-1.
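As a rough sketch of the three categories above, the snippet below builds one custom metric per category in the shape that the boto3 CloudWatch put_metric_data call accepts in its MetricData list. The namespace "MyApp", the metric names, the "Category" dimension and the values are all hypothetical, and the actual AWS call is left as a comment so the example runs without credentials.

```python
from typing import Dict, List

# Hypothetical namespace; pick one that identifies your own application.
NAMESPACE = "MyApp"


def build_metric(name: str, value: float, unit: str, category: str) -> Dict:
    """Return one datum in the shape used by CloudWatch's MetricData list."""
    return {
        "MetricName": name,
        "Value": value,
        "Unit": unit,
        # A dimension lets dashboards group metrics by category.
        "Dimensions": [{"Name": "Category", "Value": category}],
    }


def build_batch() -> List[Dict]:
    """One illustrative metric for each of the three categories."""
    return [
        build_metric("OrdersSubmitted", 42, "Count", "Business"),
        build_metric("PageLoadTime", 850, "Milliseconds", "CustomerExperience"),
        build_metric("QueueLength", 17, "Count", "System"),
    ]


batch = build_batch()
for datum in batch:
    category = datum["Dimensions"][0]["Value"]
    print(category, datum["MetricName"], datum["Value"], datum["Unit"])

# With boto3 installed and credentials configured, the batch could be sent with:
#   import boto3
#   boto3.client("cloudwatch").put_metric_data(Namespace=NAMESPACE, MetricData=batch)
```

Keeping the payload construction separate from the AWS call makes it easy to unit test what you publish before any real metrics are sent.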

Choosing the wrong AWS region

A very common mistake is to choose an AWS region simply based on the location of your customers. While that’s a great start, it’s by far not the only factor to consider. Not all AWS regions are created equal: there are substantial differences in price and feature availability.

If you think you’ll use a certain AWS service in the near future, make sure it’s available in your chosen region. For example, it took about 2 years before Lambda was available in the N. California region. If you had been counting on Lambda in this region back in 2015, you would have waited a long time for it. Similarly, new regions like Montreal and London didn’t have Lambda when they were first announced, and it might take some time before it’s available there.

Some configurations can cost almost double, depending on the region you choose. Data transfer in São Paulo is 177% more expensive compared to N. Virginia. Even within the US, some EC2 instances (like a t2.large) in N. California will cost you 30% more compared to N. Virginia.
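To make those percentages concrete, here is a small arithmetic sketch. Note that “177% more expensive” means paying 2.77 times the base price, not 1.77 times. The $100 monthly line item is a hypothetical figure chosen only to keep the numbers easy to read; the function name is also just for illustration.

```python
def price_with_premium(base_price: float, percent_more: float) -> float:
    """Price after applying an 'X% more expensive' premium to a base price."""
    return base_price * (1 + percent_more / 100)


# Hypothetical $100 monthly line item in the cheaper region.
base = 100.0

# Premiums taken from the text: 177% for data transfer, 30% for a t2.large.
data_transfer = round(price_with_premium(base, 177), 2)
ec2_instance = round(price_with_premium(base, 30), 2)

print(f"Data transfer: ${data_transfer:.2f} instead of ${base:.2f}")
print(f"EC2 instance:  ${ec2_instance:.2f} instead of ${base:.2f}")
```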

For more on AWS regions, read this article.

Not knowing your AWS infrastructure limits (way) ahead of time

The AWS infrastructure you provisioned on day-1 will eventually reach a limit and crash. Not knowing what that limit is and when it will be reached is a sure way to hit a wall.

I recommend the following:

Execute load tests. Thanks to the cloud and the number of open source tools available today, there’s really no excuse to skip load tests. You should know exactly what type of traffic your applications can handle with your current AWS components. You should know what the customer experience will be and you should know at which point things will start to break. You should have a very good reason to avoid load tests before launch. But if you decide to launch without executing load tests first, you should really make them a priority right after launching.
Do Capacity Planning. Based on load test results and your growth goals, you should know well in advance when it’s time to beef up your AWS infrastructure. Don’t wait until your customers are having a bad day using your product. And above all things, don’t wait until 10,000 customers are having a bad day!
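A minimal sketch of the capacity planning step, assuming steady compound growth: given today’s load, the breaking point found in load tests and an expected monthly growth rate, it counts how many months remain before the limit is exceeded. The function name and all figures are hypothetical, and real traffic rarely grows this smoothly, so treat the result as a rough planning horizon.

```python
def months_until_limit(current_load: float, max_load: float, monthly_growth: float) -> int:
    """Count the months of compound growth until load exceeds max_load."""
    if current_load <= 0 or monthly_growth <= 0:
        raise ValueError("current_load and monthly_growth must be positive")
    months = 0
    load = current_load
    while load <= max_load:
        load *= 1 + monthly_growth
        months += 1
    return months


# Hypothetical numbers: 200 requests/s today, load tests showed things
# breaking above 1,000 requests/s, and traffic grows 20% per month.
runway = months_until_limit(200, 1000, 0.20)
print(f"Limit exceeded in about {runway} months")
```

With a number like this in hand, you can schedule infrastructure upgrades a few months before the limit instead of reacting to an outage.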

Not predicting and controlling AWS cost

An application that costs more than it’s worth will eat your profit. It’s that simple. Don’t be that founder that gets an unexpected $10,000 AWS charge at the end of the month. Especially when you only have 10 customers! I’m sure you’d prefer to use your money on other important things - like acquiring customers, for example.

AWS pricing is complicated, but not impossible to calculate. You just need to know the type of resources your application consumes, their quantity and their corresponding AWS price dimension. For example: data transfer (out to the internet, inter-AZ, inter-region), ELB data processed, instance hours, EBS/S3 storage, billable API calls, Lambda executions and memory, etc.

Load tests are a great opportunity to calculate your future AWS bill. A common mistake is to ONLY use load tests to measure response times, throughput and system metrics. When doing load tests, you should also calculate AWS cost for usage today and in the future. You might find some very expensive inefficiencies that you can correct now and avoid that big sticker shock when your application grows. These predictions will help you make cost saving decisions, such as buying Reserved Instances, configuring appropriate S3 Storage Classes or choosing more optimal EC2 instance types, among others.
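The calculation described above can be sketched as a simple sum of quantity times unit price per price dimension. Every unit price and usage figure below is hypothetical, so substitute the current prices for your region. Scaling usage linearly is also an assumption: it ignores tiered pricing, free tiers and Reserved Instance discounts.

```python
from typing import Dict

# Hypothetical unit prices in USD; replace with current prices for your region.
UNIT_PRICES: Dict[str, float] = {
    "instance_hours": 0.10,
    "data_transfer_out_gb": 0.09,
    "s3_storage_gb": 0.02,
    "lambda_million_requests": 0.20,
}


def estimate_monthly_cost(usage: Dict[str, float], prices: Dict[str, float]) -> float:
    """Sum quantity * unit price over every price dimension in `usage`."""
    unknown = set(usage) - set(prices)
    if unknown:
        raise KeyError(f"no price for: {sorted(unknown)}")
    return sum(quantity * prices[dimension] for dimension, quantity in usage.items())


def scale_usage(usage: Dict[str, float], factor: float) -> Dict[str, float]:
    """Naive projection: assume every dimension grows by the same factor."""
    return {dimension: quantity * factor for dimension, quantity in usage.items()}


# Hypothetical usage measured during a load test, extrapolated to one month.
measured = {
    "instance_hours": 1440,
    "data_transfer_out_gb": 500,
    "s3_storage_gb": 1000,
    "lambda_million_requests": 10,
}

today = estimate_monthly_cost(measured, UNIT_PRICES)
at_10x = estimate_monthly_cost(scale_usage(measured, 10), UNIT_PRICES)
print(f"Estimated bill today: ${today:,.2f}")
print(f"Estimated bill at 10x usage: ${at_10x:,.2f}")
```

Breaking the total down per dimension is often the most useful part, since it shows which resource will dominate the bill as usage grows.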

Avoiding automation

Not having automation in place is the best way to guarantee your team won’t scale.

You don’t need to have automation mechanisms for every single task, but you need to have some automation and gradually add tasks. Don’t wait until your team is overwhelmed with manual, tedious, time sucking activities.

I recommend doing the following exercise regularly:

Ask your team about the most painful manual tasks they execute. Measure by time spent and level of frustration.
Based on that, sort those tasks from most urgent to least urgent.
Identify an automation task for each manual task and estimate how much time it would take to implement.
Make a decision. Prioritize automation for those tasks your team hates and where you’ll get the biggest bang for your buck.

You don’t need the most detailed project plan, just a list of painful tasks and how to fix them. It shouldn’t take long to figure out the best candidates for automation.
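One possible way to run this exercise is to rank tasks by payback period, meaning the months of saved effort needed to recover the time spent automating, and to use the team’s frustration score as a tie-breaker. The task names, hours and scores below are hypothetical, and the ranking rule is just one reasonable choice.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ManualTask:
    name: str
    hours_per_month: float    # time the team spends on the task today
    frustration: int          # 1 (mild) to 5 (hated), as reported by the team
    hours_to_automate: float  # estimated one-off implementation effort

    def payback_months(self) -> float:
        """Months of saved effort needed to recover the automation cost."""
        if self.hours_per_month <= 0:
            raise ValueError("hours_per_month must be positive")
        return self.hours_to_automate / self.hours_per_month


def prioritize(tasks: List[ManualTask]) -> List[ManualTask]:
    # Shortest payback first; higher frustration wins when payback is equal.
    return sorted(tasks, key=lambda t: (t.payback_months(), -t.frustration))


# Hypothetical answers collected from the team.
tasks = [
    ManualTask("log cleanup", hours_per_month=2, frustration=2, hours_to_automate=16),
    ManualTask("manual deployments", hours_per_month=20, frustration=5, hours_to_automate=40),
    ManualTask("environment setup", hours_per_month=10, frustration=4, hours_to_automate=30),
]

for task in prioritize(tasks):
    print(f"{task.name}: pays back in {task.payback_months():.1f} months")
```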

Not using CloudFormation

CloudFormation has a bit of a learning curve and some templates can take time to build. But it’s worth it. Not using a tool like CloudFormation means you’ll have to manually create environments and it also means you won’t have a nice way to keep track of configuration changes.

Imagine you can create a big production-like environment in less than 10 minutes, run some load tests, shut it down and launch it again in the morning for more testing. Or rollback some configuration change that didn’t work well in production, in minutes.

CloudFormation will save you an enormous amount of headaches - not only when your application grows, but before your initial launch. Not using CloudFormation is asking for trouble in the not too distant future.
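To give a feel for what a template looks like, here is a minimal sketch that builds a CloudFormation template as a Python dictionary and prints it as JSON. It declares a single versioned S3 bucket. The logical ID "LoadTestArtifacts" and the description are hypothetical, and a real production-like environment would declare many more resources.

```python
import json


def build_template(bucket_logical_id: str = "LoadTestArtifacts") -> dict:
    """Return a minimal CloudFormation template with one S3 bucket."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Minimal example stack (illustrative only)",
        "Resources": {
            bucket_logical_id: {
                "Type": "AWS::S3::Bucket",
                "Properties": {
                    # Keep old object versions so changes can be rolled back.
                    "VersioningConfiguration": {"Status": "Enabled"},
                },
            }
        },
        "Outputs": {
            # Ref on an S3 bucket resolves to the generated bucket name.
            "BucketName": {"Value": {"Ref": bucket_logical_id}},
        },
    }


template_json = json.dumps(build_template(), indent=2)
print(template_json)
```

Because the template is plain text, it can live in version control next to your application code, which is what gives you a history of configuration changes.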

Not doing Continuous Integration/Delivery/Deployment

This is another way to guarantee your team won’t scale.

You don’t need to have the most sophisticated workflows in place from day-1, but you need to at least have a pipeline, which you’ll gradually upgrade. Once you start getting more feature requests, bugs, and your codebase grows, it will be more difficult to implement a good foundation.

With AWS products such as CodeCommit, CodeBuild, CodeDeploy and CodePipeline, there’s really no excuse for postponing at least some basic code automation.

Not updating documentation

I find writing documentation just as boring as anyone else. We all want to build exciting stuff, not write documents, right? Well, not documenting critical areas of your applications is a sure way to inflict pain on your team. And just in case, let me clarify that source code doesn’t count as documentation.

You don’t have to document everything from day-1, just have the right tools in place and gradually build a knowledge base. Do this every day and you’ll be grateful you did so at an early stage. Remember, that senior engineer who knows everything about your app might not be with you next month. Or simply won’t have time to teach others when you grow your team and your applications.

Some recommendations:

Runbooks. Wikis, shared documents or runbook automation tools like Ansible, Chef, Puppet, etc.
Operational procedures. Wikis, shared documents.
Checklists. It doesn’t matter how much you automate, there will always be manual tasks. Just make sure they’re documented somewhere. Checklists are a great way to document steps and to reduce human error.

To recap

Growing an application on AWS is not only about handling high transaction volumes. There are many other factors to consider.
Business flows and goals should drive how you scale your systems.
Don’t over-engineer things too early. Unless you already have a lot of customers.
Design for failure: reduce its likelihood and reduce recovery time.
Manage technical debt wisely. Avoid “death by a thousand cuts” and don’t leave structural problems unresolved.
Monitor as many metrics as possible, from day-1.
Choose the right AWS region for your applications. Consider user proximity, cost, redundancy options and feature availability.
Know your AWS infrastructure limits. Execute load tests and do capacity planning.
Predict AWS cost from the beginning. Avoid sticker shock and save potentially thousands of dollars.
Don’t forget to automate. Start simple and add automation tasks gradually.
Use CloudFormation. It has a learning curve and it takes some time to build templates, but it’s really worth it.
Do Continuous Integration/Delivery/Deployment as early as possible. Start simple and add more features.
Manage Identity, Access and Security from day-1.
Document your applications. Start simple and gradually build a knowledge base. Your future team members will thank you for it.