How to operate reliable AWS Lambda applications in production

* Latest update: February 22nd, 2017 - added the IteratorAge CloudWatch metric.

AWS Lambda is the leading product when it comes to “serverless” computing, or Function as a Service (FaaS). With AWS Lambda, computing infrastructure is entirely managed by AWS, meaning developers can write code and immediately upload and run it in the cloud, without launching EC2 instances or any type of computing infrastructure.

This is a great thing, as it brings a lot of agility to product development. However, running a reliable Lambda application in production requires you to still follow operational best practices. In this article I am including some recommendations, based on my experience with operations in general as well as working with AWS Lambda.

Let’s start with what I see as rule #1 of AWS Lambda…

Just because AWS manages computing for you, doesn’t mean Lambda is hands-off.

I’ve heard and read many comments that suggest Lambda releases developers from the burden of doing operational work. It doesn’t. Using AWS Lambda only means you don’t have to launch, scale and maintain EC2 infrastructure to run your code in AWS (which is great). But essentially everything else regarding operations remains the same, just packaged differently. Running an application on AWS Lambda that reliably generates revenue for your business requires the same amount of discipline as any other software application.

… and here are some recommendations:

Monitor CloudWatch Lambda metrics

As of today, there are 6 Lambda metrics available in CloudWatch:

  • Duration. Billed duration is rounded up to the nearest 100ms interval, so the longer the execution, the more you will pay. You also have to make sure this metric is not running dangerously close to the function timeout. If it is, either find ways to make the function run faster or increase the function timeout.
  • Errors. This metric should be analyzed relative to the Invocations metric. For example, it’s not the same to see 1,000 errors in a function that executes 1 million times a day compared to a function that executes 10,000 times a day. Is your application’s acceptable error rate 1%, 5%, 10% within a period of time? CloudWatch doesn’t support metric math (i.e. Errors/Invocations), therefore you will have to alarm based on a fixed number of errors, using your expected invocations as context - and update this number regularly. So, this is an operational task you have to keep in your plans. Later in this article, I will cover Lambda’s retry behavior in case of errors.
  • Invocations. Use this metric to determine your error tolerance, as mentioned above. If your Invocations change, your alarming on Errors should change as well, in order to keep your error tolerance percentage constant. This metric is also good for keeping an eye on cost: the more invocations, the more you will pay. To forecast pricing, consider not only invocations but also the memory you have allocated to your function, since this determines the GB-seconds you will pay for. Also, when do zero invocations start to tell you there’s something wrong? 5 minutes, 1 hour, 12 hours? I recommend setting up alarms when this number is zero for a period of time, since zero invocations likely means there is a problem with your function trigger.
  • Throttles. So you have a popular function, right? If you expect your function to execute above 1,000 TPS, or above 100 concurrent executions, submit a limit increase request to AWS - or you risk experiencing throttled executions. This should be part of your regular capacity planning. I recommend setting up an alarm when the Invocations metric for each function gets close to the number you assigned in your capacity planning exercise. If your function is being throttled, that’s obviously bad news, so you should alarm on this metric too.
  • DLQ Errors. Lambda gives you a great feature called Dead Letter Queue. Basically it allows you to write the payload from failed executions to an SQS queue or SNS topic of your choice, so it can be processed or analyzed later. If for some reason you can’t write to the DLQ, you should know about it. That’s what the DLQ Errors metric tells you. Lambda increments this metric each time a payload can’t be written to the DLQ destination.
  • IteratorAge. If you have Lambda functions that process incoming records from Kinesis or DynamoDB streams, you want to know as soon as possible when records are not being processed as quickly as they need to be. This metric reports the age of the last record processed in each batch, and alarming on it will help you prevent your applications from building a dangerous backlog of unprocessed records.
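Most of these alarms can be scripted. As a sketch using boto3 (the function name, threshold and period below are assumptions, not values from this article), here is how you could build a fixed-threshold alarm on the Errors metric:

```python
def error_alarm_params(function_name, max_errors, period_minutes=5):
    """Build parameters for a CloudWatch alarm on Lambda's Errors metric.

    The threshold is a fixed error count derived from your expected
    invocations, since you can't alarm directly on Errors/Invocations.
    Remember to revisit this number as your invocation volume changes.
    """
    return {
        "AlarmName": "%s-errors" % function_name,
        "Namespace": "AWS/Lambda",
        "MetricName": "Errors",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": period_minutes * 60,   # seconds
        "EvaluationPeriods": 1,
        "Threshold": float(max_errors),
        "ComparisonOperator": "GreaterThanThreshold",
    }

# To create the alarm (requires AWS credentials):
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**error_alarm_params("my-function", 50))
```

Updating the threshold as invocations grow then becomes a small script change rather than console work.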

Allocate the right capacity for your function

Do you see anything wrong with this message?

Duration: 702.16 ms Billed Duration: 800 ms Memory Size: 512 MB Max Memory Used: 15 MB

This log entry is telling you that you’re paying for over-provisioned capacity. If your function consistently requires 15MB of memory, you should allocate 128MB (the minimum possible), not 512MB. Here is a price comparison between the two configurations, assuming 100 million monthly executions:

Memory (MB) | 100 million x 800ms
128         | $166.4
512         | $667.2

If you think 100 million executions is a large number, you’re right. But once you start using Lambda seriously, for processing CloudTrail records, Billing, Kinesis, S3 events, API Gateway and other sources, you will see that executions add up really fast and you’ll easily reach 100 million executions.

If you were using this function at a rate of 100 million executions per month, you would pay approximately $8,000 per year with 512MB instead of $1,996 with 128MB. That’s money you could use on more valuable things than an over-provisioned Lambda function.
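To see how memory allocation drives cost, here is a rough cost model in Python (the pricing constants are the publicly listed per-request and per-GB-second rates at the time of writing; treat the output as an estimate, not a quote):

```python
import math

GB_SECOND_PRICE = 0.00001667      # USD per GB-second
REQUEST_PRICE = 0.20 / 1000000    # USD per invocation

def monthly_cost(invocations, avg_duration_ms, memory_mb):
    """Estimate monthly Lambda cost, ignoring the free tier."""
    billed_ms = math.ceil(avg_duration_ms / 100.0) * 100   # billed in 100ms increments
    gb_seconds = invocations * (billed_ms / 1000.0) * (memory_mb / 1024.0)
    return invocations * REQUEST_PRICE + gb_seconds * GB_SECOND_PRICE

# A 702.16ms execution is billed as 800ms; at 100 million monthly
# executions, dropping from 512MB to 128MB cuts the compute portion
# of the bill to a quarter.
```

Running monthly_cost(100000000, 702.16, 512) against monthly_cost(100000000, 702.16, 128) shows roughly the $500/month gap behind the table above.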

Use CloudWatch Logs Metric Filters

The AWS Lambda service automatically writes execution logs to CloudWatch Logs. CloudWatch Logs has a very cool feature called Metric Filters, which allow you to identify text patterns in your logs and automatically convert them to CloudWatch Metrics. This is extremely handy, so you can easily publish application metrics to CloudWatch. For example, every time the text “submit_order” is found in CloudWatch Logs, you could publish a metric called “SubmittedOrders”. You can then create an alarm if this metric drops to zero within a period of time.

Something very important about using Metric Filters is that as long as there is a consistent, identifiable pattern in your Lambda function output, you don’t need to update your function code if you want to publish more custom CloudWatch metrics. All you have to do is configure a new Metric Filter. Even better, Metric Filters are supported in CloudFormation templates, so you can automate their creation and keep track of their history.
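As a sketch (the log group, filter name and metric namespace below are hypothetical), the “submit_order” example could be configured through boto3’s put_metric_filter call:

```python
def order_metric_filter(log_group):
    """Parameters for a Metric Filter that counts 'submit_order' log lines."""
    return {
        "logGroupName": log_group,
        "filterName": "submitted-orders",
        "filterPattern": '"submit_order"',
        "metricTransformations": [{
            "metricName": "SubmittedOrders",
            "metricNamespace": "MyApp",   # hypothetical namespace
            "metricValue": "1",           # count one per matching log line
        }],
    }

# To apply it (requires AWS credentials):
#   import boto3
#   boto3.client("logs").put_metric_filter(**order_metric_filter("/aws/lambda/my-function"))
```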

When something fails, make sure there is a metric that tells you about it

When it comes to operations, nothing is more dangerous than being blind to errors in your system. Therefore, you should know there are error scenarios in Lambda that don’t result automatically in an Error metric in CloudWatch.

  • Python. Unhandled exceptions automatically result in a CloudWatch Error metric. But if your code swallows exceptions, there will be no record of them and your function execution will succeed, even if something went wrong. Logging errors using logger.error will only result in an [ERROR] line in CloudWatch Logs, not in a CloudWatch metric, unless you create a Metric Filter in CloudWatch Logs that searches for the text pattern “[ERROR]“.
  • NodeJS v4.3. If the function ends with a callback(error) line, Lambda will automatically report an Error metric to CloudWatch. If you end with console.error(), Lambda will only write an error message in CloudWatch Logs and no metric, unless you configure a Metric Filter in CloudWatch Logs.
  • NodeJS v0.10.42. If you don’t call context.succeed(Object result) or context.done(Error error, Object result) to indicate function completion, you will get the infamous “Process exited before completing request” error. At least this error automatically results in a CloudWatch Error metric, but it often lacks context for effective troubleshooting. If you end the execution with context.fail(Error error), you will also get an automatic CloudWatch Error metric.

You can also use Metric Filters to identify application errors.
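For the Python case, one pattern that produces both the [ERROR] log line and the CloudWatch Error metric is to log the exception and then re-raise it. A minimal sketch (process and its validation rule are hypothetical):

```python
import logging

logger = logging.getLogger()

def handler(event, context):
    try:
        return process(event)          # hypothetical business logic
    except Exception:
        # Log for CloudWatch Logs (and Metric Filters), then re-raise
        # so Lambda records an Error metric instead of a false success.
        logger.error("processing failed", exc_info=True)
        raise

def process(event):
    if "order_id" not in event:
        raise ValueError("missing order_id")
    return {"status": "ok", "order_id": event["order_id"]}
```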

Know what happens “under the hood” when a Lambda function is executed

AWS uses container technology to assign resources to each Lambda function. Therefore, each function has its own environment and resources, such as memory and file system. When you execute a Lambda function, one of two things happens: 1) a new container is instantiated, or 2) an existing container is reused for this execution.

You have no control over whether your execution will run on a new or an existing container. Typically, functions that run in quick succession are executed on an existing container, while infrequently invoked functions have to wait for a new container to be instantiated. Therefore, there is a performance difference between the two scenarios. The difference is typically in the millisecond range, but it will vary by function.

Also, it’s important to differentiate between 1) a function and 2) a function execution. While a function has its own isolated environment (container), multiple executions of the same function can share resources allocated to that function, such as global variables. Therefore, it is possible for one execution to access data left behind by another.

I recommend the following:

  • Run a load test for your particular function and measure how long it takes to execute during the first few minutes, compared to successive executions. If your use case is very time sensitive, container initialization time might become an operational issue.
  • Never use global variables (those outside your function handler) to store any type of sensitive data, or data that is specific to each function execution.
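The container-reuse behavior is easy to observe. In this sketch, a module-level flag survives between invocations on the same container - which is exactly why per-execution or sensitive data must never live there:

```python
# Module-level state survives across invocations when Lambda reuses a
# container. That makes it fine for caches or SDK clients, but
# dangerous for per-request or sensitive data.

_cold_start = True

def handler(event, context):
    global _cold_start
    cold = _cold_start
    _cold_start = False   # later invocations on this container are "warm"
    # Bad idea (for illustration only): storing request data in a
    # global here would let one execution read another's values.
    return {"cold_start": cold}
```

Calling the handler twice in the same process mimics two invocations landing on the same container: the first reports a cold start, the second does not.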

Treat event triggers, functions and final targets as a single environment (dev, test, prod, etc.)

In an event-driven architecture, Lambda functions are not isolated components. Typically a Lambda function is an intermediate step between an event and a final destination. Lambda functions can be automatically triggered by a number of AWS services (e.g. API Gateway, S3 events, CloudWatch Events), and the Lambda function either transforms or forwards request data to a final target (e.g. S3, DynamoDB, Elasticsearch). Some events contain only a reference to the data, while other events contain the data itself.

Something like this:

[Diagram: Lambda events chain]

Ideally, there should be independent stages that contain their own set of AWS components for events, functions and data stores. For most system owners this is an obvious point. However, I’ve seen operational issues that stemmed from not following this basic principle.

This is probably because it’s so quick to build a service from the ground up using Lambda that it’s also easy to forget about operational best practices.

There are frameworks you can use to alleviate this problem, such as Serverless, Chalice or ClaudiaJS. You need to keep track of each component’s version and group them into a single environment version. This practice is really not too different from what you would need to do in any service oriented architecture before Lambda.

There is also the AWS Serverless Application Model, which allows you to define multiple components of your serverless application (API Gateway, S3 events, CloudWatch Events, Lambda functions, Dynamo DB tables, etc.) as a CloudFormation stack, using a CloudFormation template.

Don’t use the AWS Lambda console to develop Production code

The AWS Lambda console offers a web-based code editor that you can use to get your function up and running. This is great to get a feel of how Lambda works, but it’s neither scalable nor recommended.

Here are some disadvantages of using the Lambda console code editor:

  • You don’t get code versioning automatically. If you make a bad mistake and hit the Save button, that’s it, your working code is gone forever.
  • You don’t get integration with GitHub or any other code repository.
  • You can’t import modules beyond the AWS SDK. If you need a specific library, you will have to develop your function locally, create a .zip file and upload it to AWS Lambda - which is what you should be doing from the beginning anyway.
  • If you’re not using versions and aliases, you’re basically tinkering with your LIVE production code with no safeguard whatsoever! Did I mention there is no version control?

Test your function locally

Since you shouldn’t use the Lambda console code editor, you’ll have to write your code locally, package it into a .zip file and upload it to Lambda. Even though you can easily automate these steps, it’s still a tedious process that you’ll want to minimize.

That’s why you’ll want to test your function before you upload it to AWS Lambda. Thankfully, there are tools that let you test your function locally, such as Python Lambda Local, or Lambda Local (for NodeJS). These tools let you create event and context objects locally, which you can use to create test automation scripts that will give you a good level of confidence before you upload your function code to the cloud.
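Even without those tools, you can get far by invoking the handler directly with fabricated event and context objects. A minimal sketch (the handler, its event shape and the fake context attributes are hypothetical):

```python
def handler(event, context):
    """The function under test (hypothetical)."""
    return {"greeting": "hello %s" % event["name"]}

class FakeContext(object):
    """Just enough of the Lambda context object for local testing."""
    function_name = "my-function-local"
    memory_limit_in_mb = 128

    def get_remaining_time_in_millis(self):
        return 300000

if __name__ == "__main__":
    result = handler({"name": "world"}, FakeContext())
    assert result == {"greeting": "hello world"}
    print("local test passed")
```

Scripts like this can run in your CI pipeline on every commit, well before any .zip file reaches AWS.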

You should consider these local tests as your first gate, but not your only one. And this takes us to the next point…

Automate integration tests and deployments, just like any other piece of software

With AWS Lambda, you can implement a typical Continuous Integration flow and automate it using a service such as AWS CodePipeline or any CI tool of your choice. A common flow would look like this:

[Diagram: Lambda events chain]

Understand the retry behavior of your architecture in case of function failure

You can configure a number of AWS triggers to invoke your Lambda function. But what happens when your Lambda execution fails? (note I use the word “when” and not “if”). Before you decide if a particular function trigger is a good fit for your application, you must know how they handle Lambda function failures.

A Lambda function can be invoked in two ways, each with a different error retry behavior:

  • Synchronously. Retries are the responsibility of the caller (the trigger).
  • Asynchronously. Retries are handled by the AWS Lambda service.
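When you invoke a function through the AWS SDK, you choose the invocation type yourself. A sketch with boto3 (the function name and payload are hypothetical): 'Event' gives you the asynchronous behavior, 'RequestResponse' the synchronous one:

```python
import json

def invoke_params(function_name, payload, asynchronous=False):
    """Parameters for lambda.invoke(). 'Event' means asynchronous (the
    Lambda service retries failures); 'RequestResponse' means synchronous
    (the caller sees the error and decides whether to retry)."""
    return {
        "FunctionName": function_name,
        "InvocationType": "Event" if asynchronous else "RequestResponse",
        "Payload": json.dumps(payload),
    }

# To invoke (requires AWS credentials):
#   import boto3
#   boto3.client("lambda").invoke(**invoke_params("my-function", {"order_id": 42}, asynchronous=True))
```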

Here is a summary of AWS triggers and what they do if their corresponding Lambda function fails:

AWS Trigger | Invocation | Failure Behavior
S3 Events | Asynchronous | Retried at least 3 times over 24 hours
Kinesis Streams | Synchronous | Retried until success; blocks the stream until success or data expiration (24 hours to 7 days)
SNS | Asynchronous | Up to 50 retries
SES | Asynchronous | (N/A in AWS documentation)
Cognito | Synchronous | (N/A in AWS documentation)
DynamoDB Streams | Synchronous | Retried until success; blocks the stream until success or data expiration (24 hours)
API Gateway | Synchronous | API Gateway returns the error to the client
CloudWatch Logs Subscriptions | Asynchronous | (N/A in AWS documentation)
CloudFormation | Asynchronous | (N/A in AWS documentation)
CloudWatch Events | Asynchronous | (N/A in AWS documentation)
AWS SDK | Both synchronous and asynchronous | Your application specifies retry behavior

You can also use this information to choose the right criteria for CloudWatch Alarms. For example, a single failure that blocks a whole Kinesis stream is likely a serious issue and you might want to lower your alarm threshold. But you might want to have a different alarming criteria for a single failure in a function triggered by SNS, when you know it will be retried up to 50 times.

You can read more about Lambda event sources here.

In case of failure, don’t forget to use Dead Letter Queues

Dead Letter Queues allow you to send the payload of failed Lambda executions to a destination of your choice, which can be an SQS queue or an SNS topic. This is great for failure recovery, since you can reprocess failed events, analyze them and fix them. Here are some examples where DLQs could be very useful:

  • Your downstream systems fail, which causes your Lambda executions to fail. In this case, you can recover the payloads from the failed executions and re-execute them once your downstream systems recover.
  • You encounter an application error, or edge case. You can always analyze the records in your DLQ, correct the problem and re-execute as needed.

And there’s also the DLQ Errors CloudWatch metric, in case you can’t write payloads to the DLQ. This gives you even more protection and visibility to quickly recover from failure scenarios.
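Attaching a DLQ is a one-call configuration change. A sketch with boto3 (the function name and queue ARN below are made-up examples):

```python
def dlq_config(function_name, target_arn):
    """Parameters to attach a Dead Letter Queue (an SQS queue or SNS
    topic ARN) to a Lambda function."""
    return {
        "FunctionName": function_name,
        "DeadLetterConfig": {"TargetArn": target_arn},
    }

# To apply it (requires AWS credentials):
#   import boto3
#   boto3.client("lambda").update_function_configuration(
#       **dlq_config("my-function", "arn:aws:sqs:us-east-1:123456789012:my-function-dlq"))
```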

Don’t be too permissive with IAM Roles

As you might know, when you create a Lambda function you have to link an IAM Role to it. This IAM role gives the function permissions to execute AWS APIs on your behalf. In order to specify which permissions, you attach a policy to each role. Each policy includes which APIs can be executed and which AWS resources can be accessed by these APIs.

My main point is, avoid an IAM access policy that looks like this:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "*",
      "Resource": "*"
    }
  ]
}

A Lambda function with this policy can execute any type of operation on any type of AWS resource, including accessing keys in KMS, creating more admin IAM roles or IAM users, terminating all your EC2 instances, accessing customer data stored in Dynamo DB or S3, etc. Let’s say you have a Lambda function under development with this access policy. If that’s the case, you’re basically opening a door to all your AWS resources in that account, to any developer or contractor in your organization.

Even if you trust the members of your team 100% (which is OK), or you are the only developer in your AWS account, an over-permissive IAM Role opens the door to potentially devastating, honest mistakes, such as deleting or updating the wrong resources.
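As a counterexample to the wildcard policy above, here is a sketch of a least-privilege policy built as a Python dict (the table ARN and the exact set of actions are hypothetical - grant only what your function actually calls):

```python
import json

def least_privilege_policy(table_arn):
    """A scoped-down alternative to Action '*' / Resource '*': only the
    DynamoDB calls this hypothetical function needs, on one table."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["dynamodb:GetItem", "dynamodb:PutItem"],
            "Resource": table_arn,
        }],
    }

# json.dumps(least_privilege_policy(...)) is the JSON document you would
# attach to the function's role.
```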

Here are some ways to minimize risks associated with granting IAM permissions to your Lambda functions:

  • Not everyone in your company should have permissions to create and assign IAM Roles. Just be careful to avoid creating too much bureaucracy or slowing down your developers.
  • Start with the minimum set of IAM permissions and add more as your function needs them.
  • Audit IAM Roles regularly and make sure they don’t give more permissions than the Lambda function needs.
  • Use CloudTrail to audit your Lambda functions and look for unauthorized calls to sensitive AWS resources. I created a CloudFormation template for this - you can read more about it in this article.
  • Use different accounts for development, test and production. There is some overhead that comes with this approach, but in general it is the best way to protect your production environments from unintended access or privilege escalation.

Use versions and aliases

As your function evolves, AWS Lambda gives you the option to assign a version number to your function at a particular point in time. You can think of a version as a snapshot of your Lambda function. Versions are used together with aliases, which are names you can use to point to a particular version number. Versions and aliases are very useful for defining the stage your function code belongs to (e.g. DEV, TEST, PROD).

Something like this:

[Diagram: Lambda events chain]

By using versions and aliases, you can promote your code between stages, test it and promote it to PROD when you’re ready. This process can be automated using the AWS API and Continuous Integration tools. All of this can make your deployment process less painful and reduce human error in your operations.

Using versions and aliases is recommended, even if you use different AWS accounts for development, test and production (continue reading).

Use Environment Variables to separate code from configuration

If you’re building a serious software component, most likely you already avoid any type of hard-coded configuration in your code. In the “server” world, an easy solution is to keep configuration somewhere in your file system or environment variables and access those values from your code. This gives a nice separation between application code and configuration, and it allows you to deploy code packages across different stages (e.g. DEV, TEST, PROD) without changing application code - only configurations.

But how do you do this in the “serverless” world, where each function is stateless? Thankfully, AWS Lambda offers the Environment Variables feature for this purpose.
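A minimal sketch of the pattern (TABLE_NAME and STAGE are hypothetical variable names you would set per stage in the function’s configuration; the defaults here are only fallbacks for local runs):

```python
import os

# Read configuration once, at container initialization time.
TABLE_NAME = os.environ.get("TABLE_NAME", "orders-dev")
STAGE = os.environ.get("STAGE", "DEV")

def handler(event, context):
    # The same code package behaves differently per stage, driven
    # purely by Environment Variables - no code changes needed.
    return {"stage": STAGE, "table": TABLE_NAME}
```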

Make sure you can quickly roll back any code changes

Deploying code to production is NEVER risk free. There’s always the possibility of something going wrong and your customers suffering as a result. While you can’t eliminate the possibility of something bad happening, you can always make it easier to roll back any broken code.

Versions and aliases are extremely handy in case of an emergency rollback. All you have to do is point your PROD alias to the previous working version and that’s it. No need to checkout your previous working code, zip it and re-deploy it to AWS Lambda.
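A sketch of such a rollback with boto3 (the function name, alias and version number are hypothetical):

```python
def rollback_params(function_name, alias, previous_version):
    """Parameters for lambda.update_alias(): point the alias back at
    the last known-good published version."""
    return {
        "FunctionName": function_name,
        "Name": alias,
        "FunctionVersion": previous_version,   # version numbers are strings, e.g. "41"
    }

# To roll back (requires AWS credentials):
#   import boto3
#   boto3.client("lambda").update_alias(**rollback_params("my-function", "PROD", "41"))
```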

Test for performance

AWS promises high scalability for your Lambda functions, but there are still resource limits you should be aware of. One of them is that, by default, you can’t have more than 100 concurrent function executions per account. Each execution has limits too: there is a limit to how much data you can store in the temporary file system (512MB) and to the number of threads (1,024). The maximum allowed execution time is 300 seconds, and there are limits to the request and response payload size (6MB).

Therefore, I strongly recommend you identify both steady and peak load scenarios and execute performance tests on your functions. This will give you confidence that your expected usage in Production doesn’t exceed any of the Lambda resource limits.

Also, when executing performance tests, you should quantify the execution time and frequency, so you can estimate the monthly cost of your function, given your expected usage in Production.

Estimate pricing at scale

AWS Lambda offers 1 million free executions per month, and each additional million costs only $0.20. When it comes to price, Lambda is a no-brainer, right? Well, not really. There are situations where Lambda pricing can be substantially higher than running the same workload on EC2. If you have a process that runs infrequently, Lambda will almost always be cheaper than EC2. But if you have a resource-intensive process that runs all the time, at high volume, going serverless might cost you more than EC2.

Let’s say you have a function that runs at a volume of 100 transactions per second (approximately 259 million executions per month). Each execution consumes 1000ms and requires 512MB of memory. You would pay approximately $2,162/month for this function. Let’s say that you can handle the same workload with a cluster of 10 M3.large instances, in which case you would pay $950/month.

The difference? A single Lambda function would cost you $14,500 more per year.

The following table shows the monthly cost of a function that executes at 100 TPS, based on different combinations of execution time and assigned memory (not counting the free tier).

ms   | 128MB  | 512MB    | 1024MB
100  | $53.9  | $216.2   | $432.1
500  | $269.6 | $1,080.9 | $2,160.4
1000 | $539.1 | $2,161.7 | $4,320.9

Conclusions

  • Lambda is a great AWS product. Removing the need to manage servers is a great thing, but running a reliable Lambda-based application in Production requires operational discipline - just like any other piece of software your business depends on.
  • Lambda (aka FaaS or serverless) is a new paradigm and it comes with its own set of challenges, such as configurations, understanding pricing, handling failure, assigning permissions, configuring metrics, monitoring and deployment automation.
  • It is important to perform load tests, understand Lambda pricing at scale and make an informed decision on whether a FaaS architecture brings the right balance of performance, price and availability to your application.

Ernesto Marquez


I am the Project Director at Concurrency Labs Ltd, ex-Amazon (AWS), Certified AWS Solutions Architect and I want to help you run AWS optimally, so your applications reliably generate revenue for your business.

Running an optimal AWS infrastructure is complicated - that's why I follow a methodology that makes it simpler to run applications that will support your business growth.

