When we rent a server, there's a cap on the resources available: CPU, RAM, local storage. Whether we use 10% or 100% of that doesn't really matter. The bill at the end of the month will always be the same (unless we unleash an auto-scaler that spins up servers through the roof).

Spikes in memory usage or execution duration, caused for example by a bug or poorly designed code, don't make costs skyrocket. Our users will probably get mad at the degraded experience, but many of us don't fear the billing statement, because it's fairly predictable.

But that carefree attitude is a bad habit...

Implications for modern cloud infrastructure

For a few years now, I've been diving deep into the AWS serverless stack: Lambda, S3, DynamoDB, Athena, etc. These services are beasts. They can scale quickly to absorb big, really big loads.

Many teams are adopting serverless for a variety of reasons. Because these services scale so easily and quickly, and bill for what is actually consumed, the same carefree attitude can become a big problem.

  • What happens if a Lambda function starts taking 3x longer to execute?
  • How much memory should I allocate to a function?
  • What if my Lambda enters an infinite loop with S3 or another integrated service?

The cloud bill can easily skyrocket and we can waste money in situations like these.
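
To put rough numbers on these questions, here is a back-of-the-envelope sketch of how Lambda cost scales with duration and memory. The prices are assumptions (the commonly published x86 on-demand rates of $0.0000166667 per GB-second and $0.20 per million requests), the workload is hypothetical, and the free tier is ignored, so check the current pricing page for your region before relying on it.

```python
# Assumed on-demand prices (x86); verify against the current AWS pricing page.
PRICE_PER_GB_SECOND = 0.0000166667
PRICE_PER_MILLION_REQUESTS = 0.20


def monthly_lambda_cost(invocations, avg_duration_ms, memory_mb):
    """Estimate monthly Lambda cost in USD, ignoring the free tier."""
    # Billed compute is duration (seconds) times allocated memory (GB).
    gb_seconds = invocations * (avg_duration_ms / 1000) * (memory_mb / 1024)
    compute = gb_seconds * PRICE_PER_GB_SECOND
    requests = (invocations / 1_000_000) * PRICE_PER_MILLION_REQUESTS
    return compute + requests


# Hypothetical workload: 10 million invocations per month at 512 MB.
baseline = monthly_lambda_cost(10_000_000, 200, 512)
slow = monthly_lambda_cost(10_000_000, 600, 512)  # same traffic, 3x slower
print(f"baseline: ${baseline:.2f}, 3x slower: ${slow:.2f}")
```

With the same traffic, the compute part of the bill triples when the function takes 3x longer, while the request charge stays flat. Doubling the allocated memory has the same effect as doubling the duration, unless the extra memory makes the function proportionally faster.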

Big loads can come in two flavors

  1. The app's popularity is soaring, which, in theory, will also bring big revenue;
  2. Performance is not on par with expectations.

What we need is to avoid #2. Easy: just double-check that the code is performant. Ok, but what about things outside our control that can affect overall resource consumption?

That was a lesson I learned while developing a natural language processing project that relied heavily on AWS Lambda. Part of the process involved scraping and mining data from public sources, as well as retrieving large volumes of information from third-party APIs for data enrichment.

That means a lot of IO-bound tasks depending on third-party systems!

You can see where I'm heading: Lambda bills for execution time, so my AWS costs depend heavily on how fast (or slow) these sources answer our requests. And yes, if they start performing badly… ouch, my AWS bill will hurt.

Ok, so I can't control third-party systems, but at the very least I need to know if and when things go south, so that I can take appropriate countermeasures in time.
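
One countermeasure that doesn't depend on the third party is to cap how long each outbound call may wait. The following is a minimal sketch, not the code from my project: the function names, the 5-second cap, and the 500 ms reserve are all hypothetical. It relies only on `get_remaining_time_in_millis()`, which the Lambda Python context object provides, and on the `timeout` argument of `urllib.request.urlopen`.

```python
import urllib.request


def call_timeout_seconds(context, max_seconds=5.0, reserve_ms=500):
    """Pick a timeout for an outbound call from the time the invocation has left."""
    # Keep a small reserve so the function can still clean up and return.
    remaining_ms = context.get_remaining_time_in_millis() - reserve_ms
    return max(0.0, min(max_seconds, remaining_ms / 1000))


def fetch(url, context):
    """Fetch a URL, giving up early instead of paying for a slow upstream."""
    timeout = call_timeout_seconds(context)
    if timeout <= 0:
        raise TimeoutError("no time budget left for an outbound call")
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()
```

A slow upstream then costs at most a few billed seconds per call instead of the function's full configured timeout. It doesn't replace monitoring, though: the calls still fail, and I still need to find out about it.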

AWS offers CloudWatch Logs and Metrics. Those are great services, but they provided only half the visibility I needed. I later found a service that builds upon them and fills the gaps.

A big difference is that CloudWatch was meant for EC2, RDS, and other traditional server infrastructure. The service I mentioned above was designed from the ground up for serverless stacks. So, yeah, it's kind of an unfair comparison...

By extracting more value from the data already available in CloudWatch, it delivers everything I need in a very convenient way. One thing I love about this tool is that I can easily set multiple performance thresholds.

Alerting Configuration

These serverless performance policies keep me on top of my stack.

I can track whether:

  • My functions are being invoked more frequently than expected;
  • Execution time is too long;
  • Percentage of errors is unusually high;
  • Consumption of memory is too high or too low;
  • And virtually anything else a developer might need to monitor...

Each of these items can have a big impact on the AWS bill. Having an automated monitoring system running on my behalf, applying thresholds customized for my use case, allows me to sleep in peace.
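
To make the idea of per-function thresholds concrete, here is a small plain-Python illustration. It is not Dashbird's API or configuration format: the class names, fields, and limit values are all hypothetical, and the metrics would in practice come from CloudWatch or a similar source.

```python
from dataclasses import dataclass


@dataclass
class FunctionStats:
    """Metrics for one function over one time window (values are hypothetical)."""
    invocations: int
    errors: int
    avg_duration_ms: float
    max_memory_used_mb: float
    memory_allocated_mb: float


@dataclass
class Thresholds:
    max_invocations: int
    max_avg_duration_ms: float
    max_error_rate: float          # fraction, e.g. 0.05 for 5%
    min_memory_utilization: float  # below this, memory is over-provisioned
    max_memory_utilization: float  # above this, the function risks running out


def check(stats, limits):
    """Return a list of human-readable threshold violations."""
    alerts = []
    if stats.invocations > limits.max_invocations:
        alerts.append("invoked more often than expected")
    if stats.avg_duration_ms > limits.max_avg_duration_ms:
        alerts.append("execution time too long")
    error_rate = stats.errors / stats.invocations if stats.invocations else 0.0
    if error_rate > limits.max_error_rate:
        alerts.append("error rate unusually high")
    utilization = stats.max_memory_used_mb / stats.memory_allocated_mb
    if utilization > limits.max_memory_utilization:
        alerts.append("memory consumption too high")
    elif utilization < limits.min_memory_utilization:
        alerts.append("memory consumption too low")
    return alerts


limits = Thresholds(100_000, 1_000, 0.05, 0.30, 0.90)
stats = FunctionStats(invocations=120_000, errors=300, avg_duration_ms=2_400,
                      max_memory_used_mb=110, memory_allocated_mb=512)
print(check(stats, limits))
```

In this made-up window the function is flagged for being invoked too often, running too long, and using only a small share of its allocated memory, while its error rate stays within limits.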

Automated alerts are also invaluable. Whenever my app shows the first signs of trouble, I'm alerted right away so that I can act on the issue. Since the tool integrates logs, X-Ray traces, and metrics, I have everything in one place to:

  • Identify the issue;
  • Understand its repercussions;
  • Find the root cause;
  • Apply a fix or mitigation measure.

How do you monitor the performance of your serverless stack? Or, if you don't, how do you sleep, my friend? Please share in the comments.

Full disclosure: I loved the Dashbird service so much that I now work as an advocate for it. ;)

This post is also available on DEV.