When an entire system crashes, we call it a failure. If a component of a system fails, that’s a fault. Fault-tolerance is about modeling a system to withstand one or more faults and still meet its purpose without major hurdles.
AWS Lambda infrastructure is highly unlikely to suffer from a widespread failure. But the software we deploy becomes a component of the system, and it can experience faults.
When a software fault happens, the AWS Lambda platform may automatically invoke the function again with the same event payload. This is called retry behavior.
And that can be a big problem. Much bigger than the fault itself.
Bear with me for the next few paragraphs, as this advice can save you a lot of trouble. Lambda won't magically make your functions fault-tolerant, but you can accomplish that by applying a single concept.
Why is that a big issue?
Let's say you're operating an e-commerce site and AWS Lambda is being used to process customer orders. A person purchases an item and you have a function taking care of the following steps, all in a single run:
- Making sure the item is available in stock
- Processing credit card
- Removing item from stock
- Sending confirmation email
Now suppose the first three steps completed successfully, but a momentary issue in sending the email caused your application to raise an error. The Lambda platform automatically invokes the function again, with the same parameters, and the email is sent successfully. Awesome, isn’t it?
Well, not so fast. Our system just registered a second, unintended purchase for the same customer… and charged their credit card twice!
This process would seldom be implemented like this, but it serves as an illustrative example.
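To make the problem concrete, here is a minimal sketch of that naive handler. Everything in it is hypothetical: the in-memory `stock` and `charges` structures stand in for real inventory and payment systems, `send_email` is rigged to fail once, and the retry is simulated by simply calling the handler a second time with the same event.

```python
# Hypothetical in-memory stand-ins for the inventory, payment, and email systems.
stock = {"sku-1": 5}
charges = []             # every credit card charge that was made
email_failures = [True]  # the first email attempt fails, later ones succeed


def send_email(customer):
    """Pretend to send a confirmation email; fails on the first attempt only."""
    if email_failures and email_failures.pop():
        raise RuntimeError("email service temporarily unavailable")


def handler(event, context=None):
    """Naive order handler: every run repeats every side effect."""
    item = event["item"]
    if stock[item] < 1:                                    # 1. check stock
        raise ValueError("out of stock")
    charges.append((event["customer"], event["amount"]))  # 2. charge the card
    stock[item] -= 1                                       # 3. remove from stock
    send_email(event["customer"])                          # 4. confirmation email
    return "ok"


event = {"customer": "ana@example.com", "item": "sku-1", "amount": 4999}

# Simulate the platform: the first invocation fails at the email step,
# then the same event is delivered again.
try:
    handler(event)
except RuntimeError:
    handler(event)

print(len(charges), stock["sku-1"])
```

After the simulated retry, the customer has been charged twice and two units have left the stock, even though only one order was placed.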
Why does AWS Lambda do that?
Lambda retry behavior is actually a very cool feature, don’t get me wrong. In a distributed system, many things can go wrong. In fact, when things can go wrong, rest assured they will go wrong at some point. AWS takes care of making sure these errors aren’t left buried and the operation has a few more chances to succeed. We surely don’t want to miss the revenue of a sale due to a technical issue.
Deploying fault-tolerant code on Lambda
All right, we see value in the retry behavior, but how can we avoid the headaches such as the double charge example?
There’s a concept called idempotence that comes to our rescue. Wikipedia defines it as a “property of certain operations in mathematics and computer science whereby they can be applied multiple times without changing the result beyond the initial application”.
A good practice to combine with idempotency is the separation of concerns. In the previous example, we had several different operations bundled together. If possible, it would be good to have different functions taking care of each operation. One of the reasons is that idempotency needs to be analyzed and implemented from the perspective of the operation.
Read operations usually do not produce any side effects, so they’re idempotent by nature. In our example, operation #1 (check if an item is available in stock) would be an example of that. In most cases, you won’t need to worry about these, so having them implemented separately will make it easier to manage the rest of your stack.
Storing and deleting a value aren’t idempotent operations by nature, but they can be if we have a unique identifier (UID) for that resource. In our e-commerce scenario, if the customer order has a UID, the storing operation can be performed multiple times without creating multiple different order placements.
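As a rough sketch of that idea, the snippet below stores an order only when its UID has not been seen before. The `orders` dictionary and the `store_order` function are hypothetical stand-ins for a real table. In an actual database, the check and the write would have to happen atomically, for example through a conditional write, which this in-memory version does not attempt to model.

```python
orders = {}  # hypothetical order table, keyed by order UID


def store_order(order_uid, order):
    """Store the order only if this UID has not been seen before.

    Returns True when the order was newly created, False on a repeat.
    """
    if order_uid in orders:
        return False  # a retry: the order already exists, nothing to do
    orders[order_uid] = order
    return True


order = {"customer": "ana@example.com", "items": ["sku-1"]}
created_first = store_order("order-123", order)  # first invocation
created_retry = store_order("order-123", order)  # retried invocation
```

Calling `store_order` any number of times with the same UID leaves exactly one order in the table, which is the property we are after.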
The order UID could be, for instance, a hash of the customer email or username, the purchase timestamp, and a list of items purchased. These variables would be sent as a parameter to our API when the site receives the order request. If the function fails at some point and the invocation is retried, the same order UID would be generated again, meeting the idempotency requirement. Again, this is just for illustration purposes - each circumstance will require proper analysis to find a stable and resilient idempotent implementation.
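Here is one way such a UID could be derived, as a sketch only. The function name `order_uid`, the field names, and the choice of SHA-256 over a JSON encoding are all assumptions made for illustration. The important detail is that the timestamp comes from the request payload: if the function generated it at run time, a retry would produce a different UID.

```python
import hashlib
import json


def order_uid(customer, purchased_at, items):
    """Derive a deterministic order UID from values carried in the event.

    `purchased_at` must come from the request itself, not from the clock
    inside the function, so that a retried invocation yields the same UID.
    """
    payload = json.dumps(
        {
            "customer": customer,
            "purchased_at": purchased_at,
            "items": sorted(items),  # item order should not change the UID
        },
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


uid_first = order_uid("ana@example.com", "2019-05-01T12:00:00Z", ["sku-2", "sku-1"])
uid_retry = order_uid("ana@example.com", "2019-05-01T12:00:00Z", ["sku-1", "sku-2"])
```

Both calls return the same hex string, since the inputs describe the same order.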
For the credit card charging part, many payment platforms support idempotent requests. Stripe, for example, accepts an [idempotency key](https://stripe.com/docs/api/idempotentrequests) with a request, so that you can safely retry it if something goes wrong in transit.
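To illustrate how an idempotency key behaves from the caller's side, here is a small simulation. `FakePaymentGateway` is a hypothetical stand-in written for this article, not Stripe's API or any real client library. It only mimics the general contract: a repeated request with the same key returns the original result instead of charging again.

```python
import uuid


class FakePaymentGateway:
    """Hypothetical stand-in for a payment API that honors idempotency keys."""

    def __init__(self):
        self._by_key = {}   # idempotency key -> charge already created
        self.charges = []   # every charge that actually hit the card

    def charge(self, amount_cents, idempotency_key):
        # A key seen before means this is a retry: return the earlier result.
        if idempotency_key in self._by_key:
            return self._by_key[idempotency_key]
        charge = {"id": uuid.uuid4().hex, "amount": amount_cents}
        self.charges.append(charge)
        self._by_key[idempotency_key] = charge
        return charge


gateway = FakePaymentGateway()
key = "order-3f9a"  # assumed to be derived from the order UID

first = gateway.charge(4999, idempotency_key=key)  # original invocation
retry = gateway.charge(4999, idempotency_key=key)  # retried invocation
```

Using the order UID as the key is a natural fit here, because a retried invocation will send the same key without any extra bookkeeping.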
Usually, if the operation takes place in the realm of your stack, it will be fully in your hands to meet idempotency requirements. The unique identifier principle explained above will usually be enough. But if you’re relying on third-party APIs, it might be tricky to ensure idempotency and you might need help from the other party to accomplish this goal, in case this kind of operation isn’t supported out of the box. If you can’t get the third party to work with you, there’s always the possibility of running all operations on your end first, creating a separate process to check whether everything ran successfully, and only then interacting with the external API. This wouldn’t be the ideal implementation but could be as good as one can get in some circumstances.
Managing Lambda retries like a PRO
Dashbird automatically identifies when an invocation is actually a retry from a previous execution. It links the first execution to subsequent retries and you can easily navigate them, as well as browse individual logs, all in one place. This makes it a lot easier to understand why your functions are failing and whether your idempotency strategy is working properly. You can try Dashbird for free by signing up here (no credit card required).
Full disclosure: I work as a serverless advocate for Dashbird.