How to mitigate the automatic weekly RDS instance restarts

You want to automatically stop an RDS instance again after it has been restarted for "exceeding the maximum allowed time being stopped". This lets developers keep test setups quickly accessible at reasonable cost, without having to wait for the potentially lengthy restoration of new instances from large snapshots.

Concepts

TBD

Implementation

The solution provisions three conceptually independent components via an embedded stack for convenience (they can also be deployed independently for advanced use cases):

  1. A Lambda function that is indirectly subscribed to RDS event notifications of type 'db-instance' via an SNS topic. The function ingests all these events as custom events into CloudWatch Events to allow using its convenient and unified rules engine instead of custom code.

    • AWS is in the process of migrating all event processing to CloudWatch and will also provide native CloudWatch events for RDS at some point, which will render this component obsolete, i.e. it only provides a temporary workaround.
      Remove component in favor of the recently released RDS CloudWatch Events @Steffen Opel [Utoolity]

  2. A CloudWatch Events rule that matches only the RDS 'db-instance' events with the message 'DB instance is being started due to it exceeding the maximum allowed time being stopped'.

  3. A Step Functions state machine that is triggered by the matched CloudWatch event. The state machine waits a configurable time and then stops the RDS instance via another Lambda function (the default wait is 48 minutes, i.e. a bit shorter than the instance hour that has already been paid for).

    • Ideally this would stop the instance the moment it has been fully started, but tracking the instance state across several events would be more complex, so this is done the easy way for starters.
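
The three components above can be sketched in Python roughly as follows. This is a minimal illustration and not the code from the template: the SNS payload keys ("Event Source", "Event Message"), the custom source name custom.rds, the detail type and the placeholder Lambda ARN are all assumptions made for the example.

```python
import json

# Message text quoted in the description of component 2 above.
RESTART_MESSAGE = (
    "DB instance is being started due to it exceeding "
    "the maximum allowed time being stopped"
)

# Hypothetical names chosen for this sketch, not taken from the template.
CUSTOM_SOURCE = "custom.rds"
DETAIL_TYPE = "RDS DB Instance Event"


def to_custom_event(sns_record):
    """Component 1: turn one SNS record into a PutEvents entry."""
    # Assumption: the SNS message body is the RDS event as a JSON object.
    message = json.loads(sns_record["Sns"]["Message"])
    return {
        "Source": CUSTOM_SOURCE,
        "DetailType": DETAIL_TYPE,
        "Detail": json.dumps(message),
    }


def handler(event, context):
    """Lambda entry point: forward every SNS-delivered RDS event."""
    import boto3  # imported lazily so the helpers run without the AWS SDK

    entries = [to_custom_event(record) for record in event["Records"]]
    return boto3.client("events").put_events(Entries=entries)


# Component 2: rule pattern matching only the automatic restart message.
# The detail keys are assumptions about the RDS notification payload, and
# the match is exact, so trailing punctuation would have to match as well.
EVENT_PATTERN = {
    "source": [CUSTOM_SOURCE],
    "detail": {
        "Event Source": ["db-instance"],
        "Event Message": [RESTART_MESSAGE],
    },
}

# Component 3: wait, then call the stopping Lambda function.
WAIT_SECONDS = 48 * 60  # default wait of 48 minutes

STATE_MACHINE = {
    "StartAt": "Wait",
    "States": {
        "Wait": {"Type": "Wait", "Seconds": WAIT_SECONDS, "Next": "StopInstance"},
        "StopInstance": {
            "Type": "Task",
            # Placeholder ARN, to be replaced with the real function ARN.
            "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:HYPOTHETICAL-stop-function",
            "End": True,
        },
    },
}
```

Keeping the conversion in a plain function means it can be exercised locally with a hand-written SNS record before anything is deployed.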

Notes

There are a couple of things worth mentioning:

  • The Lambda functions are not exactly robust yet, i.e. they seem to work fine for the 'happy path' and do not explode on error, but proper logging and exception handling looks different.

  • Turns out there can be a surprisingly long delay (up to several minutes) between RDS events showing up in the console and being emitted as SNS messages. This is not a problem for the use case, of course, but worth keeping in mind when debugging the solution.

  • Turns out only DB instances that are provisioned in a single Availability Zone (i.e. not 'Multi-AZ') can be stopped.

Verify solution also works with the recently released RDS stop capability for Multi-AZ instances – @Steffen Opel [Utoolity]
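
As a starting point for the more robust handling mentioned in the notes, the stopping function could check the instance state first and log what it does. This is a hedged sketch rather than the function shipped in the template: the helper name stop_if_running and the assumption that the instance identifier arrives as event["detail"]["Source ID"] are both hypothetical.

```python
import logging

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)


def stop_if_running(rds_client, instance_id):
    """Stop the instance only if it is 'available'; return the action taken."""
    response = rds_client.describe_db_instances(DBInstanceIdentifier=instance_id)
    status = response["DBInstances"][0]["DBInstanceStatus"]
    if status != "available":
        # E.g. still 'starting', or already stopped again by someone else.
        logger.info("Skipping %s: status is %r", instance_id, status)
        return "skipped"
    rds_client.stop_db_instance(DBInstanceIdentifier=instance_id)
    logger.info("Stop requested for %s", instance_id)
    return "stopped"


def handler(event, context):
    """Lambda entry point invoked by the state machine task."""
    import boto3  # imported lazily so stop_if_running can be tested offline

    # Assumption: the state machine passes the matched event through unchanged.
    instance_id = event["detail"]["Source ID"]
    try:
        return stop_if_running(boto3.client("rds"), instance_id)
    except Exception:
        # Log with traceback and re-raise so the execution is marked as failed.
        logger.exception("Failed to stop %s", instance_id)
        raise
```

Passing the client in as a parameter allows the logic to be tested with a stub object instead of a real RDS instance.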

Step-by-step guide

Provision the rds-automatic-restart-mitigation.yaml CloudFormation template in the desired region(s):

  •  Conceptually this should be a StackSet with stack instances in all regions where you want to use RDS – refer to How to provision a CloudFormation StackSet for details.

  •  Fetch a coffee after initiating the stack creation: wiring the RDS events can take surprisingly long, apparently up to ~8 minutes.
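
Until a StackSet is in place, provisioning the template in several regions could be scripted along these lines. This is only a sketch under assumptions: the stack name, the CAPABILITY_IAM capability and the helper names are hypothetical, and a template that embeds other stacks may need additional settings not shown here.

```python
def create_stack_kwargs(template_body, stack_name="rds-automatic-restart-mitigation"):
    """Build the arguments for CloudFormation create_stack."""
    return {
        "StackName": stack_name,  # hypothetical name, derived from the file name
        "TemplateBody": template_body,
        # Assumption: the template creates IAM roles for its Lambda functions.
        "Capabilities": ["CAPABILITY_IAM"],
    }


def provision(regions, template_body, client_factory):
    """Create the stack in each region and return the stack IDs by region."""
    stack_ids = {}
    for region in regions:
        client = client_factory(region)
        response = client.create_stack(**create_stack_kwargs(template_body))
        stack_ids[region] = response["StackId"]
    return stack_ids


# Example usage (requires boto3 and AWS credentials):
#
#   import boto3
#   with open("rds-automatic-restart-mitigation.yaml") as f:
#       body = f.read()
#   provision(
#       ["eu-west-1", "us-east-1"],
#       body,
#       lambda region: boto3.client("cloudformation", region_name=region),
#   )
```

The call returns as soon as stack creation has been initiated, so the stack status still needs to be checked afterwards.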

  1. TBD