💾 Archived View for capsule.adrianhesketh.com › 2022 › 03 › 02 › aws-go-cdk-lambda-error-alerting captured on 2022-04-28 at 17:21:36. Gemini links have been rewritten to link to archived content
-=-=-=-=-=-=-
In my last post [1], I covered setting up dead letter queues to alert on message handling failures in Lambda.
In this post, I'll cover two other areas: structured logging, and alerting when errors are logged.
While this post is focused on alerting on errors in logs, I'm also interested in logging things that aren't errors.
Sometimes people will ask "Is everything working?", or "Is everything alright with the system?", or "I've got a complaint about order 1234, what happened?".
I don't want my answer to be "No errors were logged, so it's probably OK." I want to be able to give a timeline of what happened during a set of interactions.
To do this, it's critical to log what happened successfully, as well as what failed in your code, so that you can answer the question with real information.
I like to see successful operations logged, not just failures.
I often find that comments in code that explain what's happening should really be a log message instead. You only get to look at comments when you're reading the code, but log messages are what you read when you're under pressure to solve a production problem quickly.
Use structured JSON logging.
CloudWatch Logs, CloudWatch Metrics and CloudWatch Log Insights all support log filtering and extraction from JSON logs, so using JSON logging makes working with log data much simpler.
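For example, once log entries are JSON, a CloudWatch Logs Insights query can filter and project on fields directly. A sketch of such a query, assuming entries shaped like the zap output shown later in this post:

```
fields @timestamp, msg, url
| filter level = "error"
| sort @timestamp desc
| limit 20
```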
The key change is to stop trying to build up a string containing data as your log entry. You should log the name of the "event" message, along with fields containing key/value pairs of associated data.
In Go, I use Uber's `zap` package [2] to do this.
```go
package main

import (
    "go.uber.org/zap"
)

func main() {
    log, err := zap.NewProduction()
    if err != nil {
        panic("could not create logger: " + err.Error())
    }
    log.Error("fetch failed",
        zap.String("url", "https://example.com"),
        zap.Int("attempt", 3),
    )
}
```
The example code outputs a single line of JSON to standard output:
{"level":"error","ts":1257894000,"caller":"sandbox3700361711/prog.go:12","msg":"fetch failed","url":"https://example.com","attempt":3,"stacktrace":"..."}
While logging errors is useful once you've been alerted to an issue, you also need to be alerted to the fact that the errors are happening.
Looking at the logs every day is one approach, but I work on systems that are being updated constantly with new features. Waiting until tomorrow to check the logs would result in the loss of large quantities of business.
Like other repetitive tasks, it needs to be automated. We want errors being written to the logs to result in an alert.
CloudWatch has a feature called metric filters [3] that can extract data from JSON logs and turn them into metrics.
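For JSON logs, the filter pattern syntax selects on fields. For example, a pattern that matches entries whose `level` field is `error` in either case looks like this:

```
{ ($.level = "error") || ($.level = "ERROR") }
```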
Once CloudWatch has a metric, an alarm can be set against it, which can trigger an alert to notify you:
Log → Metric → Alarm → Alert
Using CDK, I can create a Go Lambda function, and attach everything required to alert me to any errors using the `addErrorsLoggedAlarm` helper function.
```go
f := awslambdago.NewGoFunction(stack, jsii.String("Handler"), &awslambdago.GoFunctionProps{
    Runtime:    awslambda.Runtime_GO_1_X(),
    Entry:      jsii.String("./http"),
    Bundling:   bundlingOptions,
    MemorySize: jsii.Number(1024),
    Timeout:    awscdk.Duration_Millis(jsii.Number(15000)),
    Tracing:    awslambda.Tracing_ACTIVE,
    Environment: &map[string]*string{
        "AWS_XRAY_CONTEXT_MISSING": jsii.String("IGNORE_ERROR"),
    },
})
addErrorsLoggedAlarm(stack, jsii.String("HandlerErrorsLogged"), f, 1, alarmTopic)
```
The helper function creates the metric filter, sets an alarm on the metric, and configures the alarm to notify the SNS topic if it's triggered.
```go
func addErrorsLoggedAlarm(stack awscdk.Stack, id *string, f awslambdago.GoFunction, errorsIn5MinuteWindow int, alarmTopic awssns.ITopic) {
    metricNamespace := f.Stack().StackName()
    awslogs.NewMetricFilter(stack, jsii.String(*id+"_MF"), &awslogs.MetricFilterProps{
        LogGroup:        f.LogGroup(),
        MetricNamespace: metricNamespace,
        MetricName:      jsii.String("errorsLogged"),
        FilterPattern: awslogs.FilterPattern_Any(
            awslogs.FilterPattern_StringValue(jsii.String("$.level"), jsii.String("="), jsii.String("error")),
            awslogs.FilterPattern_StringValue(jsii.String("$.level"), jsii.String("="), jsii.String("ERROR")),
        ),
        MetricValue: jsii.String("1"),
    })
    errorsAlarm := awscloudwatch.NewAlarm(stack, jsii.String(*id+"_Alarm"), &awscloudwatch.AlarmProps{
        AlarmDescription: jsii.String("Error logged by service."),
        AlarmName:        jsii.String(*id + "_ErrorsLoggedAlarm"),
        Metric: awscloudwatch.NewMetric(&awscloudwatch.MetricProps{
            MetricName: jsii.String("errorsLogged"),
            Namespace:  metricNamespace,
            Statistic:  jsii.String("sum"),
            Period:     awscdk.Duration_Minutes(jsii.Number(5)),
        }),
        EvaluationPeriods:  jsii.Number(1),
        DatapointsToAlarm:  jsii.Number(1),
        ComparisonOperator: awscloudwatch.ComparisonOperator_GREATER_THAN_THRESHOLD,
        Threshold:          jsii.Number(float64(errorsIn5MinuteWindow)),
        ActionsEnabled:     jsii.Bool(true),
        TreatMissingData:   awscloudwatch.TreatMissingData_NOT_BREACHING,
    })
    errorsAlarm.AddAlarmAction(awscloudwatchactions.NewSnsAction(alarmTopic))
}
```
The fact that an error was logged is usually enough to work out what went wrong, but for web applications it can also be worthwhile to monitor the rate of HTTP 500 errors, especially if you don't control the application code.
I've made a similar helper function to handle that too.
```go
fi := awsapigatewayv2integrations.NewHttpLambdaIntegration(jsii.String("DefaultHandlerIntegration"), f, &awsapigatewayv2integrations.HttpLambdaIntegrationProps{})
endpoint := awsapigatewayv2.NewHttpApi(stack, jsii.String("ApiGatewayV2API"), &awsapigatewayv2.HttpApiProps{
    DefaultIntegration: fi,
})
// Alert on > 40% server errors within a 5 minute window.
addAPIGatewayErrorsAlarm(stack, jsii.String("HTTPApi500Errors"), endpoint, 0.4, alarmTopic)
```
This helper function is simpler than the metric filter version, because API Gateway provides a built-in `ServerError` metric to use.
The `endpoint.MetricServerError` function call in the helper function gets that particular metric.
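Why the "Average" statistic with a threshold of 0.4? The `ServerError` metric records 1 for each request that failed with a 5xx response and 0 otherwise, so averaging it over the period yields the fraction of failed requests. A quick sketch of the arithmetic (my illustration, not CDK code):

```go
package main

import "fmt"

// errorRate mimics what CloudWatch's "Average" statistic does to a metric
// that records 1 for each request returning a 5xx response and 0 otherwise:
// the average of the 0/1 values is the fraction of failed requests.
func errorRate(serverError []float64) float64 {
    if len(serverError) == 0 {
        return 0
    }
    var sum float64
    for _, v := range serverError {
        sum += v
    }
    return sum / float64(len(serverError))
}

func main() {
    // 2 failures out of 5 requests in the window is a 0.4 (40%) error rate.
    fmt.Println(errorRate([]float64{0, 1, 0, 1, 0}))
}
```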
```go
func addAPIGatewayErrorsAlarm(stack awscdk.Stack, id *string, endpoint awsapigatewayv2.HttpApi, ratePercentage float64, alarmTopic awssns.ITopic) {
    errorsAlarm := awscloudwatch.NewAlarm(stack, id, &awscloudwatch.AlarmProps{
        AlarmDescription: jsii.String("API Gateway 500 errors."),
        AlarmName:        id,
        Metric: endpoint.MetricServerError(&awscloudwatch.MetricOptions{
            Statistic: jsii.String("Average"),
            Period:    awscdk.Duration_Minutes(jsii.Number(5)),
        }),
        EvaluationPeriods:  jsii.Number(1),
        DatapointsToAlarm:  jsii.Number(1),
        ComparisonOperator: awscloudwatch.ComparisonOperator_GREATER_THAN_THRESHOLD,
        Threshold:          &ratePercentage,
        ActionsEnabled:     jsii.Bool(true),
        TreatMissingData:   awscloudwatch.TreatMissingData_NOT_BREACHING,
    })
    errorsAlarm.AddAlarmAction(awscloudwatchactions.NewSnsAction(alarmTopic))
}
```
For Lambda functions driven by SQS, EventBridge, Kinesis, S3 events etc., the Lambda function signature returns an error rather than returning an HTTP 500 response.
```go
// log is assumed to be a package-level *zap.Logger.
func Handler(ctx context.Context, event events.CloudWatchEvent) (err error) {
    log.Info("event received",
        zap.String("source", event.Source),
        zap.String("type", event.DetailType),
    )
    return nil
}
```
With a handler like this, as well as the dead letter configuration from the previous post, you might want to add an error alarm to alert if the rate of errors increases.
```go
onEventHandler := awslambdago.NewGoFunction(stack, jsii.String("OnEventHandler"), &awslambdago.GoFunctionProps{
    MemorySize: jsii.Number(1024),
    Timeout:    awscdk.Duration_Seconds(jsii.Number(60)),
    Entry:      jsii.String("./onevent"),
    Bundling:   bundlingOptions,
    Runtime:    awslambda.Runtime_GO_1_X(),
    // Dead letter handling configuration.
    RetryAttempts:          jsii.Number(2),
    DeadLetterQueue:        onEventHandlerDLQ,
    DeadLetterQueueEnabled: jsii.Bool(true),
})
// Scrape JSON logs for errors and alert if any are found.
addErrorsLoggedAlarm(stack, jsii.String("OnEventHandlerErrorsLogged"), onEventHandler, 1, alarmTopic)
// Alert if over 40% of requests within a 5 minute window throw errors.
addLambdaErrorsAlarm(stack, jsii.String("OnEventHandlerLambdaErrors"), onEventHandler, 0.4, alarmTopic)
```
The `addLambdaErrorsAlarm` helper takes an error rate expressed as a fraction (0.4 means 40%).
```go
func addLambdaErrorsAlarm(stack awscdk.Stack, id *string, f awslambdago.GoFunction, ratePercentage float64, alarmTopic awssns.ITopic) {
    errorsAlarm := awscloudwatch.NewAlarm(stack, id, &awscloudwatch.AlarmProps{
        AlarmDescription: jsii.String("Error logged by service."),
        AlarmName:        id,
        Metric: f.MetricErrors(&awscloudwatch.MetricOptions{
            Statistic: jsii.String("Average"),
            Period:    awscdk.Duration_Minutes(jsii.Number(5)),
        }),
        EvaluationPeriods:  jsii.Number(1),
        DatapointsToAlarm:  jsii.Number(1),
        ComparisonOperator: awscloudwatch.ComparisonOperator_GREATER_THAN_THRESHOLD,
        Threshold:          &ratePercentage,
        ActionsEnabled:     jsii.Bool(true),
        TreatMissingData:   awscloudwatch.TreatMissingData_NOT_BREACHING,
    })
    errorsAlarm.AddAlarmAction(awscloudwatchactions.NewSnsAction(alarmTopic))
}
```
Logging only gets you so far: if you don't know that the errors are happening, you're not going to be able to do anything about them.
On my latest project, I'm using Atlassian OpsGenie to handle team rotas and escalations [4], since we were already using JIRA (not my choice).
OpsGenie has a mobile app, and will also email and phone us when triggered.
It's also possible to configure escalation rules based on date/time, priority, and other things. I use this to make sure that only P1 / P2 issues result in alerts outside of the team's normal working hours, and to filter out noisy alarms until we fix them properly at the infrastructure side.
OpsGenie has an SNS integration (a somewhat fancy name for receiving a webhook), so it's pretty easy to connect into the alarm system.
A complete, working example is available at [5].
Setting up an encrypted AWS Lambda dead letter queue with Go CDK