How This AWS Architecture Keeps Notifying You Until You Fix It - Repeated Alarm

By default, Amazon CloudWatch Alarms send a notification only when the alarm changes state — for example, from OK to ALARM. If the alarm remains in the ALARM state, no additional messages are sent. This can be a problem in production environments or for support teams who rely on timely updates about ongoing issues.

If the initial alert is missed, there’s no follow-up. The system stays silent even as the problem continues. This gap can lead to delays in response, longer downtime, and missed service-level targets — all because the alarm stopped talking after the first alert.

To solve this, AWS published a CDK-based sample solution that lets alarms send repeated notifications while they remain in the ALARM state. In this post, you'll learn how to set it up in your own environment, step by step.

We’ll walk through how to:

  • Deploy the solution using the AWS CDK
  • Tag only the alarms you want to track
  • Set a custom interval for repeated notifications — whether that’s every 1 minute, 5 minutes, 10 minutes, or any interval that fits your monitoring strategy

This post also includes a full explanation of how the system works behind the scenes, a diagram that maps out the entire flow, and tips for customizing or troubleshooting the setup.

Solution Architecture

This solution is deployed as an AWS CDK application. It creates the following components in your AWS account:

  • CloudWatch Alarms – These monitor metrics and are tagged if you want them to send repeated alerts.
  • SNS Topic – Used to deliver alarm notifications (email, Slack, etc.).
  • EventBridge Rule – Captures alarm state change events when the alarm enters ALARM.
  • Step Function – Orchestrates the loop that periodically checks alarm state.
  • Lambda Function – Checks the alarm's current status and sends another alert if the alarm is still in ALARM.
  • IAM Roles – Grant permissions for Lambda, Step Functions, and EventBridge.
  • (Optional) A CloudWatch Resource Group is created for all tagged alarms for easier monitoring.

All of these components work together to monitor the alarms and send repeated notifications as long as the problem persists.
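To make the EventBridge piece concrete, here is a minimal sketch of an event pattern that would match alarms entering ALARM, together with a tiny matcher that shows what it selects. The pattern is an assumption based on the standard "CloudWatch Alarm State Change" event shape and is not copied from the sample's CDK code. The `matches` helper is hypothetical and handles only the exact-value matching used here. The account ID and alarm name are placeholders.

```python
# Assumed pattern: CloudWatch alarm state changes whose new state is ALARM.
ALARM_PATTERN = {
    "source": ["aws.cloudwatch"],
    "detail-type": ["CloudWatch Alarm State Change"],
    "detail": {"state": {"value": ["ALARM"]}},
}


def matches(pattern: dict, event: dict) -> bool:
    """Simplified matcher: each pattern key must exist in the event and hold
    one of the listed values; nested dicts are checked recursively."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


# Trimmed example event with placeholder identifiers.
sample_event = {
    "source": "aws.cloudwatch",
    "detail-type": "CloudWatch Alarm State Change",
    "resources": ["arn:aws:cloudwatch:us-east-1:123456789012:alarm:HighQueueDepth"],
    "detail": {
        "alarmName": "HighQueueDepth",
        "state": {"value": "ALARM"},
        "previousState": {"value": "OK"},
    },
}

# Same event, but for an alarm that has just recovered.
recovered_event = {
    **sample_event,
    "detail": {**sample_event["detail"], "state": {"value": "OK"}},
}

print(matches(ALARM_PATTERN, sample_event))     # True
print(matches(ALARM_PATTERN, recovered_event))  # False
```

Matching only on the ALARM state means that a recovery event never starts a new notification loop.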


How It Works

Here's how the full system behaves once deployed:

  1. A CloudWatch alarm is triggered and enters the ALARM state.
  2. The alarm sends a one-time SNS notification to its associated topic.
  3. EventBridge picks up the state change event and invokes a Step Function.
  4. The Step Function waits for a configured period (e.g., 300 seconds).
  5. It invokes a Lambda function, which:
    • Checks if the alarm has the tag RepeatedAlarm:true.
    • Calls DescribeAlarms to verify if the alarm is still in the ALARM state.
    • If both checks pass, publishes another notification to the SNS topics configured as the alarm's actions.
  6. A Choice step decides:
    • If still in ALARM, the process loops back to the Wait step.
    • If not, the Step Function ends and notifications stop.

The repeated notifications stop if:

  • The alarm changes state (e.g., from ALARM to OK)
  • The alarm is deleted
  • The RepeatedAlarm:true tag is removed
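The wait, check, and choice cycle described above can be sketched as a plain Python loop. This is a local simulation under stated assumptions, not the deployed state machine. `run_repeat_loop` and its callback parameters are hypothetical names, and the alarm state, tag lookup, and notification are injected as functions so nothing here touches AWS.

```python
from typing import Callable, List


def run_repeat_loop(
    get_state: Callable[[], str],
    is_tagged: Callable[[], bool],
    notify: Callable[[], None],
    wait: Callable[[int], None],
    period_seconds: int = 300,
    max_iterations: int = 100,
) -> int:
    """Mimic the Wait -> Lambda check -> Choice cycle.

    Returns the number of repeated notifications sent.
    """
    sent = 0
    for _ in range(max_iterations):
        wait(period_seconds)          # Wait step
        if not is_tagged():           # tag removed: stop repeating
            break
        if get_state() != "ALARM":    # alarm recovered: stop repeating
            break
        notify()                      # still in ALARM: notify again
        sent += 1
    return sent


# Simulated alarm that stays in ALARM for two checks, then recovers.
states = iter(["ALARM", "ALARM", "OK"])
waits: List[int] = []
count = run_repeat_loop(
    get_state=lambda: next(states),
    is_tagged=lambda: True,
    notify=lambda: None,
    wait=waits.append,
    period_seconds=300,
)
print(count, waits)  # 2 [300, 300, 300]
```

The third wait still happens because the loop only learns that the alarm has recovered after it wakes up and checks.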

Step-by-Step Setup: Repeated Notifications for CloudWatch Alarms

This section provides a complete walkthrough of how to deploy the solution using the AWS Cloud Development Kit (CDK). You’ll clone the project, build it, deploy it, apply tags to alarms, and verify that repeated notifications are working.

Prerequisites

Before you begin, make sure you have the following:

  • An AWS account with permissions to manage CloudWatch, Lambda, Step Functions, EventBridge, and IAM
  • AWS CLI configured with appropriate credentials
  • Node.js version 10.13 or later
  • AWS CDK installed (npm install -g aws-cdk)
  • Docker running (required during the build process)

Step 1: Clone the Repository

Clone the official AWS sample project for repeated notifications:

git clone https://github.com/aws-samples/amazon-cloudwatch-alarms-repeated-notification-cdk.git
cd amazon-cloudwatch-alarms-repeated-notification-cdk

Step 2: Install Dependencies

Install the necessary Node.js modules for the CDK project.

npm install

Step 3: Build the Project

Compile the TypeScript source code into JavaScript.

npm run build

Step 4: Bootstrap the CDK Environment

Prepare your AWS environment for CDK deployments. This step creates required resources like an S3 bucket for storing deployment artifacts.

cdk bootstrap

Step 5: Deploy the CDK Application

Deploy the entire solution to your AWS account, including the Step Function, Lambda, EventBridge rule, and IAM roles.

cdk deploy \
  --parameters RepeatedNotificationPeriod=300 \
  --parameters TagForRepeatedNotification=RepeatedAlarm:true \
  --parameters RequireResourceGroup=false

Explanation of parameters:

  • RepeatedNotificationPeriod: Interval in seconds between notifications (e.g., 300 = 5 minutes)
  • TagForRepeatedNotification: Tag key and value to identify alarms that should send repeated notifications
  • RequireResourceGroup: Whether to create a CloudWatch resource group (optional)
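As a quick sanity check before deploying, the two main parameters can be validated locally. The sketch below assumes the tag parameter always has the form Key:Value and that the period is a positive number of seconds. `parse_tag_parameter` and `notifications_per_hour` are hypothetical helpers and are not part of the sample project.

```python
from typing import Tuple


def parse_tag_parameter(raw: str) -> Tuple[str, str]:
    """Split a 'Key:Value' string on the first colon."""
    key, sep, value = raw.partition(":")
    if not sep or not key or not value:
        raise ValueError(f"expected 'Key:Value', got {raw!r}")
    return key, value


def notifications_per_hour(period_seconds: int) -> float:
    """How many repeated notifications one stuck alarm produces per hour."""
    if period_seconds <= 0:
        raise ValueError("period must be a positive number of seconds")
    return 3600 / period_seconds


print(parse_tag_parameter("RepeatedAlarm:true"))  # ('RepeatedAlarm', 'true')
print(notifications_per_hour(300))                # 12.0
```

Working out the hourly rate first helps when choosing an interval: a 60-second period means 60 messages per hour for every alarm that stays in ALARM.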

After deployment, you can verify the setup by visiting the AWS Console. Navigate to Step Functions to inspect the state machine and its workflow.

You can also open the deployed Lambda function to review or modify its environment variables and logic. This is useful if you want to extend the functionality, for example by sending alerts to a different SNS topic, filtering alarms by additional tags, including extra alarm metadata in messages, or integrating with an incident management system or other third-party tools.

We’ll revisit this Lambda function later to make specific custom changes.


Step 6: Tag the Alarms You Want to Monitor

Only alarms that have the specified tag will be checked by the system. Apply the tag using the following command:

aws cloudwatch tag-resource \
  --resource-arn arn:aws:cloudwatch:<region>:<account_id>:alarm:<alarm_name> \
  --tags Key=RepeatedAlarm,Value=true

Replace the placeholders with your values:

  • <region>: Your AWS region (e.g., us-east-1)
  • <account_id>: Your AWS account ID
  • <alarm_name>: The name of your CloudWatch Alarm
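If you prefer to tag alarms from a script, the same operation is available in boto3 as `tag_resource` on the CloudWatch client. The sketch below keeps the client as a parameter so the logic can be tried without AWS access. `build_alarm_arn`, `tag_alarm`, and `RecordingClient` are hypothetical names, the ARN assumes the standard `aws` partition, and the account ID and alarm name are placeholders.

```python
from typing import Any


def build_alarm_arn(region: str, account_id: str, alarm_name: str) -> str:
    """Build a CloudWatch alarm ARN (standard 'aws' partition assumed)."""
    return f"arn:aws:cloudwatch:{region}:{account_id}:alarm:{alarm_name}"


def tag_alarm(cloudwatch: Any, alarm_arn: str,
              key: str = "RepeatedAlarm", value: str = "true") -> None:
    """Apply the opt-in tag using the CloudWatch TagResource operation."""
    cloudwatch.tag_resource(
        ResourceARN=alarm_arn,
        Tags=[{"Key": key, "Value": value}],
    )


class RecordingClient:
    """Stand-in for a boto3 CloudWatch client that only records calls."""

    def __init__(self) -> None:
        self.calls = []

    def tag_resource(self, **kwargs) -> None:
        self.calls.append(kwargs)


client = RecordingClient()
arn = build_alarm_arn("us-east-1", "123456789012", "HighQueueDepth")
tag_alarm(client, arn)
print(client.calls[0]["ResourceARN"])
# arn:aws:cloudwatch:us-east-1:123456789012:alarm:HighQueueDepth
```

In a real script you would pass `boto3.client("cloudwatch")` in place of `RecordingClient()`.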

Step 7: Confirm Tagging (Optional)

Verify that the tag was applied successfully:

aws cloudwatch list-tags-for-resource \
  --resource-arn arn:aws:cloudwatch:<region>:<account_id>:alarm:<alarm_name>

You should see output similar to:

{
  "Tags": [
    {
      "Key": "RepeatedAlarm",
      "Value": "true"
    }
  ]
}

Step 8: Trigger and Observe Repeated Notifications

Generate a test alarm that enters the ALARM state and remains active. You should receive:

  • One initial notification (standard CloudWatch behavior)
  • Follow-up notifications based on your configured interval (e.g., every 300 seconds)
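One way to produce a test alarm without breaching a real threshold is the CloudWatch `set_alarm_state` call, sketched below with the client passed in as a parameter. `force_alarm_state` and `StubCloudWatch` are hypothetical names. A forced state lasts only until the alarm's next evaluation, so this may produce at most one or two repeats. To watch the loop over a longer time, the underlying metric needs to stay in breach.

```python
from typing import Any


def force_alarm_state(cloudwatch: Any, alarm_name: str,
                      reason: str = "Testing repeated notifications") -> None:
    """Temporarily move an alarm into ALARM via the SetAlarmState operation."""
    cloudwatch.set_alarm_state(
        AlarmName=alarm_name,
        StateValue="ALARM",
        StateReason=reason,
    )


class StubCloudWatch:
    """Stand-in for a boto3 CloudWatch client that only records calls."""

    def __init__(self) -> None:
        self.calls = []

    def set_alarm_state(self, **kwargs) -> None:
        self.calls.append(kwargs)


stub = StubCloudWatch()
force_alarm_state(stub, "HighQueueDepth")  # placeholder alarm name
print(stub.calls[0]["StateValue"])  # ALARM
```

With real credentials, replace `StubCloudWatch()` with `boto3.client("cloudwatch")`.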

Step 9: Monitor Logs (Optional)

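To confirm the Lambda function is running and sending notifications, go to the CloudWatch Logs console and locate the function's log group. Lambda log groups are named /aws/lambda/<function-name>, so look for the one whose name begins with: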

/aws/lambda/RepeatedCloudWatchAlarm

Each invocation will show whether the alarm was still in ALARM and whether a message was sent.


Modifying the Lambda Function for Custom Behavior

After the solution is deployed, you can navigate to the AWS Lambda console and locate the generated function (usually named something like RepeatedCloudWatchAlarmSt-checkAlarmStatusLambda...). From there, you can directly update the logic to suit your requirements.

For example, here's a summary of the enhancements made in the Lambda function:

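  • Slack Mentions: Added SLACK_USER_ID as an environment variable so alert messages can mention a Slack user group. The code uses Slack's <!subteam^ID> group-mention syntax; mentioning an individual user would use <@ID> instead.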
  • Current Metric Value: Integrated get_metric_statistics using boto3 to fetch the current metric value and include it in the message.
  • Custom Message Format: Structured a clean JSON-formatted custom message, with context-specific fields like Queue name, Broker name, Threshold, and Current Value.
  • Custom SNS Topic Support: Used environment variables SEND_TO_CUSTOM_SNS and CUSTOM_SNS_TOPIC_ARN to conditionally send messages to an alternative topic.

Code Snippet Example (Custom Notification Block)

if os.getenv("SEND_TO_CUSTOM_SNS", "false").lower() == "true":
    SNS_CLIENT.publish(
        TopicArn=os.getenv("CUSTOM_SNS_TOPIC_ARN"),
        Subject=f"Custom Notification: {alarm_name}",
        Message=json.dumps({
            "AlarmName": alarm_name,
            "Region": session.region_name,
            "Metric": metric_name,
            "Queue": queue_name,
            "Broker": broker_name,
            "CurrentValue": current_value_str,
            "Threshold": alarm_details.get("Threshold"),
            "ActionRequired": f"<!subteam^{SLACK_USER_ID}> Check immediately."
        })
    )

This custom logic allows you to fine-tune alert delivery, include dynamic context in your messages, and integrate easily with messaging platforms like Slack.
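The snippet above hard-codes Slack's user-group mention form. If you need to mention either a group or an individual user, a small helper such as the following could build the right token. `slack_mention` is a hypothetical function and the IDs are placeholders. Whether the mention is actually rendered depends on how your SNS-to-Slack integration forwards the message text.

```python
def slack_mention(slack_id: str, is_group: bool = False) -> str:
    """Return Slack mention markup for a user group or an individual user."""
    if is_group:
        return f"<!subteam^{slack_id}>"  # user-group mention
    return f"<@{slack_id}>"              # individual user mention


print(slack_mention("S0123ABCD", is_group=True))  # <!subteam^S0123ABCD>
print(slack_mention("U0456EFGH"))                 # <@U0456EFGH>
```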

Here is the complete code:

import datetime
import json
import logging
import os
from typing import List

import boto3

# Set up logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)

# Initialize AWS clients
session = boto3.session.Session()
CW_CLIENT = session.client('cloudwatch')
SNS_CLIENT = session.client('sns')

SLACK_USER_ID = os.getenv("SLACK_USER_ID")
SNS_SUBJECT_LIMIT = 100  # SNS subjects must be shorter than 100 characters


def lambda_handler(event, context):
    """Lambda entrypoint for the CheckAlarmStatus Lambda function."""
    logger.info(event)
    # Default value; updated below once the alarm's state is known
    event.update({"currState": "null"})
    try:
        alarm_arn = event["resources"][0]
        alarm_name = event["detail"].get("alarmName")
        alarm_tags = CW_CLIENT.list_tags_for_resource(ResourceARN=alarm_arn)
        logger.info(alarm_tags)
        if check_if_repeated_alarm_enabled(alarm_tags.get("Tags")):
            alarm_response = CW_CLIENT.describe_alarms(
                AlarmNames=[alarm_name],
                AlarmTypes=["CompositeAlarm", "MetricAlarm"]
            )
            logger.info(alarm_response)
            if len(alarm_response.get("MetricAlarms")) > 0:
                alarm_details = alarm_response.get("MetricAlarms")[0]
            elif len(alarm_response.get("CompositeAlarms")) > 0:
                alarm_details = alarm_response.get("CompositeAlarms")[0]
            else:
                raise Exception("No alarms found.")
            # Round-trip through JSON so datetime fields become plain strings
            alarm_details = json.loads(json.dumps(alarm_details, default=datetime_converter))
            if alarm_details.get("StateValue") == "ALARM":
                if os.getenv("SEND_TO_CUSTOM_SNS", "false").lower() == "true":
                    send_to_custom_sns(alarm_name, alarm_details)
                else:
                    # Re-publish to every SNS topic configured as an alarm action
                    associated_alarm_actions = alarm_details.get("AlarmActions")
                    for action in associated_alarm_actions:
                        if action.startswith(os.getenv("ARN_PREFIX") + "sns"):
                            notification_subject = create_notification_subject(alarm_name)
                            SNS_CLIENT.publish(
                                TopicArn=action,
                                Subject=notification_subject,
                                Message=json.dumps(alarm_details)
                            )
                            logger.info("Published to %s", action)
            event["currState"] = alarm_details.get("StateValue")
    except Exception as e:
        logger.error(f"Error: {repr(e)}")
        raise
    return event


def send_to_custom_sns(alarm_name, alarm_details):
    """Send a custom notification to a custom SNS topic."""
    custom_topic_arn = os.getenv("CUSTOM_SNS_TOPIC_ARN")
    if not custom_topic_arn:
        logger.error("CUSTOM_SNS_TOPIC_ARN is not set.")
        raise ValueError("CUSTOM_SNS_TOPIC_ARN environment variable is missing.")
    namespace = alarm_details.get("Namespace")
    metric_name = alarm_details.get("MetricName")
    # Composite and metric-math alarms have no Dimensions key; fall back to an empty list
    dimensions = alarm_details.get("Dimensions") or []
    # Extract specific dimensions
    queue_name = get_dimension_value(dimensions, "Queue")
    broker_name = get_dimension_value(dimensions, "Broker")
    # Fetch current metric value
    current_value = get_current_metric_value(namespace, metric_name, dimensions)
    current_value_str = str(current_value) if current_value is not None else "N/A"
    # Construct custom message
    custom_message = json.dumps({
        "AlarmName": alarm_name,
        "AlarmDescription": f"\U0001F6A8 Alert! {alarm_name} has been in ALARM state for 30+ minutes.",
        "NewStateReason": f"Metric Name: {metric_name}\n"
                          f"Queue: {queue_name}\n"
                          f"Broker: {broker_name}\n"
                          f"Threshold: {alarm_details.get('Threshold')}\n"
                          f"Current Value: {current_value_str}\n"
                          f"Action Required: <!subteam^{SLACK_USER_ID}>",
        "Region": session.region_name,
        "NewStateValue": "ALARM",
        "OldStateValue": "OK"
    })
    SNS_CLIENT.publish(
        TopicArn=custom_topic_arn,
        # Truncate so long alarm names cannot push the subject past the SNS limit
        Subject=f"Custom Notification: {alarm_name}"[:SNS_SUBJECT_LIMIT - 1],
        Message=custom_message
    )
    logger.info("Custom notification sent to %s", custom_topic_arn)


def create_notification_subject(alarm_name):
    """Build the subject line, shortening the alarm name if it would exceed the SNS limit."""
    notification_subject = f"ALARM: \"{alarm_name}\" remains in ALARM state in {session.region_name}"
    if len(notification_subject) >= SNS_SUBJECT_LIMIT:
        # +4 leaves room for the "..." marker and keeps the result below the limit
        number_of_char_to_remove = len(notification_subject) - SNS_SUBJECT_LIMIT + 4
        notification_subject = f"ALARM: \"{alarm_name[:-number_of_char_to_remove]}...\" remains in ALARM state in {session.region_name}"
    return notification_subject


def datetime_converter(field):
    """json.dumps fallback that renders datetime values as strings."""
    if isinstance(field, datetime.datetime):
        return str(field)


def check_if_repeated_alarm_enabled(tags: List[dict], expected_tag="TagForRepeatedNotification"):
    """Return True if the alarm carries the Key:Value tag named in the environment variable."""
    # Split on the first colon only, so tag values may themselves contain colons
    tag_to_check = os.getenv(expected_tag).split(":", 1)
    key = tag_to_check[0]
    value = tag_to_check[1]
    for tag in tags or []:
        if tag.get("Key") == key and tag.get("Value") == value:
            return True
    return False


def get_current_metric_value(namespace, metric_name, dimensions):
    """Return the latest 5-minute average for the metric, or None if unavailable."""
    try:
        now = datetime.datetime.now(datetime.timezone.utc)
        start_time = now - datetime.timedelta(minutes=10)
        response = CW_CLIENT.get_metric_statistics(
            Namespace=namespace,
            MetricName=metric_name,
            Dimensions=dimensions,
            StartTime=start_time,
            EndTime=now,
            Period=300,  # 5 min
            Statistics=["Average"]
        )
        datapoints = response.get("Datapoints", [])
        if datapoints:
            latest_datapoint = sorted(datapoints, key=lambda x: x['Timestamp'])[-1]
            return round(latest_datapoint["Average"], 2)
        else:
            logger.warning("No datapoints available for metric.")
            return None
    except Exception as e:
        logger.error(f"Failed to get current metric value: {repr(e)}")
        return None


def get_dimension_value(dimensions, key):
    """Look up a dimension by name, returning "N/A" when it is absent."""
    for d in dimensions:
        if d["Name"] == key:
            return d["Value"]
    return "N/A"

Cost Overview

Setting up repeated notifications for CloudWatch alarms using Step Functions and Lambda is low-cost for small-scale use, but costs can increase based on the number of alarms and notification frequency.

Estimated Cost Breakdown (Per Alarm)

This assumes each notification loop involves 5 Step Function transitions.

Service-Based Pricing Summary

Costs are typically minimal for small setups, but teams with hundreds of alarms should monitor usage.
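For a rough feel of the numbers, the sketch below estimates the cost of one alarm that stays in ALARM for a given time, using the 5-transitions-per-loop assumption above. The unit prices are illustrative assumptions (Standard workflow transitions, Lambda request charges only, SNS publish requests) and should be checked against current AWS pricing for your region. Lambda duration charges, delivery fees, and free-tier allowances are ignored. `Prices` and `estimate_cost` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prices:
    """Illustrative per-unit prices in USD; verify against current AWS pricing."""
    step_functions_per_transition: float = 0.025 / 1_000  # Standard workflow (assumed)
    lambda_per_request: float = 0.20 / 1_000_000          # request charge only
    sns_per_publish: float = 0.50 / 1_000_000             # publish request only


def estimate_cost(hours_in_alarm: float, period_seconds: int,
                  transitions_per_loop: int = 5,
                  prices: Prices = Prices()) -> dict:
    """Estimate the cost of the repeat loop for a single alarm."""
    loops = int(hours_in_alarm * 3600 // period_seconds)
    transitions = loops * transitions_per_loop
    total = (transitions * prices.step_functions_per_transition
             + loops * prices.lambda_per_request
             + loops * prices.sns_per_publish)
    return {"loops": loops, "transitions": transitions, "total_usd": total}


# One alarm stuck in ALARM for 24 hours with a 300-second interval.
estimate = estimate_cost(hours_in_alarm=24, period_seconds=300)
print(estimate["loops"], estimate["transitions"])  # 288 1440
print(f"${estimate['total_usd']:.4f}")             # $0.0362
```

Under these assumptions, state transitions dominate the total, so the notification interval is the main lever on cost.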

Conclusion

With this setup, you've created a solution that sends repeated notifications for alarms that remain in the ALARM state, helping avoid missed alerts during ongoing issues.

This approach gives you flexibility through tagging, CDK-based deployment, and the option to customize notification content and behavior.

If you're building or scaling observability, monitoring, or cloud infrastructure, our team at KubeNine can help. We offer hands-on support across AWS, Kubernetes, alerting workflows, and DevOps pipelines. Reach out to us at contact@kubenine.com.