Heartbeat Monitoring

Quick Summary — TL;DR

Heartbeat monitoring works by expecting a ping from your job after each run — if the ping stops arriving, the monitor alerts you that the job has gone silent.
It is the inverse of uptime monitoring: instead of a service pinging your app, your app pings the service. Silence means failure.
Grace periods absorb normal timing variance so a job that runs a few minutes late does not trigger a false alarm.

Heartbeat monitoring (also called a dead man's switch) is a monitoring pattern where your job sends a signal — a "heartbeat" — to an external endpoint after each successful run. The monitoring service tracks these heartbeats. If an expected heartbeat does not arrive within a defined window, the service assumes the job has failed or stopped running and sends an alert.

How heartbeat monitoring works

The flow is simple:

Configure — create a heartbeat monitor with an expected interval (e.g., "expect a ping every hour")
Instrument — add a single HTTP request to the end of your job that pings the monitor's URL
Monitor — the service records each ping and starts a countdown to the next expected one
Alert — if the countdown expires without a new ping, the monitor fires an alert (email, Slack, webhook)
Recover — when the next heartbeat arrives, the monitor marks the job as healthy and optionally sends a recovery notification
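
The five steps above can be sketched in a few lines of Python. This is a minimal in-memory model, not the API of any particular monitoring service: the class name HeartbeatMonitor and its methods record_ping and is_overdue are hypothetical, and the sketch assumes the monitor stays quiet until it has received a first ping.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class HeartbeatMonitor:
    """Hypothetical in-memory monitor; not any real service's API."""
    interval: timedelta                    # expected time between pings
    last_ping: Optional[datetime] = None   # None until the first ping arrives

    def record_ping(self, now: datetime) -> None:
        # Monitor step: store the ping time, which restarts the countdown.
        self.last_ping = now

    def is_overdue(self, now: datetime) -> bool:
        # Alert step: True once the countdown has expired without a new ping.
        if self.last_ping is None:
            return False  # assumption: no alert before the first ping
        return now - self.last_ping > self.interval


monitor = HeartbeatMonitor(interval=timedelta(hours=1))    # Configure
start = datetime(2024, 1, 1, 10, 0)
monitor.record_ping(start)                                 # Instrument: the job checks in
on_time = monitor.is_overdue(start + timedelta(minutes=30))
silent = monitor.is_overdue(start + timedelta(minutes=61))
monitor.record_ping(start + timedelta(minutes=65))         # Recover: a new ping arrives
recovered = not monitor.is_overdue(start + timedelta(minutes=66))
```

Note that the monitor never inspects what the job did. It compares timestamps and nothing else, which is why the same model fits backups, pipelines, and batch tasks alike.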

The key insight is that the monitor does not know or care what your job does. It only knows whether the job checked in on time. This makes heartbeat monitoring applicable to anything that runs on a schedule: cron jobs, batch processing, database backups, data pipelines, queue workers.

Heartbeat monitoring vs uptime monitoring

The two patterns work in opposite directions:

Feature	Uptime monitoring	Heartbeat monitoring
Direction	Monitor pings your service	Your job pings the monitor
Detects	Service is down or slow	Job did not run or failed silently
Best for	Web servers, APIs, public endpoints	Cron jobs, batch tasks, background processes
Failure signal	Error response or timeout	Absence of a signal (silence)

Uptime monitoring tells you "your API is down." Heartbeat monitoring tells you "your nightly backup did not run." They solve different problems and are complementary.

Use cases

Cron jobs — a cron-scheduled task that fails silently (wrong PATH, permission error, crashed script) will never ping the monitor, triggering an alert
Database backups — if your backup script does not complete, you want to know before you need to restore
Data pipelines — without heartbeat monitoring, ETL jobs that stall or skip runs may go unnoticed for days
Queue workers — a worker process that crashes and is not restarted stops sending heartbeats, surfacing the failure
Batch processing — nightly invoice generation, report builds, or cache warming tasks that must complete on schedule

Grace periods

Jobs do not always run at exactly the same time. A task scheduled every hour might finish at 10:01 one run and 10:04 the next. A grace period adds a buffer to the expected interval before the monitor considers the heartbeat late.

For example, if your job runs every hour and you set a 10-minute grace period, the monitor waits 70 minutes after the last ping before alerting. This absorbs normal variance in execution time and prevents false alarms without delaying real alerts significantly.
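
The arithmetic in this example can be written out as a small sketch. The function name alert_deadline is hypothetical, and the sketch assumes the monitor measures from the last received ping, as described above.

```python
from datetime import datetime, timedelta


def alert_deadline(last_ping: datetime, interval: timedelta, grace: timedelta) -> datetime:
    """Latest moment the next ping may arrive before the monitor alerts."""
    return last_ping + interval + grace


last_ping = datetime(2024, 1, 1, 10, 1)
deadline = alert_deadline(last_ping, timedelta(hours=1), timedelta(minutes=10))

# A ping at 11:04 is three minutes later than a strict hourly cadence
# would expect, but it is still inside the 70-minute window.
late_but_ok = datetime(2024, 1, 1, 11, 4) <= deadline

# With no ping by 11:12, the 11:11 deadline has passed and an alert is due.
missed = datetime(2024, 1, 1, 11, 12) > deadline
```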

Choosing the right grace period depends on how long your job takes to run and how much variance is normal. A job that takes 2 minutes with minimal variance can usually get by with a short grace period (around 5 minutes). A job that takes 30–45 minutes usually needs a longer one (around 15–20 minutes).

What happens when a heartbeat is missed

When the grace period expires without a ping, the monitor changes the job's status from healthy to missed and fires an alert. The typical sequence:

Grace period expires — monitor marks the job as "missed"
Alert is sent — email, Slack message, PagerDuty incident, or webhook to your alerting system
Investigation — you check logs, server status, and the cron daemon to find out why the job did not run
Recovery — once the job runs again and sends a heartbeat, the monitor marks the job as healthy and optionally sends a recovery alert
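
The sequence above amounts to a two-state machine that notifies only when the state changes. The following is a rough sketch under that assumption: the JobStatus class is hypothetical, and the notifications list is a stand-in for real email, Slack, or webhook delivery.

```python
from datetime import datetime, timedelta
from typing import List, Optional


class JobStatus:
    """Hypothetical status tracker that notifies only when the state changes."""

    def __init__(self, interval: timedelta, grace: timedelta) -> None:
        self.window = interval + grace
        self.last_ping: Optional[datetime] = None
        self.state = "healthy"
        self.notifications: List[str] = []  # stand-in for email/Slack/webhook

    def ping(self, now: datetime) -> None:
        self.last_ping = now
        if self.state == "missed":
            # Recovery: a heartbeat after a miss flips the job back to healthy.
            self.state = "healthy"
            self.notifications.append("recovered")

    def check(self, now: datetime) -> None:
        # Called periodically by the monitor, for example once a minute.
        if self.last_ping is None or self.state == "missed":
            return  # nothing to compare yet, or the alert was already sent
        if now - self.last_ping > self.window:
            self.state = "missed"
            self.notifications.append("missed")


job = JobStatus(interval=timedelta(hours=1), grace=timedelta(minutes=10))
t0 = datetime(2024, 1, 1, 2, 0)
job.ping(t0)
job.check(t0 + timedelta(minutes=69))   # still inside the 70-minute window
job.check(t0 + timedelta(minutes=71))   # grace period expired: alert once
job.check(t0 + timedelta(minutes=90))   # still missed: no duplicate alert
job.ping(t0 + timedelta(minutes=120))   # job runs again: recovery notice
```

Tracking the state rather than re-alerting on every check is a design choice in this sketch. It keeps one outage from producing a stream of identical alerts.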

Implementing heartbeat pings

Adding a heartbeat to an existing job is typically a single line at the end of your script:

Shell — curl -fsS --retry 3 https://monitor.example.com/ping/abc123
Python — requests.get("https://monitor.example.com/ping/abc123", timeout=10)
Node.js — fetch("https://monitor.example.com/ping/abc123")

Place the ping after the main work completes successfully. If you put it at the beginning, the heartbeat fires even when the job fails partway through — defeating the purpose.
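
One way to enforce this ordering is to wrap the job so the ping can only happen after the work returns without an exception. The sketch below uses only the Python standard library. The helper names send_ping and run_with_heartbeat are hypothetical, the URL is the same placeholder used in the one-liners above, and the 10-second timeout is an arbitrary choice.

```python
import urllib.request
from typing import Callable

PING_URL = "https://monitor.example.com/ping/abc123"  # placeholder URL, not a real endpoint


def send_ping(url: str) -> None:
    # A short timeout keeps a slow monitor from hanging the job.
    with urllib.request.urlopen(url, timeout=10):
        pass


def run_with_heartbeat(job: Callable[[], None],
                       ping: Callable[[str], None] = send_ping,
                       url: str = PING_URL) -> bool:
    """Run job(); send the heartbeat only if it returns without raising."""
    try:
        job()
    except Exception as exc:
        # No ping on failure: the monitor's silence is the failure signal.
        print(f"job failed, heartbeat skipped: {exc}")
        return False
    try:
        ping(url)
    except OSError as exc:
        # A failed ping should not turn a successful job into a crash.
        print(f"job succeeded but ping failed: {exc}")
    return True
```

Calling run_with_heartbeat(nightly_backup), where nightly_backup is your own function, runs the work first and pings second. The ping argument is injectable so the wrapper can be exercised without network access.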

FAQ

What is a dead man's switch in cron monitoring?

A dead man's switch (heartbeat monitor) is a system that expects regular check-ins from your cron job. If the job stops checking in — because it crashed, was misconfigured, or the server went down — the switch triggers an alert. The name comes from the physical dead man's switch: a control that must be actively held, and releases automatically if the operator is incapacitated.

How is heartbeat monitoring different from uptime monitoring?

Uptime monitoring actively pings your service to check if it responds. Heartbeat monitoring passively waits for your job to ping it. Uptime monitoring catches "your web server is down." Heartbeat monitoring catches "your background job stopped running." They are complementary — most production systems use both.

What is a grace period?

A grace period is a buffer added to the expected heartbeat interval before an alert fires. If your job runs every hour and the grace period is 10 minutes, the monitor waits 70 minutes after the last ping before alerting. This prevents false alarms from minor timing variations in job execution.

Heartbeat monitoring is essential for catching silent failures in cron-scheduled tasks and background jobs that run on a fixed schedule. It complements active health checks by monitoring jobs that cannot be pinged from the outside. For a practical walkthrough, see our guide on monitoring cron jobs with alerts.