Recuro.

Heartbeat Monitoring

Quick Summary — TL;DR

  • Heartbeat monitoring works by expecting a ping from your job after each run — if the ping stops arriving, the monitor alerts you that the job has gone silent.
  • The opposite of uptime monitoring: instead of a service pinging your app, your app pings the service. Silence means failure.
  • Grace periods absorb normal timing variance so a job that runs a few minutes late does not trigger a false alarm.

Heartbeat monitoring (also called a dead man's switch) is a monitoring pattern where your job sends a signal — a "heartbeat" — to an external endpoint after each successful run. The monitoring service tracks these heartbeats. If an expected heartbeat does not arrive within a defined window, the service assumes the job has failed or stopped running and sends an alert.

How heartbeat monitoring works

The flow is simple:

  1. Configure — create a heartbeat monitor with an expected interval (e.g., "expect a ping every hour")
  2. Instrument — add a single HTTP request to the end of your job that pings the monitor's URL
  3. Monitor — the service records each ping and starts a countdown to the next expected one
  4. Alert — if the countdown expires without a new ping, the monitor fires an alert (email, Slack, webhook)
  5. Recover — when the next heartbeat arrives, the monitor marks the job as healthy and optionally sends a recovery notification

The key insight is that the monitor does not know or care what your job does. It only knows whether the job checked in on time. This makes heartbeat monitoring applicable to anything that runs on a schedule: cron jobs, batch processing, database backups, data pipelines, queue workers.
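The monitor side of this flow can be sketched in a few lines. This is a minimal illustration, not how any particular service implements it: the class name `HeartbeatMonitor` and its methods are hypothetical, and the sketch ignores grace periods and alert delivery.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class HeartbeatMonitor:
    """Hypothetical in-memory monitor: it only tracks when pings arrive."""

    interval: timedelta                  # e.g. "expect a ping every hour"
    last_ping: Optional[datetime] = None

    def record_ping(self, now: datetime) -> None:
        # Each ping restarts the countdown to the next expected one.
        self.last_ping = now

    def next_expected(self) -> Optional[datetime]:
        # No countdown is running until the first ping has arrived.
        if self.last_ping is None:
            return None
        return self.last_ping + self.interval

    def is_overdue(self, now: datetime) -> bool:
        # True once the countdown has expired without a new ping.
        expected = self.next_expected()
        return expected is not None and now > expected
```

Note that the class never inspects what the job does. It stores a timestamp and compares it to the clock, which is why the same pattern fits any scheduled task.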

Heartbeat monitoring vs uptime monitoring

The two patterns work in opposite directions:

Feature        | Uptime monitoring                   | Heartbeat monitoring
Direction      | Monitor pings your service          | Your job pings the monitor
Detects        | Service is down or slow             | Job did not run or failed silently
Best for       | Web servers, APIs, public endpoints | Cron jobs, batch tasks, background processes
Failure signal | Error response or timeout           | Absence of a signal (silence)

Uptime monitoring tells you "your API is down." Heartbeat monitoring tells you "your nightly backup did not run." They solve different problems and are complementary.

Use cases

Grace periods

Jobs do not always run at exactly the same time. A task scheduled every hour might finish at 10:01 on one run and 11:04 on the next. A grace period adds a buffer to the expected interval before the monitor considers the heartbeat late.

For example, if your job runs every hour and you set a 10-minute grace period, the monitor waits 70 minutes after the last ping before alerting. This absorbs normal variance in execution time and prevents false alarms without delaying real alerts significantly.
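The arithmetic in this example is simply the last ping time plus the interval plus the grace period. A minimal sketch follows; the function name `alert_deadline` and the sample date are illustrative assumptions.

```python
from datetime import datetime, timedelta


def alert_deadline(last_ping: datetime, interval: timedelta, grace: timedelta) -> datetime:
    """Earliest moment the monitor may alert if no new ping arrives."""
    return last_ping + interval + grace


# Hourly job with a 10-minute grace period, last ping at 10:00 (sample date).
last_ping = datetime(2024, 1, 1, 10, 0)
deadline = alert_deadline(last_ping, timedelta(hours=1), timedelta(minutes=10))

print(deadline)              # 2024-01-01 11:10:00
print(deadline - last_ping)  # 1:10:00, i.e. 70 minutes
```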

Choosing the right grace period depends on how long your job takes to run and how much variance is normal. A job that takes 2 minutes with minimal variance needs a short grace period (5 minutes). A job that takes 30–45 minutes needs a longer one (15–20 minutes).

What happens when a heartbeat is missed

When the grace period expires without a ping, the monitor transitions the job's status from healthy to missed and fires an alert. The typical sequence:

  1. Grace period expires — monitor marks the job as "missed"
  2. Alert is sent — email, Slack message, PagerDuty incident, or webhook to your alerting system
  3. Investigation — you check logs, server status, and the cron daemon to find out why the job did not run
  4. Recovery — once the job runs again and sends a heartbeat, the monitor marks the job as healthy and optionally sends a recovery alert
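The status changes in this sequence can be modeled as a small state machine. The sketch below is an assumption about how a monitor might behave, not a description of a specific product: the names `Status` and `next_status` are hypothetical, and it assumes the alert fires only once per outage rather than on every check.

```python
from enum import Enum
from typing import Optional, Tuple


class Status(Enum):
    HEALTHY = "healthy"
    MISSED = "missed"


def next_status(
    current: Status, ping_received: bool, deadline_passed: bool
) -> Tuple[Status, Optional[str]]:
    """Return the new status and the notification to send, if any."""
    if ping_received:
        # A heartbeat always returns the job to healthy; it is a recovery
        # only if the job was previously marked as missed.
        if current is Status.MISSED:
            return Status.HEALTHY, "recovery"
        return Status.HEALTHY, None
    if deadline_passed and current is Status.HEALTHY:
        # The grace period expired without a ping: alert on the transition.
        return Status.MISSED, "alert"
    # Either still within the window, or already missed and already alerted.
    return current, None
```

Returning the notification alongside the status lets the caller decide how to deliver it (email, Slack, webhook) or whether to drop the optional recovery message.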

Implementing heartbeat pings

Adding a heartbeat to an existing job is typically a single line at the end of your script: one HTTP request to the monitor's URL.

Place the ping after the main work completes successfully. If you put it at the beginning, the heartbeat fires even when the job fails partway through — defeating the purpose.
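As a rough sketch of that placement, the wrapper below runs the job first and pings only if the job returns without raising. The URL is a placeholder, not a real endpoint, and the names `send_heartbeat`, `run_with_heartbeat`, and `nightly_backup` are hypothetical; substitute the ping URL your monitoring service gives you.

```python
import urllib.request

# Placeholder URL (assumption): replace with your monitor's actual ping URL.
PING_URL = "https://monitor.example.com/ping/your-monitor-id"


def send_heartbeat(url: str = PING_URL, timeout: float = 10.0) -> bool:
    """Send one heartbeat; return True if the monitor answered with a 2xx."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # URLError and timeouts are OSError subclasses. A failed ping should
        # not crash a job that has already finished its work.
        return False


def run_with_heartbeat(job, ping=send_heartbeat) -> None:
    """Run the job, then ping. If the job raises, no heartbeat is sent."""
    job()   # an exception here propagates and skips the ping below
    ping()


def nightly_backup() -> None:
    """Stand-in for the real work of the job."""
    pass


# Typical use at the bottom of a script:
# run_with_heartbeat(nightly_backup)
```

Because a failed job never reaches the ping, the monitor sees silence and alerts after the grace period, which is the intended behavior.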

FAQ

What is a dead man's switch in cron monitoring?

A dead man's switch (heartbeat monitor) is a system that expects regular check-ins from your cron job. If the job stops checking in — because it crashed, was misconfigured, or the server went down — the switch triggers an alert. The name comes from the physical dead man's switch: a control that must be actively held, and releases automatically if the operator is incapacitated.

How is heartbeat monitoring different from uptime monitoring?

Uptime monitoring actively pings your service to check if it responds. Heartbeat monitoring passively waits for your job to ping it. Uptime monitoring catches "your web server is down." Heartbeat monitoring catches "your background job stopped running." They are complementary — most production systems use both.

What is a grace period?

A grace period is a buffer added to the expected heartbeat interval before an alert fires. If your job runs every hour and the grace period is 10 minutes, the monitor waits 70 minutes after the last ping before alerting. This prevents false alarms from minor timing variations in job execution.

Heartbeat monitoring is essential for catching silent failures in cron-scheduled tasks and background jobs that run on a fixed schedule. It complements active health checks by monitoring jobs that cannot be pinged from the outside. For a practical walkthrough, see our guide on monitoring cron jobs with alerts.
