Quick Summary — TL;DR
Heartbeat monitoring (also called a dead man's switch) is a monitoring pattern where your job sends a signal — a "heartbeat" — to an external endpoint after each successful run. The monitoring service tracks these heartbeats. If an expected heartbeat does not arrive within a defined window, the service assumes the job has failed or stopped running and sends an alert.
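The flow is simple: the job runs on its schedule, pings the monitor after it finishes successfully, and the monitor alerts if the next ping does not arrive within the expected window.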
The key insight is that the monitor does not know or care what your job does. It only knows whether the job checked in on time. This makes heartbeat monitoring applicable to anything that runs on a schedule: cron jobs, batch processing, database backups, data pipelines, queue workers.
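To make the "monitor only knows whether the job checked in" idea concrete, here is a minimal in-memory sketch. The class name `HeartbeatMonitor`, its methods, and the job names are hypothetical and chosen for illustration; a real monitoring service would persist check-ins and typically track a separate window per job.

```python
from datetime import datetime, timedelta


class HeartbeatMonitor:
    """Toy in-memory monitor (hypothetical): stores only when each job last checked in."""

    def __init__(self, window: timedelta):
        self.window = window  # how long silence is tolerated before a job is overdue
        self.last_ping: dict[str, datetime] = {}

    def ping(self, job_id: str, now: datetime) -> None:
        # The monitor learns nothing about the job except that it checked in.
        self.last_ping[job_id] = now

    def overdue(self, now: datetime) -> list[str]:
        # A job is overdue when its last check-in is older than the window.
        return sorted(
            job_id
            for job_id, seen in self.last_ping.items()
            if now - seen > self.window
        )


monitor = HeartbeatMonitor(window=timedelta(hours=1))
start = datetime(2024, 1, 1, 2, 0)
monitor.ping("nightly-backup", start)
monitor.ping("queue-worker", start)
monitor.ping("queue-worker", start + timedelta(minutes=50))

# 90 minutes after start, only the backup has been silent for more than an hour.
print(monitor.overdue(start + timedelta(minutes=90)))  # ['nightly-backup']
```

Nothing in the sketch depends on what the jobs do, which is why the same check can cover backups, pipelines, and queue workers alike.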
Uptime monitoring and heartbeat monitoring work in opposite directions:
| Feature | Uptime monitoring | Heartbeat monitoring |
|---|---|---|
| Direction | Monitor pings your service | Your job pings the monitor |
| Detects | Service is down or slow | Job did not run or failed silently |
| Best for | Web servers, APIs, public endpoints | Cron jobs, batch tasks, background processes |
| Failure signal | Error response or timeout | Absence of a signal (silence) |
Uptime monitoring tells you "your API is down." Heartbeat monitoring tells you "your nightly backup did not run." They solve different problems and are complementary.
Jobs do not always run at exactly the same time. A task scheduled every hour might finish at 10:01 one run and 10:04 the next. A grace period adds a buffer to the expected interval before the monitor considers the heartbeat late.
For example, if your job runs every hour and you set a 10-minute grace period, the monitor waits 70 minutes after the last ping before alerting. This absorbs normal variance in execution time and prevents false alarms without delaying real alerts significantly.
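The arithmetic in this example can be written out as a short sketch. The function name `alert_deadline` is hypothetical, and the sketch assumes the monitor measures from the last received ping, as the example describes.

```python
from datetime import datetime, timedelta


def alert_deadline(last_ping: datetime, interval: timedelta, grace: timedelta) -> datetime:
    """Earliest moment at which the heartbeat would be treated as missed."""
    return last_ping + interval + grace


# Hourly job with a 10-minute grace period; the last ping arrived at 10:01.
last_ping = datetime(2024, 1, 1, 10, 1)
deadline = alert_deadline(last_ping, timedelta(hours=1), timedelta(minutes=10))

print(deadline - last_ping)  # 1:10:00
print(deadline)              # 2024-01-01 11:11:00
```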
Choosing the right grace period depends on how long your job takes to run and how much variance is normal. A job that takes 2 minutes with minimal variance needs a short grace period (5 minutes). A job that takes 30–45 minutes needs a longer one (15–20 minutes).
When the grace period expires without a ping, the monitor transitions the job's status from healthy to failed and fires an alert.
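The healthy-to-failed transition might be sketched as below. The class `JobStatus` is hypothetical, and two behaviors are assumptions not stated in the text: the alert fires once per failure rather than on every check, and a later ping returns the job to healthy.

```python
from datetime import datetime, timedelta


class JobStatus:
    """Hypothetical per-job state: 'healthy' until the grace period expires."""

    def __init__(self, interval: timedelta, grace: timedelta, last_ping: datetime):
        self.interval = interval
        self.grace = grace
        self.last_ping = last_ping
        self.state = "healthy"
        self.alerts: list[str] = []

    def check(self, now: datetime) -> str:
        deadline = self.last_ping + self.interval + self.grace
        # Assumption: alert only on the healthy -> failed transition, not on every check.
        if self.state == "healthy" and now > deadline:
            self.state = "failed"
            self.alerts.append(f"no heartbeat since {self.last_ping}")
        return self.state

    def ping(self, now: datetime) -> None:
        # Assumption: a fresh ping marks the job healthy again.
        self.last_ping = now
        self.state = "healthy"


job = JobStatus(timedelta(hours=1), timedelta(minutes=10), datetime(2024, 1, 1, 10, 0))
states = [
    job.check(datetime(2024, 1, 1, 11, 5)),   # before the 11:10 deadline
    job.check(datetime(2024, 1, 1, 11, 15)),  # deadline passed: transition and alert
    job.check(datetime(2024, 1, 1, 11, 30)),  # still failed, no second alert
]
print(states, len(job.alerts))  # ['healthy', 'failed', 'failed'] 1
```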
Adding a heartbeat to an existing job is typically a single line at the end of your script:
Shell:

    curl -fsS --retry 3 https://monitor.example.com/ping/abc123

Python:

    requests.get("https://monitor.example.com/ping/abc123")

JavaScript:

    fetch("https://monitor.example.com/ping/abc123")

Place the ping after the main work completes successfully. If you put it at the beginning, the heartbeat fires even when the job fails partway through, which defeats the purpose.
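One way to enforce the "ping only after success" rule is to wrap the job so the heartbeat is skipped whenever the job raises. This is a sketch under assumptions: `run_with_heartbeat` and `send_ping` are hypothetical helpers, the URL is the article's placeholder, and the standard library's `urllib.request` stands in for `requests` so the example has no third-party dependency.

```python
import urllib.request

PING_URL = "https://monitor.example.com/ping/abc123"  # placeholder URL, not a real endpoint


def send_ping(url: str = PING_URL) -> None:
    # Short timeout so a slow monitor cannot hang the job.
    with urllib.request.urlopen(url, timeout=10):
        pass


def run_with_heartbeat(job, ping=send_ping) -> bool:
    """Run job(); send the heartbeat only if it returns without raising."""
    try:
        job()
    except Exception as exc:
        # No ping on failure: the monitor's silence is what triggers the alert.
        print(f"job failed, skipping heartbeat: {exc}")
        return False
    try:
        ping()
    except OSError as exc:  # urllib.error.URLError is a subclass of OSError
        print(f"heartbeat could not be sent: {exc}")
    return True


# Demo with a stand-in ping so the sketch runs without network access.
sent = []


def failing_job():
    raise RuntimeError("disk full")


ok = run_with_heartbeat(lambda: None, ping=lambda: sent.append("ping"))
failed = run_with_heartbeat(failing_job, ping=lambda: sent.append("ping"))
print(ok, failed, sent)  # True False ['ping']
```

Passing the ping as a parameter keeps the wrapper testable; in a real job you would call `run_with_heartbeat(job)` and let the default `send_ping` contact the monitor.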
A dead man's switch (heartbeat monitor) is a system that expects regular check-ins from your cron job. If the job stops checking in — because it crashed, was misconfigured, or the server went down — the switch triggers an alert. The name comes from the physical dead man's switch: a control that must be actively held, and releases automatically if the operator is incapacitated.
Uptime monitoring actively pings your service to check if it responds. Heartbeat monitoring passively waits for your job to ping it. Uptime monitoring catches "your web server is down." Heartbeat monitoring catches "your background job stopped running." They are complementary — most production systems use both.
A grace period is a buffer added to the expected heartbeat interval before an alert fires. If your job runs every hour and the grace period is 10 minutes, the monitor waits 70 minutes after the last ping before alerting. This prevents false alarms from minor timing variations in job execution.
Heartbeat monitoring is essential for catching silent failures in cron-scheduled tasks and background jobs that run on a fixed schedule. It complements active health checks by monitoring jobs that cannot be pinged from the outside. For a practical walkthrough, see our guide on monitoring cron jobs with alerts.