Your OpenClaw Agent Went Down at 3am. Here's How to Know Sooner.
I found out my agent was down because a customer emailed me. Not a monitoring system. Not an alert. A person, annoyed, asking why the chatbot had been unresponsive for six hours. It was a Saturday. The agent had crashed sometime around 3am. Nobody knew.
This is embarrassingly common with OpenClaw deployments. And it's not because people are lazy about monitoring. It's because agents fail in ways that traditional monitoring doesn't catch.
## Why agents fail silently
Web servers crash and return 500 errors. Databases go down and queries fail loudly. But OpenClaw agents have a special talent for dying without telling anyone.
**Zombie processes.** The agent process is still running. It consumes memory, it holds the port open, the health endpoint returns 200. But internally, the LLM connection has timed out, the event loop is blocked, or a skill has entered an infinite retry loop. From the outside, everything looks fine. From the user's perspective, the agent stopped responding minutes or hours ago.
**Silent skill failures.** A skill throws an exception, but the agent catches it internally and continues running in a degraded state. Instead of crashing (which would be detectable), it just stops performing certain tasks. Users notice that the agent "forgot" how to do something, but there's no error in any log.
**Memory leaks that don't crash.** OpenClaw agents running long conversations accumulate context. Some skills leak memory gradually. The agent slows down over days until response times are measured in minutes instead of seconds. It's technically "up" the whole time.
**DNS and certificate expiry.** Your domain's SSL certificate expires at 3am on a Sunday, or a DNS record change quietly fails to propagate. The agent is running fine on the server, but users' HTTPS connections fail with errors your server never sees. The health check, running locally, sees no problem.
**Dependency outages.** Your agent depends on an LLM API (OpenAI, Anthropic, a self-hosted model). The API goes down or rate-limits you. The agent's health check might not test the LLM connection. Even if it does, the check runs every 60 seconds and the outage might start between checks.
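The dependency gap is easy to close: make the health check itself exercise the LLM connection, not just the agent process. A minimal sketch in Python — the endpoint URLs and API key are placeholders, not OpenClaw specifics:

```python
import time
import urllib.request

# Placeholders -- substitute your agent's health endpoint and LLM API.
AGENT_HEALTH_URL = "http://localhost:8080/health"
LLM_API_URL = "https://api.example-llm.invalid/v1/models"
LLM_API_KEY = "sk-placeholder"

def check(url, headers=None, timeout=5):
    """GET a URL; return (ok, latency_in_seconds)."""
    req = urllib.request.Request(url, headers=headers or {})
    start = time.monotonic()
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status == 200, time.monotonic() - start
    except OSError:  # covers URLError, HTTPError, timeouts, refused connections
        return False, time.monotonic() - start

def deep_health_check():
    """A health check is only as good as its deepest dependency:
    probe the agent process *and* the LLM API it relies on."""
    agent_ok, _ = check(AGENT_HEALTH_URL)
    llm_ok, _ = check(LLM_API_URL,
                      headers={"Authorization": f"Bearer {LLM_API_KEY}"})
    return {"agent": agent_ok, "llm": llm_ok, "healthy": agent_ok and llm_ok}
```

Even this doesn't fix the between-checks blind spot, but a failing LLM dependency now shows up on the next check instead of never.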
## The three monitoring layers you actually need
After getting burned enough times, I've landed on three layers that cover the failure modes above.
### Layer 1: External health checks
An external service pings your agent from outside your network every 60 seconds. Not just an HTTP ping. A synthetic test that sends a simple question to your agent and verifies the response is coherent. This catches zombie processes (the agent accepts the request but never responds), DNS/SSL issues (the external check fails before reaching your server), and complete outages.
The key word is "external." A health check running on the same server as your agent is useless when the server goes down. You need a check from a different network, ideally from multiple geographic locations to avoid false positives from routing issues.
ClawPulsar runs external checks from three regions. If two out of three fail, it fires an alert. This avoids waking you up because of a transient network blip in one data center.
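ClawPulsar's internals aren't shown here, but the two ideas — a synthetic conversation test plus a multi-region quorum — are simple to sketch. The endpoint path, payload shape, and `reply` field below are assumptions for illustration, not a real OpenClaw or ClawPulsar API:

```python
import json
import urllib.request

def synthetic_check(endpoint, question, expected_keyword, timeout=10):
    """POST a real question to the agent and verify the reply is coherent,
    not just that the port answers. Adapt the payload to your chat API."""
    payload = json.dumps({"message": question}).encode()
    req = urllib.request.Request(
        endpoint, data=payload,
        headers={"Content-Type": "application/json"})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            body = json.load(resp)
    except (OSError, ValueError):  # network failure or non-JSON reply
        return False
    return expected_keyword.lower() in str(body.get("reply", "")).lower()

def should_alert(region_results, quorum=2):
    """Fire only when `quorum` or more regions fail, so a transient
    blip in one data center doesn't page anyone at 3am."""
    failures = sum(1 for ok in region_results.values() if not ok)
    return failures >= quorum
```

Run `synthetic_check` from each region, feed the results to `should_alert`, and a single flaky route stays off your phone.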
### Layer 2: Processing verification
External checks confirm your agent is reachable and responsive. Processing verification confirms it's actually doing its job.
Set up a canary task that runs every 15 minutes. This is a real task (not a health check) that exercises a common workflow. For a customer support agent, it might be a test question with a known answer. For a data processing agent, it might be a small test payload with expected output.
Compare the result against expected output. If the canary task fails or returns unexpected results, something is wrong even if the health check passes. This catches skill failures, degraded LLM responses, and corrupted agent state.
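Exact string comparison is too brittle for LLM output — the wording shifts run to run — so a fuzzy comparison works better for grading the canary. A standard-library sketch; the 0.8 threshold is an arbitrary starting point to tune, not an OpenClaw convention:

```python
import difflib

def canary_passed(actual, expected, threshold=0.8):
    """Grade the agent's canary answer against the expected output
    using a similarity ratio instead of exact match."""
    ratio = difflib.SequenceMatcher(
        None, actual.lower().strip(), expected.lower().strip()).ratio()
    return ratio >= threshold
```

A paraphrased-but-correct answer scores high; "I don't know" or an empty reply scores low and trips the alert.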
The canary approach has a cost: every canary task uses LLM tokens. At 4 canary checks per hour, you're looking at maybe $2-5/month in API costs depending on your LLM. Worth it for production agents. Probably overkill for development instances.
### Layer 3: Log-based alerting
The first two layers check from outside. Log-based alerting watches from inside. Monitor your agent's logs for specific patterns:
- Error rates exceeding baseline (more than 5 errors per hour when baseline is 1)
- Response time degradation (p95 response time doubles)
- Memory usage trending upward without plateauing
- LLM API errors (rate limits, timeouts, auth failures)
- Skill loading failures on restart
The trick with log-based alerting is not over-alerting. If you alert on every error, you'll get alert fatigue within a week and start ignoring everything. Set thresholds based on your actual baseline, not on some theoretical ideal. If your agent normally throws 2 errors per hour (it probably does, from weird user inputs), don't alert until you hit 10.
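A baseline-aware error-rate check fits in a few lines: count error lines in a sliding one-hour window and alert only past a threshold derived from your observed baseline. A sketch — the log format and regex are assumptions about your setup:

```python
import re
import time
from collections import deque

ERROR_RE = re.compile(r"\b(ERROR|CRITICAL)\b")

class ErrorRateMonitor:
    """Count error lines in a sliding window; alert only when the
    rate clearly exceeds the normal baseline (10/hour here, per the
    'baseline is ~2/hour' rule of thumb)."""
    def __init__(self, threshold_per_hour=10, window_seconds=3600):
        self.threshold = threshold_per_hour
        self.window = window_seconds
        self.timestamps = deque()

    def feed(self, line, now=None):
        """Feed one log line; returns True when the alert should fire."""
        now = now if now is not None else time.time()
        # Drop errors that have aged out of the window.
        while self.timestamps and now - self.timestamps[0] > self.window:
            self.timestamps.popleft()
        if ERROR_RE.search(line):
            self.timestamps.append(now)
        return len(self.timestamps) >= self.threshold
```

Tail your agent's log into `feed` and the threshold does the rest; two background errors an hour never fire, a burst of ten does.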
## Setting up alerts that don't ruin your sleep
The most common mistake is routing every alert to your phone at maximum priority. At 3am, you need to know about two categories of problems: the agent is completely down (users are getting no response), or the agent is actively causing harm (sending wrong information, leaking data). Everything else can wait until morning.
Structure your alerts into tiers:
**Page immediately:** Agent unreachable for more than 5 minutes. Canary task failing for more than 15 minutes. Security-related errors (auth failures, unexpected data access).
**Slack notification (check within 2 hours):** Error rate elevated but agent still functional. Response time degraded but still under 30 seconds. Single skill failure that doesn't affect core functionality.
**Dashboard only (check daily):** Memory usage trending upward. Minor version mismatch warnings. Non-critical skill deprecation notices.
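In code, the tiers reduce to a routing function. The event fields below are hypothetical — shaped however your monitoring emits them — but the rules mirror the thresholds above:

```python
from enum import Enum

class Tier(Enum):
    PAGE = "page"            # wake a human now
    SLACK = "slack"          # check within 2 hours
    DASHBOARD = "dashboard"  # review daily

def classify(event):
    """Map a monitoring event (a plain dict; field names are
    assumptions for this sketch) to an alert tier."""
    kind = event.get("kind")
    minutes = event.get("minutes", 0)
    if kind == "unreachable" and minutes > 5:
        return Tier.PAGE
    if kind == "canary_failing" and minutes > 15:
        return Tier.PAGE
    if kind == "security":
        return Tier.PAGE
    if kind in ("error_rate", "latency", "skill_failure"):
        return Tier.SLACK
    return Tier.DASHBOARD
```

The default case matters: anything unclassified lands on the dashboard, not your phone.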
ClawPulsar supports all three tiers with per-endpoint configuration. Your payment processing agent gets page-immediately sensitivity. Your internal FAQ bot gets Slack-notification sensitivity. Your development test agent gets dashboard-only.
## The 15-minute setup
If you take nothing else from this post, do this:
1. Sign up for a free uptime monitor (UptimeRobot, Better Uptime, whatever). Point it at your agent's health endpoint. Set it to check every 60 seconds and alert via email.
2. Add a basic log grep that runs every 5 minutes and counts error lines. If the count exceeds 10, send yourself a Slack message. A cron job with curl is fine.
3. Test your alerts by deliberately killing your agent and verifying you get notified within 2 minutes.
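Step 2 can be a single script on a 5-minute cron. This sketch uses Python rather than grep-plus-curl; the log path and Slack webhook URL are placeholders you'd substitute for your own:

```python
#!/usr/bin/env python3
"""Minimal smoke detector: run from cron every 5 minutes."""
import json
import urllib.request

LOG_PATH = "/var/log/openclaw/agent.log"  # assumption: adjust to your setup
SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder
MAX_ERRORS = 10

def count_recent_errors(path, tail_lines=2000):
    """Count error lines near the end of the log -- a rough proxy for
    'the last few minutes' without tracking file offsets."""
    try:
        with open(path, errors="replace") as f:
            lines = f.readlines()[-tail_lines:]
    except FileNotFoundError:
        return 0
    return sum(1 for line in lines if "ERROR" in line)

def notify(text):
    """Post a message to a Slack incoming webhook."""
    req = urllib.request.Request(
        SLACK_WEBHOOK,
        data=json.dumps({"text": text}).encode(),
        headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req, timeout=10)

if __name__ == "__main__":
    n = count_recent_errors(LOG_PATH)
    if n > MAX_ERRORS:
        notify(f"OpenClaw agent: {n} errors in recent log lines")
```

Install it with a crontab line like `*/5 * * * * /usr/bin/python3 /opt/agent-smoke.py` and you have step 2 done.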
This isn't sophisticated monitoring. It's a smoke detector. It won't tell you exactly what's wrong, but it'll tell you something is wrong before your customers do. Upgrade to proper monitoring (ClawPulsar or equivalent) when the agent becomes business-critical.