Telemetry Watchdog¶

The telemetry watchdog monitors memory, CPU, and disk — and prevents out-of-memory crashes on edge devices like Jetson, Raspberry Pi, or any long-running bubbaloop deployment.

Zero config required

The watchdog starts automatically with the daemon. No configuration needed unless you want to tune thresholds.

How It Works¶

Three components with clear separation of concerns:

graph LR
    S[Sampler] -->|snapshot + level| CB[Circuit Breaker]
    S -->|snapshot| RB[Ring Buffer]
    RB -->|batch flush| DB[(SQLite)]
    CB -->|alerts| Agent
    CB -->|kill| Nodes
    Agent -->|query| DB
    Agent -->|tune| Config[telemetry.toml]
    Config -->|hot-reload| S
    Config -->|hot-reload| CB

Component	What it does	Latency
Sampler	Reads memory/CPU/disk via `sysinfo`, adapts polling rate to pressure	Background
Circuit Breaker	Kills nodes at critical thresholds — no LLM needed	~100ms
Agent Bridge	Gives the AI agent tools + alerts for smart decisions	Seconds

The circuit breaker never waits for the agent

At Red/Critical levels, nodes are killed immediately. The agent is notified after the fact and decides whether to restart them.

Threshold Levels¶

The watchdog classifies system state into five levels based on memory usage:

graph LR
    G["Green\n< 60%"] --> Y["Yellow\n60-80%"]
    Y --> O["Orange\n80-90%"]
    O --> R["Red\n90-95%"]
    R --> C["Critical\n> 95%"]

    style G fill:#4caf50,color:white
    style Y fill:#ffeb3b,color:black
    style O fill:#ff9800,color:white
    style R fill:#f44336,color:white
    style C fill:#b71c1c,color:white

Level	Memory Used	Sampling	What Happens
Green	< 60%	Every 30s	Normal operation
Yellow	60-80%	Every 10s	Agent receives warning
Orange	80-90%	Every 5s	Agent receives urgent alert with top consumers
Red	90-95%	Every 5s	Circuit breaker kills largest non-essential node
Critical	> 95%	Every 5s	Circuit breaker kills ALL non-essential nodes

Additional thresholds (CPU and disk)

CPU > 95% for 60s sustained: Warning sent to agent (no auto-kill — high CPU is often legitimate)
Disk < 1GB free: Warning sent to agent
Disk < 200MB free: Critical disk alert

Configuration¶

All configuration lives in ~/.bubbaloop/telemetry.toml. Every field is optional — sensible defaults apply.

Minimal (recommended)Full reference

# Nothing needed! Defaults work for most deployments.
# Create this file only to override specific values.

[telemetry]
enabled = true
monitored_disk_path = "/"

[telemetry.sampling]
idle_secs = 30           # Green level polling interval
elevated_secs = 10       # Yellow level polling interval
critical_secs = 5        # Orange/Red/Critical polling interval
ring_capacity = 720      # In-memory samples (1hr at 5s)

[telemetry.thresholds]
yellow_pct = 60          # Memory % to trigger Yellow
orange_pct = 80          # Memory % to trigger Orange
red_pct = 90             # Memory % to trigger Red (auto-kill)
critical_pct = 95        # Memory % to trigger Critical (kill all)
cpu_warn_pct = 95        # CPU % for sustained warning
cpu_sustained_secs = 60  # Seconds CPU must stay above threshold
disk_warn_mb = 1024      # Disk free MB for warning
disk_critical_mb = 200   # Disk free MB for critical

[telemetry.circuit_breaker]
enabled = true
cooldown_secs = 30       # Min seconds between node kills

[telemetry.storage]
flush_interval_secs = 60 # How often to flush ring buffer to SQLite
retention_days = 7       # How long to keep history

Hot-Reload¶

Changes to telemetry.toml are picked up automatically — no daemon restart needed. The file is watched and reloaded within 200ms of saving.

Guardrails¶

The agent (or manual edits) cannot set unsafe values:

Parameter	Minimum	Maximum	Why
`critical_pct`	80	98	Prevent disabling safety or triggering too early
All sampling intervals	2s	No limit	Prevent CPU-burning poll loops

Values outside these ranges are automatically clamped and logged.

Agent Tools¶

The AI agent has three telemetry tools available via MCP dispatch.

`get_system_telemetry`¶

Returns current system state — memory, CPU, disk, top processes.

Example response

{
  "memory_used_percent": 62.3,
  "memory_available_mb": 3042,
  "cpu_usage_percent": 34.1,
  "disk_free_gb": 12.8,
  "watchdog_level": "Yellow",
  "top_processes": [
    {"name": "yolo-inference", "pid": 1234, "rss_mb": 890, "cpu_percent": 45.1},
    {"name": "rtsp-camera", "pid": 5678, "rss_mb": 412, "cpu_percent": 18.2}
  ]
}

`get_telemetry_history`¶

Returns historical data with trend analysis. Useful for detecting memory leaks.

Example response

{
  "duration_minutes": 60,
  "sample_count": 60,
  "samples": [
    {"timestamp_ms": 1709654400000, "memory_used_percent": 58.2, "cpu_percent": 30.1, "disk_free_gb": 13.1}
  ],
  "memory_trend_per_hour": 2.1,
  "trend_description": "memory_rising"
}

Trend descriptions

Trend	Meaning
`memory_rising_fast`	> 2%/hour — investigate immediately
`memory_rising`	0.5-2%/hour — potential leak
`stable`	Normal
`memory_falling`	Resources being freed

`update_telemetry_config`¶

The agent can tune thresholds at runtime. Only provided fields are updated. Guardrails enforced automatically.

{
  "red_pct": 85,
  "cooldown_secs": 60,
  "idle_secs": 15
}

System Prompt Injection¶

Every agent turn includes a one-line resource summary — no tool call needed:

## System Resources
Memory: 62% used (Yellow) | CPU: 34% | Disk: 12.8GB free

This gives the agent passive awareness of system health at zero cost.

Alert Events¶

The agent receives these alerts automatically via the broadcast channel (no polling):

Alert	When	Suggested Action
`ResourceWarning`	Yellow/Orange threshold crossed	Investigate top consumers
`NodeKilledByWatchdog`	Circuit breaker fired	Decide whether to restart
`ResourceRecovered`	Level dropped back to Green	Consider restarting killed nodes
`TrendAlert`	Sustained CPU high	Investigate workload

Architecture¶

graph TB
    subgraph Daemon
        subgraph Telemetry["Telemetry Watchdog"]
            Sampler["Sampler\n(sysinfo, adaptive rate)"]
            Ring["Ring Buffer\n(in-memory, 720 samples)"]
            CB["Circuit Breaker\n(threshold eval, node kill)"]
            Store["Storage\n(SQLite, 7-day retention)"]
            Config["Config Watcher\n(hot-reload)"]
        end

        NM["Node Manager"]
        AR["Agent Runtime"]
    end

    TOML["~/.bubbaloop/telemetry.toml"] --> Config
    Config --> Sampler
    Config --> CB
    Sampler --> Ring
    Sampler -->|"alert (level >= Yellow)"| CB
    Ring -->|"batch flush (60s)"| Store
    CB -->|"stop_node()"| NM
    CB -->|"WatchdogAlert"| AR
    Store -->|"query_history()"| AR
    Ring -->|"current_snapshot()"| AR

Data Storage¶

Storage	Location	Purpose	Retention
Ring Buffer	RAM	Fast circuit breaker reads	~720 samples (rolling)
SQLite	`~/.bubbaloop/telemetry.db`	Agent history queries, post-mortems	7 days (configurable)

The telemetry database is separate from the agent memory database to prevent write contention.

Cross-Platform Support¶

The watchdog uses the sysinfo crate — no /proc assumptions:

Platform	Memory	CPU	Disk	Processes
Linux (ARM64 / Jetson)	Yes	Yes	Yes	Yes
Linux (x86_64)	Yes	Yes	Yes	Yes
macOS	Yes	Yes	Yes	Yes

Troubleshooting¶

Watchdog killed my node

Check the daemon logs for [WATCHDOG] entries:

journalctl -u bubbaloop --since "1 hour ago" | grep WATCHDOG

The log shows which node was killed and why:

[WATCHDOG] RED: Stopping node 'yolo-inference' (RSS 890MB, memory available 8.2%)

Node keeps getting killed

The agent decides whether to restart killed nodes. If a node has a memory leak, it will be killed repeatedly. Options:

Fix the leak in the node
Relax the threshold temporarily — edit ~/.bubbaloop/telemetry.toml and increase red_pct
Mark as essential (future) — add essential: true to the node's node.yaml manifest

How do I disable the watchdog?

Disable entirely:

# ~/.bubbaloop/telemetry.toml
[telemetry]
enabled = false

Or disable only the circuit breaker (keep monitoring):

[telemetry.circuit_breaker]
enabled = false

Source Code¶

Module	File	Lines
Types & config	`crates/bubbaloop/src/daemon/telemetry/types.rs`	~250
Sampler	`crates/bubbaloop/src/daemon/telemetry/sampler.rs`	~150
Circuit breaker	`crates/bubbaloop/src/daemon/telemetry/circuit_breaker.rs`	~200
SQLite storage	`crates/bubbaloop/src/daemon/telemetry/storage.rs`	~150
Service & config watcher	`crates/bubbaloop/src/daemon/telemetry/mod.rs`	~250
Agent dispatch (3 tools)	`crates/bubbaloop/src/agent/dispatch.rs`	integrated

Source: crates/bubbaloop/src/daemon/telemetry/