Key Metrics & Alerts
The Prometheus metrics that tell you whether your Avalanche L1 is healthy, with recommended healthy, warning, and paging thresholds.
Once you have monitoring set up, the question is what to actually alert on. AvalancheGo exposes hundreds of Prometheus metrics. The query failure rate is the single most sensitive indicator of consensus health and the right primary alert — but it is not a catch-all. Three failures do not reliably show up in it and need their own alerts:
- Disk filling up — the node keeps participating in consensus until it runs out of space and shuts down.
- Bad blocks — your node's VM can reject proposed blocks while still answering queries from other validators normally, so a correctness divergence can accumulate without the failure rate moving.
- L1 validator balance running out — an L1 validator pays a continuous fee from a prepaid balance; when it empties, the validator goes inactive. AvalancheGo exposes no per-validator balance metric, so this must be tracked over RPC.
The metrics below are ordered by importance. Start with the first four — the failure rate plus its three blind spots — and use the rest mainly to diagnose why the failure rate moved.
Thresholds are guidelines — most of the values below reflect operational policy, not hard limits in the code. Validate them against your own chain's baseline before paging. Throughout, <chain> is your L1's blockchain ID (or its primary alias), which appears as the chain label on per-chain metrics.
Query failure rate
The single most sensitive indicator of L1 health. A poll is a round of voting where the node asks a sample of validators whether they prefer a block; it succeeds when enough validators respond in time. Because each validator carries a share of stake, the success rate drops by roughly an offline validator's stake share whenever one stops responding — so this one number catches networking faults, down validators, and finalization problems together.
| Healthy | ~100% successful |
| Warning | < 95% successful |
| Paging | < 90% successful |
As it falls, finalization slows. Once a node's connected stake drops below AlphaConfidence/K (75% at the defaults), it stops sending queries and the chain stalls for that node. AvalancheGo does not expose a success percentage directly — compute it from the Snowman poll counters (polls_successful, polls_failed):
rate(avalanche_snowman_polls_successful{chain="<chain>"}[5m])
/
(
rate(avalanche_snowman_polls_successful{chain="<chain>"}[5m])
+ rate(avalanche_snowman_polls_failed{chain="<chain>"}[5m])
)

When this degrades, check the diagnostic metrics further down to find the cause.
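The same ratio can be sanity-checked offline from two scrapes of the raw counters. The following is a minimal sketch rather than anything shipped with AvalancheGo: `poll_success_ratio` and `classify` are hypothetical helper names, the sample numbers are invented, and the 95% and 90% cut-offs are the guideline thresholds from the table above.

```python
from typing import Optional


def poll_success_ratio(succ_start: float, succ_end: float,
                       fail_start: float, fail_end: float) -> Optional[float]:
    """Fraction of polls that succeeded between two scrapes of the counters."""
    succeeded = succ_end - succ_start
    failed = fail_end - fail_start
    if succeeded < 0 or failed < 0:
        return None  # a counter went backwards (e.g. node restart): skip window
    total = succeeded + failed
    if total == 0:
        return None  # no polls in the window, so the ratio is undefined
    return succeeded / total


def classify(ratio: Optional[float]) -> str:
    """Map a success ratio onto the guideline thresholds (95% warn, 90% page)."""
    if ratio is None:
        return "no-data"
    if ratio < 0.90:
        return "paging"
    if ratio < 0.95:
        return "warning"
    return "healthy"


# Hypothetical samples: 920 successful and 80 failed polls in the window.
ratio = poll_success_ratio(1000, 1920, 10, 90)
print(ratio, classify(ratio))
```

Returning "no-data" instead of a number for an empty window is a deliberate choice in this sketch: a chain that is idle or a node that just restarted should not be reported as either healthy or failing.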
Disk space remaining
A completely independent failure. The node keeps participating in consensus — with the query failure rate looking perfectly healthy — right up until it runs out of disk and shuts itself down. The failure rate gives you no warning of a disk problem, which is exactly why disk needs its own alert.
| Healthy | > 20% free |
| Warning | < 20% free |
| Paging | < 10% free |
avalanche_resource_tracker_disk_available_percentage

AvalancheGo tracks free space on its database volume natively and self-governs on it: by default it reports itself unhealthy below 10% free and performs a fatal shutdown below 3% free (--system-tracker-disk-warning-available-space-percentage defaults to 10, --system-tracker-disk-required-available-space-percentage defaults to 3). Page at the 10% mark, the point the node itself flags unhealthy, so you have runway to add storage or prune well before the 3% shutdown. If you run in Kubernetes or on a managed host, alert on the equivalent volume-usage metric too, since the node's own metric stops reporting once the process is down.
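To make the "runway" between the 10% page and the 3% shutdown concrete, here is a small sketch. It assumes the gauge is on a 0 to 100 scale (matching the flag defaults quoted above) and that free space shrinks roughly linearly; `disk_status` and `hours_until_shutdown` are hypothetical helper names, and the consumption rate in the example is invented.

```python
from typing import Optional


def disk_status(available_pct: float,
                warn_below: float = 20.0, page_below: float = 10.0) -> str:
    """Classify free disk space against the guideline thresholds above."""
    if available_pct < page_below:
        return "paging"
    if available_pct < warn_below:
        return "warning"
    return "healthy"


def hours_until_shutdown(available_pct: float, pct_lost_per_hour: float,
                         shutdown_below: float = 3.0) -> Optional[float]:
    """Hours until free space reaches the fatal-shutdown mark at a steady rate."""
    if pct_lost_per_hour <= 0:
        return None  # free space is flat or growing: no shutdown projected
    return max(available_pct - shutdown_below, 0.0) / pct_lost_per_hour


# Hypothetical node at the paging mark, losing half a percentage point per hour.
print(disk_status(9.5), hours_until_shutdown(10.0, 0.5))
```

In this invented case the operator would have about 14 hours between the page and the shutdown, which is the margin the 10% threshold is meant to buy.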
Bad blocks (EVM L1s)
| Metric | avalanche_subnetevm_vm_eth_chain_block_bad_count{chain="<chain>"} |
| Healthy | 0 |
| Paging | any sustained increase |
Blocks that failed validation (state-root mismatch, invalid transactions). A rising count means this node is diverging from the network. It is an independent signal: the VM can reject proposed blocks while the node still answers queries from other validators normally, so bad blocks can accumulate without the failure rate moving — which is exactly why this gets its own alert.
The metric is namespaced by VM. A Subnet-EVM L1 runs the VM as an out-of-process plugin, so its metric carries a vm_ segment: avalanche_subnetevm_vm_eth_chain_block_bad_count. The in-process C-Chain (Coreth) has no vm_ segment: avalanche_evm_eth_chain_block_bad_count{chain="C"}. There is no built-in alert threshold — badBlockLimit (10) in the source is just an in-memory cache size, so alert on any sustained increase rather than a fixed count.
L1 validator balance (RPC)
| Source | platform.getCurrentValidators or platform.getL1Validator → balance (nAVAX) |
| Healthy | ample runway at your current burn rate |
| Paging | projected depletion within your top-up window |
Each L1 validator pays a continuous fee out of a prepaid AVAX balance; when that balance reaches 0 the validator becomes inactive and stops counting toward consensus. There is no Prometheus metric for an individual validator's remaining balance — the P-Chain exposes only network-wide aggregates — so poll the P-Chain RPC instead. One call returns every validator of your L1:
curl -s -X POST -H 'content-type:application/json' --data '{
"jsonrpc": "2.0", "id": 1,
"method": "platform.getCurrentValidators",
"params": {"subnetID": "<your subnet ID>"}
}' https://api.avax.network/ext/bc/P

Each validator in the reply carries a balance field in nAVAX; for example, "balance": "5251734528" is ~5.25 AVAX of remaining runway. To watch a single validator, platform.getL1Validator with its validationID returns the same field. (Note: getCurrentValidators only includes balance for subnets that have been converted to L1s; a legacy permissioned subnet returns the old staker format without it.)
Because the fee accrues at a predictable rate, alert on runway, not a fixed number: track the balance's slope and page when projected depletion falls inside your top-up turnaround time. If enough of an L1's stake goes inactive the chain stalls and the query failure rate rises — but by then the affected validators are already offline, so monitoring balance directly is what gives you advance warning.
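A runway alert of this kind can be sketched from two polls of the balance. The code below is illustrative only: it assumes the reply nests validators under `result.validators` with `validationID` and `balance` keys (check this against your node's actual response), the helper names are hypothetical, and the sample balances and the seven-day top-up window are invented.

```python
from typing import Dict, Optional

NAVAX_PER_AVAX = 1_000_000_000  # 1 AVAX = 10^9 nAVAX


def balances_navax(response: dict) -> Dict[str, int]:
    """Map validationID -> balance in nAVAX, skipping entries with no balance."""
    out: Dict[str, int] = {}
    for v in response.get("result", {}).get("validators", []):
        if "validationID" in v and "balance" in v:
            out[v["validationID"]] = int(v["balance"])
    return out


def runway_seconds(balance_then: int, balance_now: int,
                   elapsed_s: float) -> Optional[float]:
    """Seconds until the balance hits 0 at the burn rate seen between two polls."""
    if elapsed_s <= 0:
        return None
    burn_per_s = (balance_then - balance_now) / elapsed_s
    if burn_per_s <= 0:
        return None  # balance flat or topped up: no depletion projected
    return balance_now / burn_per_s


def should_page(runway_s: Optional[float], top_up_window_s: float) -> bool:
    """Page when projected depletion falls inside the top-up turnaround time."""
    return runway_s is not None and runway_s < top_up_window_s


# Hypothetical reply fragment using the example balance from the text.
reply = {"result": {"validators": [
    {"validationID": "<validation ID>", "balance": "5251734528"},
]}}
for vid, navax in balances_navax(reply).items():
    print(vid, navax / NAVAX_PER_AVAX, "AVAX")

# Invented samples one day apart: 6 AVAX, then 5 AVAX.
runway = runway_seconds(6 * NAVAX_PER_AVAX, 5 * NAVAX_PER_AVAX, 86_400)
print(should_page(runway, 7 * 86_400))
```

Balances are kept as integers in nAVAX until display so that the slope calculation does not accumulate rounding error; only the final division produces a float.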
Connected stake
| Metric | avalanche_stake_percent_connected{chain="<chain>"} |
| Healthy | ≥ 0.8 (80%) |
| Paging | < 0.8 (80%) |
The fraction of total validator stake the node has live connections to. Note this is a fraction in [0, 1], not a 0–100 value — compare against 0.8, or multiply by 100 to display a percentage. When it falls, the node cannot reach enough stake to complete polls — a direct cause of failure-rate drops. The node's own health check fails below ~80% (alpha/k plus a buffer, at the defaults); query sending stops below 75% (AlphaConfidence/K).
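Because mixing up the fraction and percentage scales silently disables this alert, it can help to reject out-of-range values outright. A minimal sketch, assuming the default thresholds described above (0.80 for the health check, 0.75 for query sending); `connected_stake_status` is a hypothetical helper, not part of AvalancheGo.

```python
def connected_stake_status(fraction: float,
                           health_floor: float = 0.80,
                           query_floor: float = 0.75) -> str:
    """Classify avalanche_stake_percent_connected, which is a fraction in [0, 1]."""
    if not 0.0 <= fraction <= 1.0:
        # A value like 80 almost certainly means a percentage was passed in.
        raise ValueError(f"expected a fraction in [0, 1], got {fraction}")
    if fraction < query_floor:
        return "queries-stopped"  # below AlphaConfidence/K at the defaults
    if fraction < health_floor:
        return "paging"           # node's own health check is failing
    return "healthy"


print(connected_stake_status(0.78))
```

With a sample value of 0.78 the node is still sending queries but already fails its own health check, which is the window in which a page is most useful.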
Processing blocks
| Metric | avalanche_snowman_blks_processing{chain="<chain>"} |
| Healthy | low and stable |
| Warning | sustained climb (e.g. > 6 over 5 min) |
| Paging | sustained spike (e.g. > 15 over 5 min) |
Blocks in consensus but not yet finalized. A sustained climb usually means finalization is stalling. The thresholds above are operational policy, not code defaults — AvalancheGo's own consensus health check trips on MaxOutstandingItems (256) and MaxItemProcessingTime (30s), so tune the numbers to your chain's block rate and watch the trend. To confirm whether the chain is genuinely stuck (versus just busy), check that avalanche_snowman_last_accepted_height{chain="<chain>"} is still increasing.
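The "stuck versus busy" check can be expressed as a small decision over a window of samples. This is a sketch under stated assumptions: the samples are taken at regular intervals over the alert window, "sustained" is interpreted as every sample exceeding the threshold, and `chain_state` is a hypothetical helper using the example thresholds from the table (6 and 15).

```python
from typing import Sequence


def chain_state(heights: Sequence[int], processing: Sequence[int],
                warn_above: int = 6, page_above: int = 15) -> str:
    """Combine last_accepted_height and blks_processing samples from one window."""
    if not heights or not processing:
        return "no-data"
    height_advanced = heights[-1] > heights[0]
    floor = min(processing)  # lowest backlog seen anywhere in the window
    if not height_advanced and floor > 0:
        return "stuck"    # blocks are waiting but nothing is being accepted
    if floor > page_above:
        return "paging"   # backlog stayed above the paging threshold throughout
    if floor > warn_above:
        return "warning"  # backlog stayed above the warning threshold throughout
    return "healthy"


# Hypothetical windows: a stalled chain, then a busy but progressing one.
print(chain_state([100, 100, 100], [8, 12, 16]))
print(chain_state([100, 105, 110], [7, 8, 9]))
```

An idle chain with no blocks in processing and a flat height is treated as healthy here, since a lack of new blocks is not by itself evidence of a stall.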
Benched validators
| Metric | avalanche_benchlist_benched_num{chain="<chain>"} |
| Healthy | 0 |
| Paging | > 1 over 10 min |
The count of peers the node has temporarily stopped querying because they keep failing. A non-zero value means at least one validator is unreachable, but the gauge doesn't name which one — you'll need the node's logs to identify it. Also note benchlisting is capped by stake: a high-stake validator can keep failing without ever being benched, so 0 doesn't guarantee every validator is healthy.
Number of validators
| Metric | avalanche_stake_num_validators{chain="<chain>"} |
| Healthy | your expected validator count |
| Paging | < 1 |
The size of the validator set the node currently sees. Dropping to 0 means it has lost its view of the set entirely (a P-Chain or L1 manager problem). With continuous staking, an L1's validators no longer expire together the way a legacy Subnet's validator periods could lapse and halt the chain, so a shrinking count matters less than it once did. The equivalent continuous-staking risk is individual validators going inactive when their balance runs out — monitor that directly (see L1 validator balance above).
Health check failures
| Metric | avalanche_health_checks_failing{check="health",tag="all"} |
| Healthy | 0 |
| Paging | > 0 (sustained) |
A catch-all gauge of how many checks are currently failing in the node's health endpoint (networking, router, database, disk, BLS key, pending upgrades, bootstrap status, validation). It carries two labels: check (one of health, liveness, readiness) and tag (all, application, or a specific subnet ID). Use tag="all" for the complete rollup — tag="application" covers only node-wide checks and excludes per-subnet ones, so it is not a true catch-all. To watch one L1 specifically, select that subnet's ID as the tag. Because the gauge reports the current count rather than an event total, any non-zero value already means the node is unhealthy — page on > 0, optionally requiring it to persist a minute or two to avoid flapping on transient checks.
CPU usage
| Metric | avalanche_resource_tracker_cpu_usage (and host/container CPU) |
| Healthy | well below your core count |
| Paging | sustained saturation |
AvalancheGo exposes its own CPU usage as avalanche_resource_tracker_cpu_usage, measured in cores (a value of 2.0 means two full cores), not a percentage. Watch this alongside host- or container-level CPU (which comes from your infrastructure, not AvalancheGo). Sustained saturation slows block verification and message handling, which shows up downstream as a higher failure rate.
Summary
| Metric | Page when |
|---|---|
| Query failure rate (polls_successful / polls_failed) | < 90% successful |
| Disk space remaining (disk_available_percentage) | < 10% free |
| Bad blocks (subnetevm_vm_eth_chain_block_bad_count) | any sustained increase |
| L1 validator balance (RPC getCurrentValidators → balance) | runway below your top-up window |
| Connected stake (stake_percent_connected) | < 0.8 |
| Processing blocks (blks_processing) | sustained spike |
| Benched validators (benchlist_benched_num) | > 1 / 10 min |
| Number of validators (stake_num_validators) | < 1 |
| Health check failures (health_checks_failing) | > 0 sustained |
| CPU usage (resource_tracker_cpu_usage) | sustained saturation |
Start with the query failure rate and disk alerts, add the bad-blocks alert, and — for L1s — poll each validator's balance over RPC. Put the remaining metrics on your dashboards so you can quickly find the cause when the failure rate moves.