Runbooks provide a structured response to known incident types. Each runbook follows the same phases: detect, assess, act, verify, and document.
Incident Response Template
Use this template for any incident that does not have a dedicated runbook.
- Detect: Identify the symptom. How was the issue discovered? (Alert, user report, routine check.)
- Assess: Determine scope and severity. Is the validator offline? Are transactions failing? Is data at risk?
- Act: Execute the appropriate fix. Follow the relevant runbook below or the troubleshooting methodology.
- Verify: Confirm the fix. Check health endpoints, submit a test transaction, and review logs for recurring errors.
- Document: Record what happened, what caused it, what was done, and any follow-up actions.
Runbook: Validator Offline
Detection: Health endpoint returns non-200, a monitoring alert fires, or the validator disappears from public network status.
Assessment:
- If the pod is in CrashLoopBackOff, check logs for the root cause:
- If the pod is not scheduled, check for resource quota issues or node availability:
- If the container exited with code 137, it was OOM-killed. Increase memory limits.
- If the container exited with code 1, check logs for a configuration or startup error.
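
The assessment rules above can be written as a small decision function. This is a minimal sketch, not part of any validator tooling: the function name is hypothetical, and it assumes you have already read the pod phase, the container's waiting reason, and the last exit code from the Kubernetes pod status. Treating "not scheduled" as the `Pending` phase is also an assumption.

```python
from typing import Optional


def triage_validator_pod(
    phase: str,
    waiting_reason: Optional[str] = None,
    exit_code: Optional[int] = None,
) -> str:
    """Map an observed pod state to the next step named in this runbook."""
    # Exit codes are the most specific signal, so check them first.
    if exit_code == 137:
        return "OOM-killed: increase memory limits"
    if exit_code == 1:
        return "check logs for a configuration or startup error"
    if waiting_reason == "CrashLoopBackOff":
        return "check logs for the root cause"
    # Assumption: an unscheduled pod is reported with phase "Pending".
    if phase == "Pending":
        return "check resource quotas and node availability"
    return "no runbook match: use the incident response template"
```

A crash-looping pod whose last exit code was 137 is reported as OOM-killed, because the exit code says more about the cause than the waiting reason does.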
Runbook: Traffic Exhausted
Detection: Transactions fail with PARTICIPANT_TRAFFIC_BELOW_LIMIT or MEDIATOR_SAYS_TX_TIMED_OUT where your validator is in the unresponsiveParties list.
Assessment:
If availableTraffic is 0 or traffic limits are exhausted, you must purchase traffic or enable auto-top-up before transactions can proceed normally.
Action:
- Purchase traffic immediately:
- Enable auto-top-up to prevent recurrence:
- Check that the validator has sufficient Canton Coin balance to fund traffic purchases.
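
The detection and assessment conditions for this runbook can be sketched as two small checks. This is an illustration under stated assumptions: the function names are hypothetical, the error text and the unresponsiveParties list are assumed to have been extracted already from the failed transaction, and the party identifier in the usage note is made up.

```python
from typing import List

TRAFFIC_ERROR = "PARTICIPANT_TRAFFIC_BELOW_LIMIT"
TIMEOUT_ERROR = "MEDIATOR_SAYS_TX_TIMED_OUT"


def is_traffic_exhaustion(
    error_text: str,
    unresponsive_parties: List[str],
    validator_party: str,
) -> bool:
    """Return True if a failure matches this runbook's detection criteria."""
    if TRAFFIC_ERROR in error_text:
        return True
    # A mediator timeout only points at traffic if our own validator
    # is among the unresponsive parties.
    return TIMEOUT_ERROR in error_text and validator_party in unresponsive_parties


def needs_traffic_purchase(available_traffic: int) -> bool:
    """Return True if availableTraffic is exhausted (0 or below)."""
    return available_traffic <= 0
```

For example, `is_traffic_exhaustion("MEDIATOR_SAYS_TX_TIMED_OUT", ["validator-a"], "validator-a")` returns `True`, while the same timeout with a different unresponsive party does not match this runbook.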
Runbook: Database Disk Full
Detection: PostgreSQL errors containing No space left on device, or disk usage alerts from monitoring.
Assessment:
- Immediate relief — if the database is completely full, you may need to expand the PVC or free space before PostgreSQL will accept writes again.
- Enable or fix pruning:
- Run a manual VACUUM on PostgreSQL to reclaim space after pruning deletes rows:
Note:
VACUUM FULL locks tables and requires approximately the same amount of free space as the table being vacuumed. Schedule it during a maintenance window.
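
Because VACUUM FULL needs roughly as much free space as the table it rewrites, it is worth checking headroom before running it on a nearly full volume. The sketch below is one way to express that check. The function name and the 10% safety margin are assumptions, and the two byte counts are expected to come from your own measurements (for instance, the table size from PostgreSQL's `pg_total_relation_size` and the free space from the volume itself).

```python
def can_run_vacuum_full(
    table_bytes: int,
    free_bytes: int,
    margin_percent: int = 10,
) -> bool:
    """Return True if free space covers the table size plus a safety margin.

    Integer arithmetic avoids floating-point rounding at the boundary.
    The default margin of 10% is an illustrative assumption.
    """
    return free_bytes * 100 >= table_bytes * (100 + margin_percent)
```

With a 40 GiB table and 20 GiB free, the check fails, which suggests expanding the PVC or relying on plain VACUUM (which does not rewrite the table) first.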
Runbook: Failed Upgrade Rollback
Detection: After a helm upgrade, the validator is in a crash loop or producing UnrecoverableError messages.
Assessment:
- Do not restart repeatedly — this can worsen state corruption.
- Roll back to the previous working Helm release:
- If the rollback involves a synchronizer migration (Type 3 upgrade), you may also need to switch the database name back:
- Verify the rollback restored the previous container images:
- Once stable, investigate the upgrade failure before reattempting. Common causes: missing migration configuration, wrong database name, or incompatible values file.
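
Choosing the revision to roll back to can be sketched as follows. This is a hedged illustration, not an official procedure: it assumes the release history is available as JSON from `helm history <release> -o json`, that each entry carries `revision` and `status` fields, and that a revision marked `superseded` or `deployed` before the current one was previously working. The function name is hypothetical, so confirm the result against your own release history before rolling back.

```python
import json
from typing import Optional

# Statuses assumed to mean "this revision was installed successfully at some point".
GOOD_STATUSES = {"superseded", "deployed"}


def previous_good_revision(history_json: str) -> Optional[int]:
    """Return the newest revision older than the current one with a good status."""
    history = json.loads(history_json)
    if not history:
        return None
    current = max(entry["revision"] for entry in history)
    candidates = [
        entry["revision"]
        for entry in history
        if entry["revision"] < current and entry["status"] in GOOD_STATUSES
    ]
    return max(candidates) if candidates else None
```

The returned number is the revision you would pass to `helm rollback`. A result of `None` means there is no earlier good revision in the history, in which case a rollback is not an option.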
Runbook: CometBFT Consensus Stall
Detection: This applies to Super Validator (SV) operators running CometBFT nodes. Block height stops advancing, or the CometBFT status endpoint shows no new blocks.
Assessment:
If latest_block_height has not changed in several minutes and catching_up is false, consensus has stalled.
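
The stall condition can be checked by comparing two status snapshots taken several minutes apart. The sketch below assumes each snapshot is the parsed JSON body of the CometBFT status endpoint with the fields nested under `result.sync_info`. The function name is hypothetical, and taking the snapshots far enough apart is left to the caller.

```python
from typing import Any, Dict


def consensus_stalled(earlier: Dict[str, Any], later: Dict[str, Any]) -> bool:
    """Return True if block height did not advance and the node is not catching up."""
    before = earlier["result"]["sync_info"]
    after = later["result"]["sync_info"]
    # Heights may arrive as strings in the JSON response, so convert before comparing.
    height_unchanged = int(after["latest_block_height"]) <= int(before["latest_block_height"])
    return height_unchanged and not after["catching_up"]
```

A node that reports `catching_up` as true is still syncing, so an unchanged height in that state is not treated as a consensus stall.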
Action:
- Check the number of connected peers:
  If peers have dropped below the threshold for consensus (more than 2/3 of validator voting power), blocks cannot be produced.
- Check for validator signing issues:
- If your own CometBFT node is the one not signing, restart it:
- If the stall is network-wide (multiple SV nodes are not signing), coordinate with other SV operators through the #validator-operations Slack channel.
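
To judge whether a stall is local or network-wide, it helps to see whether the validators that are still signing hold enough voting power. The sketch below illustrates that arithmetic only. It assumes you have already collected each validator's voting power and the set of addresses currently signing; how you obtain them is not shown, and the names and addresses are hypothetical. CometBFT needs strictly more than two thirds of total voting power to commit a block.

```python
from typing import Dict, List, Set


def has_quorum(voting_power: Dict[str, int], signing: Set[str]) -> bool:
    """Return True if signing validators hold more than 2/3 of total voting power."""
    total = sum(voting_power.values())
    signed = sum(power for addr, power in voting_power.items() if addr in signing)
    # Integer comparison avoids floating-point error at the 2/3 boundary.
    return total > 0 and 3 * signed > 2 * total


def missing_signers(voting_power: Dict[str, int], signing: Set[str]) -> List[str]:
    """Return the validators that are not signing, sorted for stable output."""
    return sorted(addr for addr in voting_power if addr not in signing)
```

If `missing_signers` returns only your own node, a restart is the likely fix. If it returns several SV nodes and `has_quorum` is false, the stall is network-wide and needs coordination.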