Meridian Runtime Troubleshooting Guide (Privacy‑First)

Owner: GhostWeasel (Lead: doubletap-dave)
Audience: Users and contributors
Status: Stable

This guide helps you diagnose common issues with the Meridian runtime while protecting sensitive information. It emphasizes minimal, reproducible steps and privacy‑respecting artifacts.

Key principles
- Keep it minimal: reduce to the smallest failing example.
- Keep it safe: do not share payload contents, secrets, or PII.
- Keep it structured: use clear steps, observed vs expected behavior, and redacted logs.


1) Quick Checklist

  • Verify environment
      • Python 3.11+, toolchain installed as documented.
      • Meridian version and dependency versions noted.
  • Reproduce with a minimal example
      • Simplify the graph: few nodes, clear edge bounds/policies.
      • Replace real data with dummy values; remove payload contents.
  • Collect redacted artifacts
      • Structured logs: redact values; keep keys, counts, and policy names.
      • Config snippets: include only relevant keys; mask secrets.
      • Metrics snapshots: numeric values; avoid PII in labels.
  • Compare observed vs expected
      • Write down what actually happens vs what you expect.
  • Consult docs
      • Review the README, milestone docs (especially M0), and How to Report Issues.

2) Common Issues and Fixes

A) Edge Overflow or Backpressure Surprises

Symptoms
- Messages appear to stall or are being dropped.
- Logs show “edge overflow” or queue depth near bounds.
- Throughput lower than expected.

Likely Causes
- Edge bounds too small for bursty workloads.
- Overflow policy not aligned with workload (e.g., drop vs block vs latest vs coalesce).
- Upstream/downstream rate mismatch or blocking operations in critical paths.

What To Try
1) Inspect edge configuration:
   - Increase bounds for bursty edges.
   - Switch to “block” if you need strict delivery (with awareness of backpressure).
   - Use “latest” or “coalesce” for high‑frequency updates where only the latest state matters.
2) Balance node workloads:
   - Move blocking I/O to async or dedicated executors.
   - Batch processing where appropriate.
3) Add lightweight metrics/logs:
   - Track enqueue/dequeue counts and overflow events per edge.
4) Validate fairness:
   - Ensure scheduler policies aren’t starving a node.
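The policies trade off differently once an edge fills up. A minimal, framework-independent sketch of “drop” versus “latest” semantics on a bounded buffer (this is illustrative only, not the Meridian edge API; real edges apply their configured policy internally):

```python
from collections import deque

def offer(buffer: deque, item, bound: int, policy: str) -> bool:
    """Try to enqueue item; apply an overflow policy when the buffer is full.

    Returns True if the item was accepted, False if it was dropped.
    Illustrative only; not the Meridian API.
    """
    if len(buffer) < bound:
        buffer.append(item)
        return True
    if policy == "drop":
        return False                  # reject the newest item
    if policy == "latest":
        buffer.popleft()              # one reading: evict the oldest to keep the newest
        buffer.append(item)
        return True
    raise ValueError(f"policy not modeled here: {policy}")
```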

What to Collect (Redacted)
- Edge definitions with bounds and policy names (no payload schemas required).
- Logs:
  - event="edge_overflow", edge_id="E_X", policy="drop", dropped=123
- Metrics snapshot: queue_depth, overflow_count.

B) Scheduler Starvation or Unfairness

Symptoms
- Certain nodes rarely run or lag behind others.
- Control-plane tasks take too long to apply.

Likely Causes
- Misconfigured fairness or long‑running tasks monopolizing execution.
- Blocking operations in async contexts causing event loop stalls.
- High contention on shared resources.

What To Try
1) Audit node work:
   - Break long tasks into smaller units.
   - Use asyncio-friendly APIs; offload blocking calls.
2) Verify control-plane prioritization:
   - Ensure control operations have clear priority.
3) Tune fairness strategy:
   - If available, try round‑robin vs weighted fairness; adjust weights.
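For offloading blocking calls, a minimal sketch using standard asyncio (slow_lookup is a stand-in for your own blocking I/O, not a Meridian function):

```python
import asyncio
import time

def slow_lookup(key: str) -> str:
    """A blocking call (e.g., synchronous I/O) that would stall the event loop."""
    time.sleep(0.5)
    return f"value-for-{key}"

async def handle(key: str) -> str:
    # Run the blocking call in a worker thread so other work keeps running.
    return await asyncio.to_thread(slow_lookup, key)

async def main() -> None:
    results = await asyncio.gather(*(handle(k) for k in ("a", "b", "c")))
    print(results)

asyncio.run(main())
```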

What to Collect (Redacted)
- Summary of node durations (ranges).
- Structured logs around scheduling decisions (no payloads).
- Minimal graph snapshot: node names, edge topology (no secrets).

C) Shutdown Hangs or Unclean Teardown

Symptoms
- Runtime fails to exit promptly.
- Nodes’ on_stop hooks not called or appear stuck.
- In‑flight work not draining.

Likely Causes
- Blocking operations in shutdown paths.
- Pending tasks waiting on unbounded or stuck conditions.
- Missing timeouts or cancellation guards.

What To Try
1) Add timeouts to on_stop and drain operations.
2) Ensure idempotent on_stop; avoid new work on shutdown.
3) Emit logs for lifecycle transitions:
   - on_stop start/end markers.
4) Use policies to drain or drop remaining work per edge.
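A minimal sketch of a timeout guard around on_stop, assuming on_stop is an async callable (the stop_with_timeout wrapper and its log event names are illustrative; adapt to however your graph invokes on_stop):

```python
import asyncio
import logging

log = logging.getLogger("shutdown")

async def stop_with_timeout(node_id: str, on_stop, timeout: float = 5.0) -> None:
    """Run a node's on_stop coroutine, but never wait longer than `timeout` seconds."""
    log.info('event="node_stopping", node_id=%s', node_id)
    try:
        await asyncio.wait_for(on_stop(), timeout=timeout)
        log.info('event="node_stopped", node_id=%s', node_id)
    except asyncio.TimeoutError:
        # Note which nodes did not stop in time so teardown can proceed.
        log.warning('event="node_stop_timeout", node_id=%s', node_id)
```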

What to Collect (Redacted)
- Logs:
  - event="shutdown_initiated"
  - event="node_stopping", node_id=...
  - event="node_stopped", node_id=...
- Note any tasks still pending after timeout.

D) Validation Errors or Unexpected Payload Handling

Symptoms
- Frequent validation failures.
- Error events contain too little or too much detail.

Likely Causes
- Mismatched schema vs runtime data shape.
- Validation placed at incorrect boundary.
- Error policy not configured as intended.

What To Try
1) Validate at boundaries:
   - Apply TypedDict/Pydantic validators at ingress/egress points.
2) Tighten/loosen schema choices:
   - Optional fields vs required fields in evolving systems.
3) Confirm privacy posture:
   - Verify “no payloads in error events” is enforced; attach only metadata.
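A minimal sketch of boundary validation that reports only field names and error types, never values (SensorReading is a hypothetical schema; shown with Pydantic v2):

```python
from pydantic import BaseModel, ValidationError

class SensorReading(BaseModel):
    """Hypothetical ingress schema: names and types only, no real data."""
    sensor_id: str
    value: float

def validate_ingress(raw: dict) -> SensorReading | None:
    """Validate at the boundary; on failure, log metadata only (no payload contents)."""
    try:
        return SensorReading.model_validate(raw)
    except ValidationError as exc:
        fields = [".".join(map(str, e["loc"])) for e in exc.errors()]
        types = [e["type"] for e in exc.errors()]
        print(f'event="validation_error", fields={fields}, error_types={types}')
        return None
```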

What to Collect (Redacted)
- Schema shape (names and types only; no values).
- Error logs without payload contents:
  - event="validation_error", node_id="N_A", error_type="SchemaMismatch"

E) Logging/Tracing Too Verbose or Too Sparse

Symptoms
- High log volume impacting performance.
- Not enough information to diagnose issues.

Likely Causes
- Inappropriate log levels or missing event coverage.
- Excessive debug logs in hot paths.

What To Try
1) Right-size log levels:
   - INFO for lifecycle, WARN/ERROR for anomalies, DEBUG sparingly.
2) Adopt key conventions:
   - event, node_id, edge_id, policy, counts, durations.
3) Sampling:
   - Apply sampling for repetitive debug events.
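A minimal structured-logging sketch using the standard library; the field names follow the key conventions above and are the only assumption here:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line with the conventional keys."""

    FIELDS = ("node_id", "edge_id", "policy", "count", "duration_ms")

    def format(self, record: logging.LogRecord) -> str:
        payload = {"level": record.levelname, "event": record.getMessage()}
        for key in self.FIELDS:
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

log = logging.getLogger("meridian.example")
log.info("edge_overflow", extra={"edge_id": "E_X", "policy": "drop", "count": 123})
# -> {"level": "INFO", "event": "edge_overflow", "edge_id": "E_X", "policy": "drop", "count": 123}
```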

What to Collect (Redacted)
- A short sequence (last 200–500 lines) with structured entries.
- Note which events are missing for diagnosis.

F) Performance Regressions

Symptoms
- Increased latency or reduced throughput vs previous runs.
- Hot CPU or I/O saturation.

Likely Causes
- New blocking paths introduced.
- Higher cardinality in metric labels or verbose logging.
- Insufficient edge bounds or inefficient coalescing.

What To Try
1) Revert to known-good settings:
   - Compare metrics before/after a change.
2) Profile hot paths (locally):
   - Identify blocking calls; offload or batch.
3) Reduce label cardinality:
   - Keep metrics labels low and stable.
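For step 2, a minimal local profiling sketch using the standard library (hot_path is a stand-in for the code you suspect has regressed):

```python
import cProfile
import pstats

def hot_path() -> None:
    """Stand-in for the suspected hot path."""
    total = 0
    for i in range(1_000_000):
        total += i * i

with cProfile.Profile() as profiler:   # context-manager form (Python 3.8+)
    hot_path()

stats = pstats.Stats(profiler).sort_stats("cumulative")
stats.print_stats(10)                  # top 10 entries by cumulative time
```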

What to Collect (Redacted)
- Before/after metrics: throughput, queue_depth, overflow_count, latency percentiles (if available).
- Configuration diffs: policy names, bounds.


3) Minimal Reproduction Strategy

1) Start small:
   - One or two nodes, one edge, a single message type.
2) Replace data:
   - Use “shape‑equivalent” dummy values; avoid real payloads.
3) Fix the seed:
   - Avoid non‑determinism in tests unless necessary.
4) Log only essentials:
   - Lifecycle transitions, scheduling decisions, edge overflows, error summaries.

Example structure (sanitized, conceptual)
- Configure an edge with bound=10 and policy="drop".
- Send 100 synthetic messages; confirm overflow events.
- Observe metrics counts and expected behavior.
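A framework-independent sketch of that structure, with a bounded deque standing in for the edge (no Meridian APIs; the counts mirror the conceptual example above):

```python
from collections import deque

BOUND, TOTAL = 10, 100
edge = deque()
accepted = dropped = 0

for seq in range(TOTAL):
    message = {"seq": seq, "value": "dummy"}   # shape-equivalent dummy payload
    if len(edge) < BOUND:
        edge.append(message)
        accepted += 1
    else:
        dropped += 1                           # "drop" policy: reject when full

print(f'event="edge_overflow_summary", accepted={accepted}, '
      f'dropped={dropped}, queue_depth={len(edge)}')
# With no consumer draining the edge, expect accepted=10 and dropped=90.
```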


4) Safe Artifacts for Triage

  • Environment
      • OS, Python, Meridian versions; tooling versions.
  • Graph topology snapshot
      • Node/edge names, bounds, policies; no payload schemas required.
  • Logs (redacted)
      • Keep keys; redact values (e.g., replace them with [REDACTED]); see the redaction sketch below.
  • Metrics
      • Numeric counters/gauges/histograms; avoid PII in labels.
  • Config differences
      • Show changed keys and enum/boolean values; mask secrets or replace with CHECKSUM(...) or PLACEHOLDER.

Never include
- Secrets, tokens, credentials.
- PII or business data payloads.
- Proprietary identifiers without anonymization.
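A minimal redaction sketch along these lines (the SAFE_KEYS allowlist and the [REDACTED] placeholder are illustrative assumptions; adjust to your own log keys):

```python
from typing import Any

# Keys whose values are structural/numeric and safe to share; everything else is masked.
SAFE_KEYS = {"event", "node_id", "edge_id", "policy", "count", "dropped", "duration_ms"}

def redact(record: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of a structured log record with unsafe values masked."""
    return {k: (v if k in SAFE_KEYS else "[REDACTED]") for k, v in record.items()}

print(redact({"event": "validation_error", "node_id": "N_A", "raw_payload": "secret"}))
# {'event': 'validation_error', 'node_id': 'N_A', 'raw_payload': '[REDACTED]'}
```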


5) When to Escalate

Escalate with a request for help if:
- You cannot produce a minimal repro.
- The runtime hangs or corrupts state even after simplification.
- You suspect a scheduler fairness bug, or data loss that contradicts the configured policy.

Provide:
- A short, sanitized repro or script.
- Redacted logs (last 200–500 lines).
- Environment details and exact versions.
- Observed vs expected behavior.


6) FAQs

Q: How do I check if I’m hitting edge bounds?
Enable structured logs for enqueue/dequeue and overflow events. Inspect queue depth metrics if available.

Q: Should I include payload schemas in a report?
Prefer not to. If necessary, share only field names and types, never values.

Q: What if my logs contain sensitive info?
Redact values. Keep structure, keys, and counts. Replace sensitive strings with [REDACTED] or similar placeholders.

Q: How do I handle blocking I/O?
Use async variants or run in an executor. Ensure on_stop has timeouts and is idempotent.


7) Related Documents

  • How to Report Issues: docs/support/HOW-TO-REPORT-ISSUES.md
  • Governance and Overview (M0): docs/plan/M0-governance-and-overview.md
  • Milestones and Plans: docs/plan/
  • Contributing Guide: docs/contributing/CONTRIBUTING.md

End of document.