Troubleshooting (Privacy‑First)¶
Summary¶
This page helps you diagnose and fix common issues when building and running Meridian Runtime, with minimal, privacy‑respecting steps. It also cross‑links to the API Reference where appropriate and includes corroborated metrics/log event names emitted by the runtime.
If a problem isn't covered here, open an issue with full context (OS, Python, commands, logs) or start a discussion.
Quick checklist¶
- Verify environment
  - Python 3.11+ is installed and active in your virtualenv.
  - Dependencies are installed via `uv sync` (or `pip install -e ".[dev]"`).
- Reproduce with a minimal example
  - Simplify the graph: few nodes, clear edge capacities/policies.
  - Replace real data with dummy values; remove payload contents.
- Collect redacted artifacts
  - Structured logs: redact values; keep keys, counts, and policy names.
  - Config snippets: include only relevant keys; mask secrets.
  - Metrics snapshots: numeric values; avoid PII in labels.
- Compare observed vs expected
  - Write down what actually happens vs what you expect.
- Sanity checks
  - Lint/type/tests pass:
    - `make lint`
    - `uv run pytest -q`
  - Docs build:
    - `make docs-build`
    - `uv run mkdocs build`
Common issues and fixes¶
Environment and imports¶
Symptoms
- `ModuleNotFoundError` for project modules or extras.
- Examples crash or can't import modules.
Fix
- Ensure your env is active:
  - `source .venv/bin/activate` (or platform equivalent)
- Re-install dependencies:
  - `uv sync`
- Verify Python version:
  - `python --version` (must be 3.11+)
- Run examples from the root of the examples repository so import paths resolve:
  - `git clone https://github.com/GhostWeaselLabs/meridian-runtime-examples.git && cd meridian-runtime-examples && uv run python examples/hello_graph/main.py`
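The environment checks above can also be scripted. This is a minimal sketch using only the standard library; the helper names are illustrative, and the `meridian` import name in the usage section is an assumption, so substitute whatever module name your installation exposes.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 11)  # minimum version stated on this page


def python_ok(version=None, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version meets the minimum."""
    version = sys.version_info if version is None else version
    return tuple(version[:2]) >= tuple(minimum)


def module_available(name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    status = "ok" if python_ok() else "too old"
    print("python", sys.version.split()[0], status)
    # "meridian" is an assumed import name, not confirmed by this page.
    print("meridian importable:", module_available("meridian"))
```

If `module_available` returns `False` inside an activated virtualenv, re-running `uv sync` is the first thing to try.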
Edge overflow or backpressure surprises¶
What the runtime does¶
- Edges are capacity‑bounded FIFO queues with policy‑controlled overflow (see API: Edge).
- If you don't provide a policy on enqueue, the runtime applies internal policy implementations or the edge's configured `default_policy`.
- Runtime behavior and outcomes (see API: PutResult):
  - `Block` → `PutResult.BLOCKED`: producer should yield/wait when full.
  - `Drop` → `PutResult.DROPPED`: item is discarded when full.
  - `Latest` → `PutResult.REPLACED`: keep only the most recent item when full.
  - `Coalesce(fn)` → `PutResult.COALESCED`: merge old/new via `fn` when full.
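To build intuition for the four outcomes listed above, here is a toy model of a bounded queue. It is a sketch, not the Meridian `Edge` implementation: `ToyEdge`, the string policy names, and `PutResult.OK` are hypothetical, and the exact `Latest` (clear, then keep the new item) and `Coalesce` (merge into the tail) mechanics are assumptions made for illustration.

```python
from collections import deque
from enum import Enum


class PutResult(Enum):
    OK = "ok"  # assumed name for the "queue had room" case
    BLOCKED = "blocked"
    DROPPED = "dropped"
    REPLACED = "replaced"
    COALESCED = "coalesced"


class ToyEdge:
    """Illustrative bounded FIFO; not the Meridian Edge class."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.items = deque()

    def put(self, item, policy="block", fn=None):
        if len(self.items) < self.capacity:
            self.items.append(item)
            return PutResult.OK
        if policy == "block":
            return PutResult.BLOCKED  # caller should wait and retry
        if policy == "drop":
            return PutResult.DROPPED  # the new item is discarded
        if policy == "latest":
            self.items.clear()  # keep only the newest item
            self.items.append(item)
            return PutResult.REPLACED
        if policy == "coalesce":
            # Merge the new item into the most recently queued one.
            self.items[-1] = fn(self.items[-1], item)
            return PutResult.COALESCED
        raise ValueError(f"unknown policy: {policy}")
```

Stepping through `put` calls on a small `ToyEdge(2)` makes it easier to predict which log event and counter to expect when a real edge fills up.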
Metrics¶
(emitted per edge_id: "src_node:src_port->dst_node:dst_port")
- `edge_enqueued_total`
- `edge_dequeued_total`
- `edge_dropped_total`
- `edge_queue_depth` (gauge)
- `edge_blocked_time_seconds` (histogram)
Representative log events¶
- `edge.enqueue`
- `edge.replace`
- `edge.coalesce`
- `edge.coalesce_error`
- `edge.validation_failed`
- `edge.drop`
- `edge.blocked`
- `edge.dequeue`
Symptoms
- Messages stall or appear to drop.
- Queue depth near capacity; lower throughput than expected.
Likely causes
- Capacity too small for bursty workloads.
- Policy mismatch for the workload (`Drop` vs `Block` vs `Latest` vs `Coalesce`).
- Upstream/downstream rate mismatch or blocking operations.
What to try
- Inspect and adjust edge configuration
  - Increase capacity for bursty edges.
  - Use `Block` for strict delivery (be aware of backpressure).
  - Use `Latest` for freshness when only the newest matters.
  - Use `Coalesce(fn)` to merge bursts into fewer items.
- Balance node workloads
  - Move blocking I/O to async or dedicated executors; batch where appropriate.
- Add metrics/logs
  - Track enqueued/dequeued/dropped, blocked time, and depth per edge.
- Validate fairness
  - Ensure scheduler policies aren't starving a node.
What to collect (redacted)
- Edge definitions with capacity and policy names.
- Logs:
  - `event="edge.enqueue" edge_id="A:out->B:in"`
  - `event="edge.replace" edge_id="A:out->B:in"`
  - `event="edge.coalesce" edge_id="A:out->B:in"`
  - `event="edge.drop" edge_id="A:out->B:in"`
- Metrics snapshot:
  - `edge_queue_depth`, `edge_dropped_total`, `edge_blocked_time_seconds`.
Scheduler starvation or unfairness¶
Symptoms
- Certain nodes rarely run or lag behind others.
- Control‑plane tasks take too long to apply.
Likely causes
- Long‑running tasks monopolize execution.
- Blocking operations in async contexts causing stalls.
- High contention on shared resources.
What to try
- Audit node work
  - Break long tasks into smaller units.
  - Use asyncio‑friendly APIs; offload blocking calls.
- Control‑plane prioritization
  - Ensure control operations have clear priority.
- Tune fairness strategy
  - Try round‑robin vs weighted fairness; adjust weights if supported.
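The advice to offload blocking calls can be illustrated with the standard library alone. The sketch below assumes your node code runs on an asyncio event loop (this page does not say how Meridian schedules nodes); `blocking_work` is a hypothetical stand-in for any synchronous call.

```python
import asyncio
import time


def blocking_work(delay):
    """Stand-in for a blocking call (file I/O, a sync client, ...)."""
    time.sleep(delay)
    return "done"


async def main():
    # to_thread runs the blocking function in a worker thread, so the
    # event loop stays free to run other tasks in the meantime.
    work = asyncio.create_task(asyncio.to_thread(blocking_work, 0.2))
    beats = 0
    while not work.done():
        await asyncio.sleep(0.01)  # other coroutines get scheduled here
        beats += 1
    return await work, beats


if __name__ == "__main__":
    print(asyncio.run(main()))
```

If `blocking_work` were called directly inside the coroutine instead, the loop would not get a chance to run anything else until it returned, which is one way a single node can starve its neighbours.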
What to collect (redacted)
- Summary of node durations (ranges).
- Structured logs around scheduling decisions (no payloads).
- Minimal graph snapshot: node names, edge topology.
- Scheduler metrics: `scheduler_runnable_nodes`, `scheduler_loop_latency_seconds`.
Shutdown hangs or unclean teardown¶
Symptoms
- Runtime fails to exit promptly.
- Nodes' `on_stop` hooks not called or appear stuck.
- In‑flight work not draining.
Likely causes
- Blocking in shutdown paths.
- Pending tasks waiting on unbounded/stuck conditions.
- Missing timeouts or cancellation guards.
What to try
- Add timeouts to `on_stop` and drain operations
- Make `on_stop` idempotent
  - Avoid enqueuing new work on shutdown.
- Emit lifecycle logs
  - Provide start/end markers for shutdown sequences.
- Use per‑edge policies
  - Drain or drop remaining work explicitly.
- Implement proper shutdown handling.
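A minimal sketch of timeout‑guarded, idempotent shutdown follows. It assumes an awaitable `on_stop` hook; this page does not state whether Meridian's hook is sync or async, and `ToyNode` and `shutdown` are hypothetical names used only for illustration.

```python
import asyncio


class ToyNode:
    """Hypothetical node with an idempotent, awaitable on_stop hook."""

    def __init__(self, name, stop_delay):
        self.name = name
        self.stop_delay = stop_delay
        self.stopped = False

    async def on_stop(self):
        if self.stopped:  # idempotent: a second call is a no-op
            return
        await asyncio.sleep(self.stop_delay)  # stand-in for draining work
        self.stopped = True


async def shutdown(nodes, timeout):
    """Stop each node, allowing at most `timeout` seconds per node."""
    pending = []
    for node in nodes:
        try:
            await asyncio.wait_for(node.on_stop(), timeout)
        except asyncio.TimeoutError:
            pending.append(node.name)  # record what did not finish in time
    return pending
```

The returned list of node names is the kind of "still pending after timeout" detail that is safe to include in a report, since it contains no payload data.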
What to collect (redacted)
- Logs:
  - `event="scheduler.shutdown_start"`
  - `event="scheduler.shutdown_requested"`
  - `event="scheduler.shutdown_complete"`
  - `event="node.start"`
  - `event="node.stop"`
- Note tasks still pending after timeout.
Validation errors or unexpected payload handling¶
What the runtime does
- If an edge has a `PortSpec`, values (or `Message.payload`) are validated during enqueue. A mismatch logs `edge.validation_failed` and raises `TypeError`.
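The behavior described above can be modelled with a small stand‑in. This sketch is not the runtime's `PortSpec` logic: `validate_on_enqueue` and the `toy.edge` logger are hypothetical. It shows one privacy‑respecting way to report a mismatch, logging type names and the edge id but never the value itself.

```python
import logging

logger = logging.getLogger("toy.edge")  # hypothetical logger name


def validate_on_enqueue(edge_id, value, expected_type):
    """Raise TypeError on mismatch, logging metadata only (never the value)."""
    if not isinstance(value, expected_type):
        logger.warning(
            "edge.validation_failed",
            extra={
                "edge_id": edge_id,
                "expected": expected_type.__name__,
                "actual": type(value).__name__,
            },
        )
        raise TypeError(
            f"{edge_id}: expected {expected_type.__name__}, "
            f"got {type(value).__name__}"
        )
    return value
```

Because both the log record and the exception message carry only type names, they can be pasted into an issue without redaction.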
Symptoms
- Frequent validation failures.
- Error events too sparse or too verbose.
Likely causes
- Mismatched schema vs runtime data shape.
- Validation at the wrong boundary.
- Error policy not configured as intended.
What to try
- Validate at boundaries
  - Use `PortSpec` at ingress/egress and, when applicable, schema validators in your producer/consumer code.
- Tighten/loosen schema choices
  - Optional vs required fields as systems evolve.
- Confirm privacy posture
  - No payloads in error logs; attach only metadata.
- Use subgraph validation (`Subgraph.validate()`) before execution.
What to collect (redacted)
- Schema shape (names and types only; no values).
- Error logs without payload contents:
  - `event="edge.validation_failed" edge_id="A:out->B:in"`
- Validation issues from `Subgraph.validate()`.
Logging/tracing too verbose or too sparse¶
Symptoms
- High log volume impacting performance.
- Not enough information to diagnose issues.
What to try
- Right‑size log levels
  - `INFO` for lifecycle, `WARN`/`ERROR` for anomalies, `DEBUG` sparingly.
- Adopt key conventions
  - `event`, `node_id`, `edge_id`, `policy`, `counts`, `durations`.
- Sampling
  - Apply sampling for repetitive debug events.
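One way to sample repetitive debug events is a filter on the standard `logging` module. The sketch below assumes your logs pass through Python's `logging` (this page does not describe Meridian's own observability configuration); `SampleFilter` is a hypothetical helper.

```python
import logging


class SampleFilter(logging.Filter):
    """Pass every Nth DEBUG record; always pass INFO and above."""

    def __init__(self, every):
        super().__init__()
        if every < 1:
            raise ValueError("every must be at least 1")
        self.every = every
        self.seen = 0

    def filter(self, record):
        if record.levelno > logging.DEBUG:
            return True  # never sample away lifecycle or anomaly events
        self.seen += 1
        # Keep the 1st, (N+1)th, (2N+1)th, ... debug record.
        return (self.seen - 1) % self.every == 0


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG)
    log = logging.getLogger("toy.sampling")
    log.addFilter(SampleFilter(every=100))
    for _ in range(1000):
        log.debug("edge.enqueue")  # only 10 of these are emitted
```

Sampling only at `DEBUG` keeps volume down while leaving the events most useful for triage untouched.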
What to collect (redacted)
- A short sequence (last 200–500 lines) with structured entries.
- Note which events are missing for diagnosis.
Performance regressions¶
Symptoms
- Increased latency or reduced throughput vs a previous run.
- Hot CPU or I/O saturation.
Likely causes
- New blocking paths introduced.
- Higher cardinality in metric labels or verbose logging.
- Insufficient edge capacity or missing coalescing for bursty flows.
What to try
- Revert to known‑good settings
  - Compare metrics before/after a change.
- Profile hot paths (locally)
  - Identify blocking calls; offload or batch.
- Reduce label cardinality
  - Keep metrics labels low and stable.
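Comparing metrics before and after a change is easier with a small helper. The sketch below assumes each snapshot has already been exported as a plain dict of metric name to number; how you obtain that dict depends on your metrics backend, and `compare_snapshots` is a hypothetical name.

```python
def compare_snapshots(before, after):
    """Return {metric: (before, after, percent_change)} for shared metrics."""
    report = {}
    for name in sorted(before.keys() & after.keys()):
        old, new = before[name], after[name]
        # Percent change is undefined when the baseline is zero.
        change = None if old == 0 else (new - old) / old * 100.0
        report[name] = (old, new, change)
    return report


if __name__ == "__main__":
    # Synthetic numbers for illustration only.
    before = {"edge_queue_depth": 4, "edge_dropped_total": 0}
    after = {"edge_queue_depth": 8, "edge_dropped_total": 12}
    for name, row in compare_snapshots(before, after).items():
        print(name, row)
```

The resulting table contains only metric names and numbers, so it fits the redaction guidance on this page.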
What to collect (redacted)
- Before/after metrics: `throughput`, `queue_depth`, dropped counts, latency percentiles (if available).
- Configuration diffs: policy names, capacities.
- Scheduler metrics: `scheduler_loop_latency_seconds`, `scheduler_runnable_nodes`.
Debugging¶
- Enable debug logs in your observability configuration.
- Use metrics to inspect edge depths and drops.
- Run tests and examples through `uv run` to avoid path issues:
  - `uv run pytest`
  - `git clone https://github.com/GhostWeaselLabs/meridian-runtime-examples.git && cd meridian-runtime-examples && uv run python examples/pipeline_demo/main.py`
Key metrics to monitor:
- Edge metrics: `edge_queue_depth`, `edge_dropped_total`, `edge_blocked_time_seconds`
- Scheduler metrics: `scheduler_runnable_nodes`, `scheduler_loop_latency_seconds`, `scheduler_priority_applied_total`
- Node metrics: `node_messages_total`, `node_errors_total`, `node_tick_duration_seconds`
Minimal reproduction strategy¶
- Start small
  - One or two nodes, one edge, a single message type.
- Replace data
  - Use shape‑equivalent dummy values; avoid real payloads.
- Fix the seed
  - Avoid non‑determinism in tests unless necessary.
- Log only essentials
  - Lifecycle transitions, scheduling decisions, edge enqueues/replaces/coalesces, error summaries.
Conceptual example (sanitized)
- Configure an edge with `capacity=10` and `policy="Drop"`.
- Send 100 synthetic messages; confirm dropped counts rise.
- Observe metrics and expected behavior.
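The conceptual example above can be rehearsed without the runtime at all. This sketch simulates a capacity‑bounded queue with a drop‑on‑full rule and no consumer; `simulate_drop` is a hypothetical helper, and the counter keys merely reuse the metric names from this page for readability.

```python
from collections import deque


def simulate_drop(capacity, messages):
    """Enqueue `messages` items with no consumer; count drops when full."""
    queue = deque()
    counters = {"edge_enqueued_total": 0, "edge_dropped_total": 0}
    for i in range(messages):
        if len(queue) < capacity:
            queue.append(i)  # synthetic payload: just an index
            counters["edge_enqueued_total"] += 1
        else:
            counters["edge_dropped_total"] += 1
    counters["edge_queue_depth"] = len(queue)
    return counters


if __name__ == "__main__":
    print(simulate_drop(capacity=10, messages=100))
```

With `capacity=10` and 100 messages, the first 10 are enqueued and the remaining 90 are dropped. If the real edge reports something different under the same conditions, that difference is worth including in the report.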
Safe artifacts for triage¶
- Environment
  - OS, Python, Meridian versions; tooling versions.
- Graph topology snapshot
  - Node/edge names, capacities, policies; no payload schemas required.
- Logs (redacted)
  - Keep keys; redact values: `<REDACTED>`.
- Metrics
  - Numeric counters/gauges/histograms; avoid PII in labels.
- Config differences
  - Show changed keys and enum/boolean values; mask secrets or replace with `CHECKSUM(...)` or `PLACEHOLDER`.
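The "keep keys, redact values" rule for logs can be applied mechanically. The helper below is a sketch under stated assumptions: the allow‑list in `SAFE_KEYS` is hypothetical, and treating every numeric value as safe is a simplification, so review the output before sharing if your numbers could identify someone.

```python
SAFE_KEYS = {"event", "edge_id", "node_id", "policy"}  # assumed allow-list


def redact(record, safe_keys=SAFE_KEYS):
    """Keep every key; keep values only for allow-listed keys and numbers."""
    out = {}
    for key, value in record.items():
        if isinstance(value, dict):
            out[key] = redact(value, safe_keys)  # recurse into nested dicts
        elif key in safe_keys or isinstance(value, (int, float)):
            out[key] = value
        else:
            out[key] = "<REDACTED>"
    return out


if __name__ == "__main__":
    entry = {
        "event": "edge.drop",
        "edge_id": "A:out->B:in",
        "count": 3,
        "payload": "synthetic-value",
    }
    print(redact(entry))
```

Node and edge names pass through unchanged here, so anonymize them first if they are proprietary identifiers.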
Never include
- Secrets, tokens, credentials.
- PII or business data payloads.
- Proprietary identifiers without anonymization.
Diagnostics to include when asking for help¶
- OS and Python version.
- Exact command(s) you ran.
- Minimal snippet or steps to reproduce.
- Full error output and any relevant logs (redacted).
- Any local changes or configuration differences.
Known pitfalls¶
- Mixing package managers
  - Prefer `uv` for this repo to avoid environment drift; don't interleave `pip` unless necessary.
- Stale caches
  - Clear `.pytest_cache`, `.mypy_cache`, `.ruff_cache`, and any build artifacts if behavior seems inconsistent.
- Renamed documentation
  - After file moves/renames, update internal links and nav in `mkdocs.yml`. In strict mode, broken links abort the build.
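For the stale‑cache pitfall above, a small script can clear the named directories in one step. This is a sketch: `clear_caches` is a hypothetical helper, it only looks directly under the directory you pass in, and it leaves build artifacts alone because their location is not specified on this page.

```python
import shutil
from pathlib import Path

CACHE_DIRS = (".pytest_cache", ".mypy_cache", ".ruff_cache")


def clear_caches(root, names=CACHE_DIRS):
    """Delete the named cache directories directly under `root`."""
    removed = []
    for name in names:
        path = Path(root) / name
        if path.is_dir():
            shutil.rmtree(path)  # removes the directory and its contents
            removed.append(name)
    return removed


if __name__ == "__main__":
    print("removed:", clear_caches("."))
```

Run it from the repository root; it returns the names it actually removed, so an empty list means there was nothing stale to clear.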