Troubleshooting (Privacy‑First)¶

Summary¶

This page helps you diagnose and fix common issues when building and running Meridian Runtime, using minimal, privacy‑respecting steps. It cross‑links to the API Reference where appropriate and lists the metric and log event names the runtime emits.

If a problem isn't covered here, open an issue with full context (OS, Python version, commands, redacted logs) or start a discussion.

Quick checklist¶

Verify environment
- Python 3.11+ is installed and active in your virtualenv.
- Dependencies installed via uv sync (or pip install -e ".[dev]").
Reproduce with a minimal example
- Simplify the graph: few nodes, clear edge capacities/policies.
- Replace real data with dummy values; remove payload contents.
Collect redacted artifacts
- Structured logs: redact values; keep keys, counts, and policy names.
- Config snippets: include only relevant keys; mask secrets.
- Metrics snapshots: numeric values; avoid PII in labels.
Compare observed vs expected
- Write down what actually happens vs what you expect.
Sanity checks
- Lint/type/tests pass:
  - make lint
  - uv run pytest -q
- Docs build:
  - make docs-build
  - uv run mkdocs build

Common issues and fixes¶

Environment and imports¶

Symptoms

ModuleNotFoundError for project modules or extras.
Examples crash or can't import modules.

Fix

Ensure your env is active:
- source .venv/bin/activate (or platform equivalent)
Re-install dependencies:
- uv sync
Verify Python version:
- python --version (must be 3.11+)
Run examples from the root of the examples repository with uv run, which executes them in that project's environment:
- git clone https://github.com/GhostWeaselLabs/meridian-runtime-examples.git && cd meridian-runtime-examples && uv run python examples/hello_graph/main.py
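The two environment checks above (Python 3.11+ and an active virtualenv) can also be scripted. The following is a minimal sketch using only the standard library; the function name check_environment is hypothetical and not part of Meridian Runtime.

```python
import sys

MIN_PYTHON = (3, 11)  # minimum version stated on this page


def check_environment() -> dict:
    """Report whether the interpreter meets the version floor and runs in a venv."""
    return {
        "python_ok": sys.version_info >= MIN_PYTHON,
        # Inside a virtualenv, sys.prefix points at the environment while
        # sys.base_prefix points at the interpreter it was created from.
        "in_virtualenv": sys.prefix != sys.base_prefix,
    }


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'FAILED'}")
```

Run it with the same command you use for the failing example (for instance through uv run) so that it inspects the interpreter that is actually in use.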

See also

Concepts overview

Edge overflow or backpressure surprises¶

What the runtime does¶

Edges are capacity‑bounded FIFO queues with policy‑controlled overflow (see API: Edge).
If you don't pass a policy on enqueue, the runtime falls back to the edge's configured default_policy or to its internal policy implementation.
Runtime behavior and outcomes (see API: PutResult):
- Block → PutResult.BLOCKED: producer should yield/wait when full.
- Drop → PutResult.DROPPED: item is discarded when full.
- Latest → PutResult.REPLACED: keep only the most recent item when full.
- Coalesce(fn) → PutResult.COALESCED: merge old/new via fn when full.
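To make the four outcomes concrete, here is a toy model of a capacity‑bounded queue. It is a sketch for building intuition, not Meridian's implementation: ToyEdge, try_put, the string policy names, and the extra OK result are all hypothetical. Which item Latest evicts and which item Coalesce merges into are assumptions made for this illustration; check API: Edge for the real behavior.

```python
from collections import deque
from enum import Enum, auto


class PutResult(Enum):
    # Toy stand-in that mirrors the outcome names listed on this page.
    OK = auto()         # assumption: result when the queue had room
    BLOCKED = auto()
    DROPPED = auto()
    REPLACED = auto()
    COALESCED = auto()


class ToyEdge:
    """Capacity-bounded FIFO illustrating the four overflow policies."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()

    def try_put(self, item, policy="block", fn=None):
        if len(self.items) < self.capacity:
            self.items.append(item)
            return PutResult.OK
        if policy == "block":
            return PutResult.BLOCKED       # caller should wait and retry
        if policy == "drop":
            return PutResult.DROPPED       # the new item is discarded
        if policy == "latest":
            self.items.clear()             # assumption: keep only the newest
            self.items.append(item)
            return PutResult.REPLACED
        if policy == "coalesce":
            newest = self.items.pop()      # assumption: merge into the tail
            self.items.append(fn(newest, item))
            return PutResult.COALESCED
        raise ValueError(f"unknown policy: {policy}")


edge = ToyEdge(capacity=2)
edge.try_put(1)
edge.try_put(2)
print(edge.try_put(3, "drop"), list(edge.items))
```

The point of the model is that only Block preserves every item; the other three trade completeness for bounded memory and latency.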

Metrics¶

(emitted per edge_id: "src_node:src_port->dst_node:dst_port")

edge_enqueued_total
edge_dequeued_total
edge_dropped_total
edge_queue_depth (gauge)
edge_blocked_time_seconds (histogram)

Representative log events¶

edge.enqueue
edge.replace
edge.coalesce
edge.coalesce_error
edge.validation_failed
edge.drop
edge.blocked
edge.dequeue

Symptoms

Messages stall or appear to drop.
Queue depth near capacity; lower throughput than expected.

Likely causes

Capacity too small for bursty workloads.
Policy mismatch for the workload (Drop vs Block vs Latest vs Coalesce).
Upstream/downstream rate mismatch or blocking operations.

What to try

Inspect and adjust edge configuration
- Increase capacity for bursty edges.
- Use Block when every item must be delivered; producers wait while the edge is full, so backpressure propagates upstream.
- Use Latest for freshness when only the newest matters.
- Use Coalesce(fn) to merge bursts into fewer items.
Balance node workloads
- Move blocking I/O to async or dedicated executors; batch where appropriate.
Add metrics/logs
- Track enqueued/dequeued/dropped, blocked time, and depth per edge.
Validate fairness
- Ensure scheduler policies aren't starving a node.

What to collect (redacted)

Edge definitions with capacity and policy names.
Logs:
- event="edge.enqueue" edge_id="A:out->B:in"
- event="edge.replace" edge_id="A:out->B:in"
- event="edge.coalesce" edge_id="A:out->B:in"
- event="edge.drop" edge_id="A:out->B:in"
Metrics snapshot: edge_queue_depth, edge_dropped_total, edge_blocked_time_seconds.
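Once you have a metrics snapshot for an edge, a few ratios make the symptoms easier to compare across runs. The helper below is a sketch: summarize_edge and the flat dictionary shape of the snapshot are hypothetical, and it assumes edge_enqueued_total counts accepted items only (so attempts = enqueued + dropped). Verify that assumption against your metrics backend.

```python
def summarize_edge(snapshot, capacity, near_full=0.8):
    """Derive drop and fill ratios from one edge's metric values."""
    enqueued = snapshot.get("edge_enqueued_total", 0)
    dropped = snapshot.get("edge_dropped_total", 0)
    depth = snapshot.get("edge_queue_depth", 0)

    attempts = enqueued + dropped  # assumption: enqueued excludes drops
    fill = depth / capacity if capacity else 0.0
    return {
        "drop_ratio": dropped / attempts if attempts else 0.0,
        "fill_ratio": fill,
        "near_capacity": fill >= near_full,
    }


summary = summarize_edge(
    {"edge_enqueued_total": 900, "edge_dropped_total": 100, "edge_queue_depth": 9},
    capacity=10,
)
print(summary)
```

A high fill ratio with a low drop ratio usually points at a slow consumer under Block; a high drop ratio points at capacity or policy choice. The output contains only numbers, so it is safe to attach to an issue.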

Scheduler starvation or unfairness¶

Symptoms

Certain nodes rarely run or lag behind others.
Control‑plane tasks take too long to apply.

Likely causes

Long‑running tasks monopolize execution.
Blocking operations in async contexts causing stalls.
High contention on shared resources.

What to try

Audit node work
- Break long tasks into smaller units.
- Use asyncio‑friendly APIs; offload blocking calls.
Control‑plane prioritization
- Ensure control operations have clear priority.
Tune fairness strategy
- Try round‑robin vs weighted fairness; adjust weights if supported.
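The advice to break work into smaller units and offload blocking calls can be sketched with the standard library alone. This example does not use any Meridian API: blocking_lookup and handle_batch are hypothetical names, and how you wire such a coroutine into a node depends on your setup.

```python
import asyncio
import time


def blocking_lookup(key):
    """Stand-in for a blocking call (file read, DNS, legacy client)."""
    time.sleep(0.05)
    return key.upper()


async def handle_batch(keys, chunk_size=2):
    """Process keys in small chunks, offloading each blocking call to a thread."""
    results = []
    for start in range(0, len(keys), chunk_size):
        chunk = keys[start:start + chunk_size]
        # asyncio.to_thread runs the blocking function in a worker thread,
        # so the event loop stays free for other tasks.
        results += await asyncio.gather(
            *(asyncio.to_thread(blocking_lookup, k) for k in chunk)
        )
        # Yield between chunks so other tasks get a turn.
        await asyncio.sleep(0)
    return results


results = asyncio.run(handle_batch(["a", "b", "c"]))
print(results)
```

Chunking bounds how long any single unit of work holds the loop, which is the property that matters for fairness.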

What to collect (redacted)

Summary of node durations (ranges).
Structured logs around scheduling decisions (no payloads).
Minimal graph snapshot: node names, edge topology.
Scheduler metrics: scheduler_runnable_nodes, scheduler_loop_latency_seconds.

Shutdown hangs or unclean teardown¶

Symptoms

Runtime fails to exit promptly.
Nodes' on_stop hooks not called or appear stuck.
In‑flight work not draining.

Likely causes

Blocking in shutdown paths.
Pending tasks waiting on unbounded/stuck conditions.
Missing timeouts or cancellation guards.

What to try

Add timeouts to on_stop and drain operations
Make on_stop idempotent
- Avoid enqueuing new work on shutdown.
Emit lifecycle logs
- Provide start/end markers for shutdown sequences.
Use per‑edge policies
- Drain or drop remaining work explicitly.

Implement proper shutdown handling:

Python

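# Assumes `scheduler` is an already-configured scheduler instance.
try:
    scheduler.run()
except KeyboardInterrupt:
    # Ctrl+C: request a graceful shutdown.
    print("Shutting down gracefully...")
    scheduler.shutdown()
except Exception as e:
    # Unexpected failure: still shut down, then re-raise for the caller.
    print(f"Error during execution: {e}")
    scheduler.shutdown()
    raise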
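The suggestions to add timeouts and make on_stop idempotent can be illustrated without the runtime. The sketch below uses asyncio.wait_for to bound a drain step; StoppableWorker and its methods are hypothetical and only stand in for whatever your on_stop hook does.

```python
import asyncio


class StoppableWorker:
    """Hypothetical component with an idempotent, time-bounded stop."""

    def __init__(self, drain_seconds=0.01):
        self.drain_seconds = drain_seconds
        self.drain_calls = 0
        self._result = None  # None until stop() has completed once

    async def _drain(self):
        self.drain_calls += 1
        await asyncio.sleep(self.drain_seconds)  # stand-in for flushing work

    async def stop(self, timeout=1.0):
        if self._result is not None:  # idempotent: reuse the first outcome
            return self._result
        try:
            await asyncio.wait_for(self._drain(), timeout=timeout)
            self._result = True
        except asyncio.TimeoutError:
            self._result = False      # caller can log what was still pending
        return self._result


async def main():
    fast = StoppableWorker()
    slow = StoppableWorker(drain_seconds=5.0)
    first = await fast.stop()
    second = await fast.stop()                  # does not drain again
    timed_out = await slow.stop(timeout=0.05)   # gives up after 50 ms
    return first, second, fast.drain_calls, timed_out


outcome = asyncio.run(main())
print(outcome)
```

Returning a boolean rather than raising lets the shutdown sequence continue with the remaining components and report the stragglers at the end.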

What to collect (redacted)

Logs:
- event="scheduler.shutdown_start"
- event="scheduler.shutdown_requested"
- event="scheduler.shutdown_complete"
- event="node.start"
- event="node.stop"
A list of the tasks still pending after the timeout.

Validation errors or unexpected payload handling¶

What the runtime does

If an edge has a PortSpec, values (or Message.payload) are validated during enqueue. A mismatch logs edge.validation_failed and raises TypeError.

Symptoms

Frequent validation failures.
Error events too sparse or too verbose.

Likely causes

Mismatched schema vs runtime data shape.
Validation at the wrong boundary.
Error policy not configured as intended.

What to try

Validate at boundaries
- Use PortSpec at ingress/egress and, when applicable, schema validators in your producer/consumer code.
Tighten/loosen schema choices
- Optional vs required fields as systems evolve.
Confirm privacy posture
- No payloads in error logs; attach only metadata.

Use subgraph validation before execution:

Python

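import sys

# Assumes `subgraph` is an already-built Subgraph.
issues = subgraph.validate()
if issues:
    print("Validation issues found:")
    for issue in issues:
        print(f"  {issue.level}: {issue.message}")
    sys.exit(1)  # stop before running an invalid graph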
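For producer or consumer code that validates at a boundary, the privacy posture above means reporting field names and type names, never values. A minimal sketch of that idea follows; describe_mismatch and the dictionary‑of‑types schema are hypothetical and unrelated to PortSpec.

```python
def describe_mismatch(payload, expected):
    """Return schema problems as metadata: field names and type names only."""
    problems = []
    for field, expected_type in expected.items():
        if field not in payload:
            problems.append({"field": field, "problem": "missing"})
        elif not isinstance(payload[field], expected_type):
            problems.append({
                "field": field,
                "problem": "wrong_type",
                "expected": expected_type.__name__,
                # Only the type name is recorded; the value never is.
                "actual": type(payload[field]).__name__,
            })
    return problems


problems = describe_mismatch(
    {"user_id": "abc", "amount": "12.50"},
    {"user_id": str, "amount": float, "currency": str},
)
print(problems)
```

The returned list can be logged or attached to an issue as is, because it contains nothing taken from the payload values.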

What to collect (redacted)

Schema shape (names and types only; no values).
Error logs without payload contents:
- event="edge.validation_failed" edge_id="A:out->B:in"
Validation issues from Subgraph.validate().

Logging/tracing too verbose or too sparse¶

Symptoms

High log volume impacting performance.
Not enough information to diagnose issues.

What to try

Right‑size log levels
- INFO for lifecycle, WARN/ERROR for anomalies, DEBUG sparingly.
Adopt key conventions
- event, node_id, edge_id, policy, counts, durations.
Sampling
- Apply sampling for repetitive debug events.
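Sampling repetitive debug events can be done with a standard logging.Filter. The class below is a sketch, not something the runtime ships: EveryNthFilter is a hypothetical name, and it assumes your observability setup routes through Python's logging module.

```python
import logging


class EveryNthFilter(logging.Filter):
    """Pass every Nth DEBUG record; never drop INFO and above."""

    def __init__(self, n):
        super().__init__()
        self.n = n
        self._seen = 0

    def filter(self, record):
        if record.levelno > logging.DEBUG:
            return True
        self._seen += 1
        # Passes records 1, n+1, 2n+1, ... so the first one is always kept.
        return (self._seen - 1) % self.n == 0


# Attach to a handler so the sampling applies to everything it emits.
handler = logging.StreamHandler()
handler.addFilter(EveryNthFilter(100))
```

Attaching the filter to a handler rather than a logger matters: logger‑level filters are not applied to records that propagate up from child loggers.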

What to collect (redacted)

A short sequence (last 200–500 lines) with structured entries.
Note which events are missing for diagnosis.

See also

Concepts: Observability

Performance regressions¶

Symptoms

Increased latency or reduced throughput vs a previous run.
Hot CPU or I/O saturation.

Likely causes

New blocking paths introduced.
Higher cardinality in metric labels or verbose logging.
Insufficient edge capacity or missing coalescing for bursty flows.

What to try

Revert to known‑good settings
- Compare metrics before/after a change.
Profile hot paths (locally)
- Identify blocking calls; offload or batch.
Reduce label cardinality
- Keep metrics labels low and stable.
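Comparing metrics before and after a change is easier with a small diff that surfaces only the significant movements. This is a sketch under stated assumptions: compare_snapshots is a hypothetical helper, snapshots are flat name‑to‑number dictionaries, and the 10% threshold is arbitrary.

```python
def compare_snapshots(before, after, threshold=0.10):
    """Return metrics whose relative change exceeds the threshold."""
    changed = {}
    for name in sorted(before.keys() & after.keys()):
        old, new = before[name], after[name]
        if old == 0:
            # Relative change is undefined; flag any move away from zero.
            if new != 0:
                changed[name] = float("inf")
            continue
        delta = (new - old) / old
        if abs(delta) > threshold:
            changed[name] = round(delta, 3)
    return changed


before = {"throughput": 1000, "edge_queue_depth": 4, "edge_dropped_total": 10}
after = {"throughput": 800, "edge_queue_depth": 4, "edge_dropped_total": 25}
print(compare_snapshots(before, after))
```

Metrics that appear in only one snapshot are ignored here; a new or missing metric name is itself worth noting in a report.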

What to collect (redacted)

Before/after metrics: throughput, queue_depth, dropped counts, latency percentiles (if available).
Configuration diffs: policy names, capacities.
Scheduler metrics: scheduler_loop_latency_seconds, scheduler_runnable_nodes.

Debugging¶

Enable debug logs in your observability configuration.
Use metrics to inspect edge depths and drops.
Run tests and examples through uv run so they execute in the project environment and avoid path issues:
- uv run pytest
- git clone https://github.com/GhostWeaselLabs/meridian-runtime-examples.git && cd meridian-runtime-examples && uv run python examples/pipeline_demo/main.py

Key metrics to monitor:

Edge metrics: edge_queue_depth, edge_dropped_total, edge_blocked_time_seconds
Scheduler metrics: scheduler_runnable_nodes, scheduler_loop_latency_seconds, scheduler_priority_applied_total
Node metrics: node_messages_total, node_errors_total, node_tick_duration_seconds

Minimal reproduction strategy¶

Start small
- One or two nodes, one edge, a single message type.
Replace data
- Use shape‑equivalent dummy values; avoid real payloads.
Fix the seed
- Avoid non‑determinism in tests unless necessary.
Log only essentials
- Lifecycle transitions, scheduling decisions, edge enqueues/replaces/coalesce, error summaries.

Conceptual example (sanitized)

Configure an edge with capacity=10 and policy="Drop".
Send 100 synthetic messages; confirm dropped counts rise.
Compare the observed metrics against the expected behavior.
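Before building the reproduction against the runtime, it can help to write down the expected numbers with a standalone simulation. The sketch below does not use Meridian; it models a capacity‑10 queue with a drop‑when‑full rule and no consumer, using a fixed seed for the synthetic payloads.

```python
import random
from collections import deque

random.seed(1234)  # fixed seed so reruns generate the same synthetic data

CAPACITY = 10
queue = deque()
dropped = 0

for seq in range(100):
    message = {"seq": seq, "value": random.randint(0, 999)}  # dummy payload
    if len(queue) < CAPACITY:
        queue.append(message)
    else:
        dropped += 1  # drop rule: discard the new item when the queue is full

print(f"queued={len(queue)} dropped={dropped}")  # queued=10 dropped=90
```

If the real graph reports a different dropped count for the same input, the difference (a running consumer, a different default policy) is the thing to describe in the issue.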

Safe artifacts for triage¶

Environment
- OS, Python, Meridian versions; tooling versions.
Graph topology snapshot
- Node/edge names, capacities, policies; no payload schemas required.
Logs (redacted)
- Keep keys; redact values: <REDACTED>.
Metrics
- Numeric counters/gauges/histograms; avoid PII in labels.
Config differences
- Show changed keys and enum/boolean values; mask secrets or replace with CHECKSUM(...) or PLACEHOLDER.
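The "keep keys, redact values" rule for logs can be applied mechanically before sharing. The following sketch uses an allow‑list: redact and SAFE_KEYS are hypothetical, and the set of keys considered safe is an assumption you should adjust for your own data.

```python
REDACTED = "<REDACTED>"

# Assumption: these keys carry only structural metadata in your logs.
SAFE_KEYS = {"event", "edge_id", "node_id", "policy", "level"}


def redact(entry):
    """Keep every key; keep values only for allow-listed keys."""
    out = {}
    for key, value in entry.items():
        if isinstance(value, dict):
            out[key] = redact(value)   # preserve nested structure
        elif key in SAFE_KEYS:
            out[key] = value
        else:
            out[key] = REDACTED
    return out


entry = {
    "event": "edge.drop",
    "edge_id": "A:out->B:in",
    "policy": "Drop",
    "user": "alice",
    "payload": {"email": "alice@example.com"},
}
print(redact(entry))
```

An allow‑list fails closed: a field nobody anticipated is redacted by default. If node or edge names are themselves proprietary, remove edge_id and node_id from SAFE_KEYS or anonymize them first.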

Never include

Secrets, tokens, credentials.
PII or business data payloads.
Proprietary identifiers without anonymization.

Diagnostics to include when asking for help¶

OS and Python version.
Exact command(s) you ran.
Minimal snippet or steps to reproduce.
Full error output and any relevant logs (redacted).
Any local changes or configuration differences.

Known pitfalls¶

Mixing package managers
- Prefer uv for this repo to avoid environment drift; don't interleave pip unless necessary.
Stale caches
- Clear .pytest_cache, .mypy_cache, .ruff_cache, and any build artifacts if behavior seems inconsistent.
Renamed documentation
- After file moves/renames, update internal links and nav in mkdocs.yml. In strict mode, broken links abort the build.
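Clearing the stale caches listed above can be scripted so it works the same on every platform. This is a sketch: clear_caches is a hypothetical helper, it only looks at the top level of the given directory, and it defaults to a dry run so nothing is deleted by accident.

```python
import shutil
from pathlib import Path

CACHE_DIRS = (".pytest_cache", ".mypy_cache", ".ruff_cache")


def clear_caches(root=".", dry_run=True):
    """Remove tool cache directories directly under root; return matches."""
    matched = []
    for name in CACHE_DIRS:
        path = Path(root) / name
        if path.is_dir():
            matched.append(path)
            if not dry_run:
                shutil.rmtree(path)
    return matched


if __name__ == "__main__":
    for path in clear_caches():
        print(f"would remove {path}")
```

Call clear_caches(dry_run=False) from the repository root once the dry‑run listing looks right.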
