@cong-or (Contributor) commented Dec 15, 2025


Summary

Adds a multi-node P2P testing infrastructure to validate Gossipsub message propagation across a 6-node Hermes mesh.
This establishes a CI-ready foundation for automated P2P regression testing.


What’s New

P2P Testing Infrastructure

  • 6-Node Test Environment
    Full-mesh topology (15 bidirectional connections) with persistent peer IDs

  • PubSub Propagation Testing
    Automated test suite validating message propagation from node 1 → all peers

  • Cross-Platform Support
    OS-aware build system:

    • Linux: Cargo
    • macOS / Windows: Earthly
  • CI-Ready Test Suite
    Deterministic, non-interactive workflows:

    just start-ci && just test-ci
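
The OS-aware build choice described above can be sketched as a small dispatch on the kernel name. This is a minimal illustration, not the justfile's actual recipe: the function name `pick_build_tool` is hypothetical, and it assumes every non-Linux host (macOS, Windows shells) should fall back to Earthly.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map a kernel name (as printed by `uname -s`) to a build tool.
pick_build_tool() {
  case "$1" in
    Linux) echo cargo ;;    # native build on Linux
    *)     echo earthly ;;  # containerized build on macOS, Windows, and anything else
  esac
}

# Report the tool that would be used on the current host.
echo "build tool: $(pick_build_tool "$(uname -s)")"
```

Defaulting to the containerized build for unknown platforms keeps the binary's GLIBC version aligned with the Docker runtime, which is the concern the cross-platform notes in this PR raise.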

Quick Start

cd p2p-testing
just quickstart    # Build, start mesh, test PubSub

Testing

CI / Automated Testing (Clean State)

just start-ci      # CI mode: always rebuilds, no prompts
just test-ci       # Full validation suite
just clean         # Cleanup

Developer / Interactive Testing

just quickstart                  # First-time setup and test
just start                       # Interactive start (prompts to rebuild)
just test-pubsub-propagation     # Validate message propagation

Custom Tests

See Writing Custom Tests for examples, including:

  • Peer connectivity tests
  • Mesh resilience scenarios
  • Throughput and stress testing

Architecture

  • 6-node full mesh (15 bidirectional connections)
  • Gossipsub v1.2 for PubSub
  • Bootstrap-based discovery with dynamic peer ID synchronization
  • Docker volumes for persistent IPFS keypairs
  • Configurable IPFS settings via environment variables
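
The connection count follows from the full-mesh formula: n nodes need n(n-1)/2 links, which gives 15 for 6 nodes. A minimal shell sketch of that arithmetic (the function name `full_mesh_links` is illustrative and not part of the justfile):

```bash
#!/usr/bin/env bash
# Illustrative helper, not part of the PR: number of links in a full mesh of n nodes.
full_mesh_links() {
  echo $(( $1 * ($1 - 1) / 2 ))
}

full_mesh_links 6   # the 6-node mesh from this PR
full_mesh_links 3   # the earlier 3-node setup
```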

Next Steps

This PR establishes the testing foundation.
The immediate follow-up is to integrate:

just start-ci && just test-ci

into GitHub Actions, enabling:

  • Automated P2P validation on every PR
  • Early detection of Gossipsub propagation regressions
  • Increased confidence in P2P-related changes before merge

  Docker-based setup for testing IPFS PubSub/P2P across multiple Hermes nodes.

  ## Changes
  - HTTP port configurable via HERMES_HTTP_PORT (default: 5000)
  - Created p2p-testing/ with 3-node Docker setup on isolated network
  - Leverages justfile recipes for builds
…handling

  Enhances the P2P testing infrastructure to ensure cross-platform compatibility
  and provide better error handling for common Docker issues.

  Changes:
  - Add error handling for Docker network and container conflicts in start-nodes.sh
  - Emphasize Earthly (containerized builds) for GLIBC compatibility across different host OS
  - Add comprehensive documentation about cross-platform build requirements
  - Update Dockerfile and docker-compose.yml with clear build instructions
  - Add troubleshooting section for GLIBC errors in README

  The scripts now detect the following conditions and print actionable error messages:
  - Network subnet overlaps with existing Docker networks
  - Container name conflicts from previous runs

  They also point users to Earthly builds instead of local cargo builds.

  This ensures the P2P testing environment works reliably across different
  development environments (Fedora, Ubuntu, macOS, etc.) by building binaries
  in a controlled container environment that matches the Docker runtime GLIBC version.

  Enable TCP, QUIC, and DNS transports when initializing the IPFS node.
  Update the hermes-ipfs dependency to the transport enable methods branch.

  Changes:
  - Add enable_tcp(), enable_quic(), enable_dns() calls in ipfs/mod.rs
  - Update Cargo.toml to use feat/hermes-ipfs-transport-enable-methods branch

  Fixes "Multiaddr is not supported" errors preventing P2P connections.

  Related to: #704

  The IPFS initialization thread was calling std::process::exit(0) when the
  ipfs_command_handler task completed, which terminated the entire hermes
  process with exit status 0 after 1-2 minutes of runtime.

  Root cause:
  - IPFS node spawned a background thread (ipfs/mod.rs:101)
  - Thread ran ipfs_command_handler in a tokio runtime
  - When handler task finished, tokio::join! completed
  - Thread then called std::process::exit(0), killing entire process

  This resulted in:
  - Nodes exiting with status 0 (clean exit, no error logs)
  - P2P connections working initially but then everything stopped
  - Consistent timing (~1-2 minutes until exit)

  Fix:
  - Removed std::process::exit(0) call
  - Added proper error logging for IPFS thread failures
  - Thread now returns naturally instead of killing the process

  Impact:
  - Nodes now run indefinitely as intended
  - Multi-node P2P testing infrastructure is stable
  - PubSub discovery and connectivity work correctly
…or-p2p-features

Resolved conflicts:
- hermes/bin/Cargo.toml: Keep transport enable methods branch
- hermes/Cargo.lock: Regenerated with updated dependencies

  Enable custom P2P bootstrap configuration for multi-node testing:
  - Add IPFS_BOOTSTRAP_PEERS environment variable support for custom peer lists
  - Update bootstrap() to accept optional custom peers parameter
  - Add initialize-p2p.sh script to extract peer IDs and generate bootstrap config
  - Update hermes-ipfs dependency from 0.0.6 to 0.0.8

  This allows nodes in Docker environments to bootstrap to each other
  instead of public IPFS nodes, enabling isolated P2P testing.
   justfile

   Implements core P2P infrastructure improvements:

   - Persistent IPFS keypairs (~/.hermes/ipfs/keypair) for stable peer IDs
   - Bootstrap retry logic (10s interval, up to 10 attempts) for automatic reconnection
   - Port 4001 listening configuration for P2P connections
   - Comprehensive justfile with 17 commands (build, test, monitor)
   - PubSub testing command verifying Gossipsub v1.2 and peer connections
   - Updated docker-compose.yml with persistent peer IDs

   Dependencies:
   - hermes-ipfs updated to feat/hermes-ipfs-persistent-keypair branch

   Verified on 3-node Docker setup with full mesh connectivity.
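
The bounded retry described above (fixed interval, capped number of attempts) can be sketched in shell. This only illustrates the pattern; the actual logic lives in the Hermes Rust code and may differ. The names `retry` and `flaky_connect` are hypothetical, and the demo uses a 0-second interval instead of the 10 seconds from the commit message so that it finishes immediately.

```bash
#!/usr/bin/env bash
# retry MAX INTERVAL CMD...: run CMD until it succeeds or MAX attempts are used up.
retry() {
  max=$1
  interval=$2
  shift 2
  attempt=1
  while [ "$attempt" -le "$max" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep between attempts, but not after the final failure.
    if [ "$attempt" -lt "$max" ]; then
      sleep "$interval"
    fi
    attempt=$((attempt + 1))
  done
  return 1
}

# Stand-in for a bootstrap dial: fails twice, then succeeds.
tries=0
flaky_connect() {
  tries=$((tries + 1))
  [ "$tries" -ge 3 ]
}

# The commit message uses a 10s interval; 0 keeps this demo instant.
if retry 10 0 flaky_connect; then
  echo "connected after $tries attempts"
fi
```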

  - Add comprehensive documentation to justfile header and commands
  - Reduce README from 327 to 56 lines (points to justfile)
  - Remove legacy shell scripts directory (consolidated into justfile)
  - Remove obsolete files (.env.bootstrap, docker-compose.yml.backup)

  All documentation now lives in the justfile, with organized sections, inline
  comments, and a detailed help command. The README provides a quick start only.

  - Auto-subscribe all nodes to documents.new topic on startup to form
    Gossipsub mesh (requires mesh_n=6 peers for optimal operation)
  - Add nodes 4-6 to reach minimum Gossipsub mesh size
  - Fix DHT provider check to accept local node in isolated networks
  - Extend DHT provider timeout and add detailed logging
  - Document why 6 nodes are required in README and justfile

  This allows PubSub publish to succeed by ensuring the topic mesh is
  formed before any POST requests, and relaxes DHT provider requirements
  for small test networks where nodes don't replicate content proactively.

  PubSub was publishing successfully but messages weren't being received
  because the doc-sync module was missing the hermes:ipfs/event.on-topic
  handler. The system dispatches OnTopicEvent but the module only exported
  hermes:doc-sync/event, not hermes:ipfs/event.

  Changes:
  - Add hermes:ipfs/event export to doc-sync module world definition
  - Implement on-topic event handler with detailed logging
  - Add comprehensive logging throughout PubSub publish path
  - Enhance error reporting in doc_sync host publish function
  - Update p2p-testing justfile with parallel module packaging
  - Add test-pubsub-propagation recipe for end-to-end testing
  - Update docker-compose bootstrap peers with current peer IDs

  The on-topic handler now logs all received PubSub messages with topic,
  size, and message preview, making it easy to verify propagation across
  the 6-node test mesh.

  Root cause: Mismatch between dispatched event (OnTopicEvent calling
  hermes:ipfs/event.on-topic) and module exports (only hermes:doc-sync/event).
  Messages were successfully published and routed by Gossipsub but dropped
  at the module boundary due to missing handler.

  PubSub messages were published successfully but never received because
  the doc-sync module was missing the hermes:ipfs/event.on-topic handler.
  The runtime dispatches PubSub messages as OnTopicEvent, but the module
  only exported hermes:doc-sync/event.

  Additionally, docker-compose.yml has hardcoded bootstrap peer IDs that
  become stale when volumes are deleted, causing [wrong peer id] errors.

  Changes:
  - Add hermes:ipfs/event export and on-topic handler to doc-sync module
  - Add detailed PubSub logging throughout publish/receive path
  - Add init-bootstrap command to sync docker-compose.yml after volume deletion
  - Document PR#694 persistent keypairs (already working, just needed docs)
  - Clarify when init-bootstrap is needed (after clean, not after restart)
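
To sync bootstrap configuration, each node's peer ID has to be read back from its logs. A minimal sketch of that extraction, assuming the two log phrases quoted elsewhere in this PR ("Generated new keypair with Peer ID:" and "Loaded keypair with Peer ID:"). The function name `extract_peer_id`, the sample log lines, and the peer IDs are placeholders; the real init-bootstrap recipe may parse logs differently.

```bash
#!/usr/bin/env bash
# Print the last peer ID found on stdin, whichever keypair message announced it.
extract_peer_id() {
  sed -n -E 's/.*(Generated new|Loaded) keypair with Peer ID: *([A-Za-z0-9]+).*/\2/p' | tail -n 1
}

# Placeholder log lines; real Hermes log formatting and peer IDs will differ.
printf '%s\n' \
  'INFO Generated new keypair with Peer ID: 12D3KooWExampleA' \
  'INFO Loaded keypair with Peer ID: 12D3KooWExampleB' | extract_peer_id
```

In practice the input would come from container logs, for example `docker logs <container> 2>&1 | extract_peer_id`, where the container name depends on docker-compose.yml.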
…g in CI

  Three critical fixes to P2P testing infrastructure:

  1. init-bootstrap: Detect both "Generated" and "Loaded" keypair messages
     - Previously only matched "Generated new keypair with Peer ID:"
     - Now also handles "Loaded keypair with Peer ID:" for persistent volumes
     - Fixes bootstrap discovery after first run when keypairs exist

  2. test-pubsub-propagation: Fix bash arithmetic causing premature exit
     - Changed ((RECEIVED_COUNT++)) to RECEIVED_COUNT=1
     - With set -euo pipefail, post-incrementing from 0 makes (( )) evaluate to 0, which bash reports as exit status 1 and set -e treats as a failure
     - Test now correctly reports all 5 nodes receiving messages

  3. CI workflow: Add bootstrap init and message propagation test
     - start-ci now runs init-bootstrap to sync peer IDs
     - test-ci now includes test-pubsub-propagation for end-to-end verification
     - Ensures CI validates actual PubSub message delivery, not just infrastructure

  These changes enable reliable CI testing of P2P features with proper
  bootstrap configuration and functional validation of message propagation
  across all 6 nodes.
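
The arithmetic pitfall in fix 2 is easy to reproduce. The sketch below runs each form in a child bash so the failure does not stop the surrounding script. The assignment form shown is one common safe alternative and is not necessarily the exact change made in the recipe.

```bash
#!/usr/bin/env bash
# Each snippet runs in a child bash so the failing case does not abort this script.

# Failing form: n++ yields the old value 0, (( )) reports status 1, and set -e exits.
status=0
bash -c 'set -euo pipefail; n=0; ((n++)); echo "n=$n"' || status=$?
echo "((n++)) starting from 0: exit status $status"

# One safe alternative: plain assignment from an arithmetic expansion.
out=$(bash -c 'set -euo pipefail; n=0; n=$((n + 1)); echo "n=$n"')
echo "assignment form: $out"
```

The failure only triggers when the counter is 0 before the increment, which is why the test stopped at the first received message instead of failing on every iteration.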

  Add automatic cleanup to the start-ci command to guarantee a reproducible
  CI environment. The command now checks for running nodes and executes
  docker compose down -v before building and starting.

  This ensures CI runs are always deterministic by:
  - Removing stale containers and volumes
  - Forcing fresh peer identity generation
  - Preventing state pollution between runs

  Simplifies the CI workflow to a single line: just start-ci && just test-ci
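
The clean-state step described above can be sketched as a shell function. This is an assumption-laden illustration rather than the start-ci recipe itself: the function name `reset_mesh_state` is hypothetical, and it assumes the teardown runs unconditionally from the directory containing docker-compose.yml.

```bash
#!/usr/bin/env bash
# Sketch of a clean-state reset; run from the directory that holds docker-compose.yml.
reset_mesh_state() {
  # `docker compose ps -q` prints the IDs of the project's running containers.
  if [ -n "$(docker compose ps -q)" ]; then
    echo "Running nodes found; they will be stopped"
  fi
  # -v also removes named volumes, so persisted keypairs (and peer IDs) are discarded.
  docker compose down -v
}

# Usage (requires Docker): reset_mesh_state
```

Because the volumes are removed, every node generates a new keypair on the next start, which is why the bootstrap peer IDs have to be re-synced afterwards.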
…ation

  - Add visual HTTP POST flow diagram explaining why Node 1 is publisher
  - Enhance test-pubsub-propagation with 4 insight sections:
    * Propagation timeline with receive timestamps
    * Network statistics (message size, success rate, latency)
    * Live log preview showing PubSub activity
    * Peer connection matrix showing full mesh
  - Add educational explanations of Gossipsub, PubSub topics, and CIDs
  - Add workflow diagram to README showing daily command flow
  - Clarify interactive mode (start) vs CI mode (start-ci) throughout docs
  - Emphasize peer ID lifecycle: preserved with 'stop', regenerated after 'clean'
  - Update help command with important notes section

  Makes P2P testing infrastructure more accessible and educational for new users.

  - Add 7 reusable helper functions to eliminate code duplication
  - Refactor test-pubsub-core using helpers (65 → 35 lines)
  - Standardize color code format across 8 recipes
  - Add curl command explanation to PubSub test visualization
  - Fix documentation references for test-pubsub-propagation

  No functional changes. Improves maintainability and readability.
…commands

   Remove dashboard and health-check commands (~195 lines) since test-pubsub-propagation
   already validates all necessary functionality (nodes running status, mesh connectivity
   via message propagation, gossipsub activity via successful delivery, end-to-end P2P).

   Fix HTTP endpoint health check to include required Host header for Hermes gateway.
…second timing

  Update test-pubsub to publish messages and verify reception on all nodes
  (minimal output for CI), matching test-pubsub-propagation functionality.
  Remove the redundant test-pubsub-core command.

  Improve timing precision from seconds to milliseconds using date +%s%3N, showing actual sub-second propagation delays (0.234s instead of 0s) in visualization and final status.
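
A minimal sketch of the millisecond timing approach, assuming GNU date (the `%3N` specifier is a GNU extension and is not available in BSD/macOS date). The helper names `now_ms` and `format_ms` are illustrative and do not come from the justfile.

```bash
#!/usr/bin/env bash
# Millisecond timestamp; %3N truncates nanoseconds to three digits (GNU date only).
now_ms() {
  date +%s%3N
}

# Render a millisecond duration as seconds with three decimals.
format_ms() {
  printf '%d.%03ds\n' $(( $1 / 1000 )) $(( $1 % 1000 ))
}

format_ms 234    # a 234 ms propagation delay
format_ms 1005

# Typical use around a publish/receive check:
#   start=$(now_ms); <publish and wait for receipt>; end=$(now_ms)
#   format_ms $(( end - start ))
```

Keeping the arithmetic in integer milliseconds avoids floating-point tools such as bc and only formats the value at display time.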

  Restructure README with TL;DR and fast restart workflows. Fix outdated
  command references (dashboard, test-pubsub-core) and clarify prerequisites
  (Rust toolchain, not Earthly). Remove duplicate entries and confusing
  peer ID examples. Add TODOs for GitHub Actions integration.

  Next task: integrate CI workflow into GitHub Actions runners for automated PR testing.
…ection

  - Restructure README with TL;DR for minimal cognitive load
  - Add comprehensive comments to docker-compose.yml and Dockerfile
  - Auto-detect platform in build commands (Linux uses cargo, Mac/Windows uses Earthly)
  - Fix outdated references and broken commands
  - Add fast restart workflow documentation
  - Add GitHub Actions integration TODOs
  - Add TL;DR quickstart and fast restart workflows
  - Auto-detect platform for builds (Linux cargo, Mac/Windows Earthly)
  - Fix non-existent command references and outdated line numbers
  - Add comprehensive inline comments to docker-compose
  - Expand bootstrap peer discovery docs for production deployment
  - Extract MESSAGE_PREVIEW_MAX_LEN constant and create format_message_preview() helper
  - Make IPFS listen port configurable via IPFS_LISTEN_PORT env var (default: 4001)
  - Make retry settings configurable via IPFS_RETRY_INTERVAL_SECS and IPFS_MAX_RETRIES
  - Document exponential backoff strategy for DHT provider queries

  Improves maintainability and configurability without changing behavior.
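
The "environment variable or default" behavior can be illustrated with shell parameter expansion. This is a sketch only: Hermes reads these variables in Rust, the function name `load_p2p_config` is hypothetical, and the retry defaults are assumed to match the 10s interval and 10-attempt limit mentioned in an earlier commit.

```bash
#!/usr/bin/env bash
# Illustrative only: resolve each setting from the environment, else use the default.
load_p2p_config() {
  http_port="${HERMES_HTTP_PORT:-5000}"             # default stated in this PR
  listen_port="${IPFS_LISTEN_PORT:-4001}"           # default stated in this PR
  retry_interval="${IPFS_RETRY_INTERVAL_SECS:-10}"  # assumed default (10s interval)
  max_retries="${IPFS_MAX_RETRIES:-10}"             # assumed default (10 attempts)
}

load_p2p_config
echo "http=$http_port p2p=$listen_port retry=${retry_interval}s x $max_retries"
```

Overriding a single value, for example `IPFS_LISTEN_PORT=4101`, changes only that setting and leaves the other defaults in place.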

@cong-or linked an issue on Dec 15, 2025 that may be closed by this pull request
@cong-or changed the title from "704 task multi node testing infrastructure for p2p features" to "feat(hermes): multi node testing infrastructure for p2p features" on Dec 15, 2025
@cong-or added the "squad: hermetics" label (Hermes Backend, System Development & Integration Team) on Dec 15, 2025
@cong-or added this to Catalyst on Dec 15, 2025

  - Add env_var_or() helper to eliminate duplicate env var parsing
  - Extract configure_listening_address() and connect_to_bootstrap_peers() from init()
  - Extract constants: IPFS_DATA_DIR, KEYPAIR_FILENAME, DEFAULT_APP_NAME, DEFAULT_MESH_TOPIC
  - Simplify retry_bootstrap_connections() with early returns and cleaner logic
…onstraints

  - Explain what bootstrap peers are and their role in network formation
  - Document retry strategy for bootstrap connections
  - Add critical warning: public IPFS nodes cannot be used for Hermes PubSub
  - Explain why DHT server mode is required for Gossipsub mesh formation
  - Document auto-subscription strategy to avoid cold start problem

  Emphasizes the key constraint that Gossipsub requires custom Hermes bootstrap
  peers since public IPFS nodes don't subscribe to application-specific topics.

  - Add Writing Custom Tests section with examples (connectivity, resilience, throughput)
  - Include template and best practices for extending justfile tests
  - Fix ANSI escape codes not rendering (add -e flag to echo commands)
  - Note: proper API testing framework planned when endpoints finalized
  - Explain mesh_n parameter and its effect on Gossipsub operations
  - Add TL;DR sections to bootstrap discovery and PubSub warnings
  - Clarify cross-platform builds (cargo vs Earthly) and GLIBC compatibility
  - Add DNS analogy for bootstrap peers concept
  - List explicit 5-step post document workflow
  - Explain logging levels (debug = P2P diagnostics, info = app events)
- Replace confusing "need 2+" / "need 1+" messages with clear explanations
- Old: "got 1 provider(s), need 2+"
- New: "found 1 provider(s), waiting for ourselves to appear in DHT query results"
- The 2+ / 1+ thresholds did not match actual success conditions
  - Remove redundant build-all/build-images steps (quickstart handles this)
  - Clarify difference between nuclear option (prunes all Docker resources)
    vs just clean (only removes p2p-testing volumes)
  - Explain mesh_n, bootstrap peers (DNS analogy), and 5-step post workflow
  - Fix confusing DHT log: "need 2+" now reads "waiting for ourselves to appear"
  - Add backticks around PubSub and function names (clippy doc-markdown)
  - Simplify TROUBLESHOOTING reset (quickstart handles rebuilds)
  - Clarify quickstart uses existing binaries (run just build-all after code changes)

Labels

squad: hermetics Hermes Backend, System Development & Integration Team

Projects

Status: New

Development

Successfully merging this pull request may close these issues.

🛠️ [TASK] : Multi-Node Testing Infrastructure for P2P Features

2 participants