feat(hermes): multi node testing infrastructure for p2p features #721
Draft: cong-or wants to merge 33 commits into main from 704-task-multi-node-testing-infrastructure-for-p2p-features
Docker-based setup for testing IPFS PubSub/P2P across multiple Hermes nodes.
## Changes
- HTTP port configurable via HERMES_HTTP_PORT (default: 5000)
- Created p2p-testing/ with 3-node Docker setup on isolated network
- Leverages justfile recipes for builds
…handling
Enhances the P2P testing infrastructure to ensure cross-platform compatibility and provide better error handling for common Docker issues.
Changes:
- Add error handling for Docker network and container conflicts in start-nodes.sh
- Emphasize Earthly (containerized builds) for GLIBC compatibility across different host OS
- Add comprehensive documentation about cross-platform build requirements
- Update Dockerfile and docker-compose.yml with clear build instructions
- Add troubleshooting section for GLIBC errors in README
The scripts now detect and provide helpful error messages for:
- Network subnet overlaps with existing Docker networks
- Container name conflicts from previous runs
- Guidance to use Earthly builds instead of local cargo builds
This ensures the P2P testing environment works reliably across different development environments (Fedora, Ubuntu, macOS, etc.) by building binaries in a controlled container environment that matches the Docker runtime GLIBC version.
Enable TCP, QUIC, and DNS transports when initializing IPFS node. Update hermes-ipfs dependency to use transport enable methods branch.
Changes:
- Add enable_tcp(), enable_quic(), enable_dns() calls in ipfs/mod.rs
- Update Cargo.toml to use feat/hermes-ipfs-transport-enable-methods branch
Fixes "Multiaddr is not supported" errors preventing P2P connections.
Related to: #704
The IPFS initialization thread was calling std::process::exit(0) when the ipfs_command_handler task completed, causing the entire hermes process to exit cleanly after 1-2 minutes of runtime.
Root cause:
- IPFS node spawned a background thread (ipfs/mod.rs:101)
- Thread ran ipfs_command_handler in a tokio runtime
- When handler task finished, tokio::join! completed
- Thread then called std::process::exit(0), killing entire process
This resulted in:
- Nodes exiting with status 0 (clean exit, no error logs)
- P2P connections working initially but then everything stopped
- Consistent timing (~1-2 minutes until exit)
Fix:
- Removed std::process::exit(0) call
- Added proper error logging for IPFS thread failures
- Thread now returns naturally instead of killing the process
Impact:
- Nodes now run indefinitely as intended
- Multi-node P2P testing infrastructure is stable
- PubSub discovery and connectivity working correctly
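The fix described above can be illustrated with a minimal sketch. It uses a plain std::thread rather than the tokio runtime the commit mentions, and `spawn_handler` is a hypothetical name, not a function from the Hermes codebase. The point it shows is that returning from the thread closure ends only that thread, whereas std::process::exit(0) would end the whole process.

```rust
use std::thread;

/// Runs a handler on a background thread and logs how it ended.
/// Hypothetical stand-in for the IPFS command-handler thread.
fn spawn_handler<F>(handler: F) -> thread::JoinHandle<()>
where
    F: FnOnce() -> Result<(), String> + Send + 'static,
{
    thread::spawn(move || {
        // Calling std::process::exit(0) here would terminate every thread
        // in the process as soon as the handler finished.
        match handler() {
            Ok(()) => eprintln!("IPFS handler finished; thread returning"),
            Err(err) => eprintln!("IPFS handler failed: {err}"),
        }
        // Falling off the end of the closure ends only this thread.
    })
}
```

Joining the returned handle, for example `spawn_handler(|| Ok(())).join()`, returns control to the caller, so the rest of the process keeps running after the handler completes.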
…or-p2p-features
Resolved conflicts:
- hermes/bin/Cargo.toml: Keep transport enable methods branch
- hermes/Cargo.lock: Regenerated with updated dependencies
Enable custom P2P bootstrap configuration for multi-node testing:
- Add IPFS_BOOTSTRAP_PEERS environment variable support for custom peer lists
- Update bootstrap() to accept optional custom peers parameter
- Add initialize-p2p.sh script to extract peer IDs and generate bootstrap config
- Update hermes-ipfs dependency from 0.0.6 to 0.0.8
This allows nodes in Docker environments to bootstrap to each other instead of public IPFS nodes, enabling isolated P2P testing.
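As a rough sketch of how an optional custom peer list could be derived from IPFS_BOOTSTRAP_PEERS: the comma-separated format and the `parse_bootstrap_peers` name are assumptions for illustration, since the commit does not state the actual format or the signature of bootstrap().

```rust
/// Parses a peer list such as the value of IPFS_BOOTSTRAP_PEERS.
/// Assumption: entries are comma-separated multiaddrs; the PR does not
/// state the real format.
fn parse_bootstrap_peers(raw: Option<&str>) -> Option<Vec<String>> {
    let peers: Vec<String> = raw?
        .split(',')
        .map(str::trim)
        .filter(|entry| !entry.is_empty())
        .map(str::to_owned)
        .collect();
    // An empty or blank value means "no custom peers", so the caller can
    // fall back to its default bootstrap list.
    if peers.is_empty() {
        None
    } else {
        Some(peers)
    }
}
```

A caller might pass `std::env::var("IPFS_BOOTSTRAP_PEERS").ok().as_deref()` and treat `None` as "use the default peers".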
justfile
Implements core P2P infrastructure improvements:
- Persistent IPFS keypairs (~/.hermes/ipfs/keypair) for stable peer IDs
- Bootstrap retry logic (10s interval, 10 max) for automatic reconnection
- Port 4001 listening configuration for P2P connections
- Comprehensive justfile with 17 commands (build, test, monitor)
- PubSub testing command verifying Gossipsub v1.2 and peer connections
- Updated docker-compose.yml with persistent peer IDs
Dependencies:
- hermes-ipfs updated to feat/hermes-ipfs-persistent-keypair branch
Verified on 3-node Docker setup with full mesh connectivity.
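The bootstrap retry behavior (fixed interval, bounded number of attempts) could look roughly like the following. This is a synchronous sketch with a hypothetical `retry_connect` helper; the actual Hermes code is presumably async and is not shown in this PR page.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Calls `connect` until it succeeds or `max_retries` attempts are used.
/// Returns the 1-based attempt number that succeeded, or the last error.
fn retry_connect<F>(mut connect: F, interval: Duration, max_retries: u32) -> Result<u32, String>
where
    F: FnMut() -> Result<(), String>,
{
    let mut last_err = String::from("no attempts made");
    for attempt in 1..=max_retries {
        match connect() {
            Ok(()) => return Ok(attempt),
            Err(err) => last_err = err,
        }
        // Wait between attempts, but not after the final one.
        if attempt < max_retries {
            sleep(interval);
        }
    }
    Err(last_err)
}
```

With the values from the commit message, a call would pass `Duration::from_secs(10)` and `10` as the last two arguments.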
- Add comprehensive documentation to justfile header and commands
- Reduce README from 327 to 56 lines (points to justfile)
- Remove legacy shell scripts directory (consolidated into justfile)
- Remove obsolete files (.env.bootstrap, docker-compose.yml.backup)
All documentation now in justfile with organized sections, inline comments, and detailed help command. README provides quick start only.
- Auto-subscribe all nodes to documents.new topic on startup to form
Gossipsub mesh (requires mesh_n=6 peers for optimal operation)
- Add nodes 4-6 to reach minimum Gossipsub mesh size
- Fix DHT provider check to accept local node in isolated networks
- Extend DHT provider timeout and add detailed logging
- Document why 6 nodes are required in README and justfile
This allows PubSub publish to succeed by ensuring the topic mesh is
formed before any POST requests, and relaxes DHT provider requirements
for small test networks where nodes don't replicate content proactively.
PubSub was publishing successfully but messages weren't being received because the doc-sync module was missing the hermes:ipfs/event.on-topic handler. The system dispatches OnTopicEvent but the module only exported hermes:doc-sync/event, not hermes:ipfs/event.
Changes:
- Add hermes:ipfs/event export to doc-sync module world definition
- Implement on-topic event handler with detailed logging
- Add comprehensive logging throughout PubSub publish path
- Enhance error reporting in doc_sync host publish function
- Update p2p-testing justfile with parallel module packaging
- Add test-pubsub-propagation recipe for end-to-end testing
- Update docker-compose bootstrap peers with current peer IDs
The on-topic handler now logs all received PubSub messages with topic, size, and message preview, making it easy to verify propagation across the 6-node test mesh.
Root cause: Mismatch between dispatched event (OnTopicEvent calling hermes:ipfs/event.on-topic) and module exports (only hermes:doc-sync/event). Messages were successfully published and routed by Gossipsub but dropped at the module boundary due to missing handler.
PubSub messages were published successfully but never received because the doc-sync module was missing the hermes:ipfs/event.on-topic handler. The runtime dispatches PubSub messages as OnTopicEvent, but the module only exported hermes:doc-sync/event.
Additionally, docker-compose.yml has hardcoded bootstrap peer IDs that become stale when volumes are deleted, causing [wrong peer id] errors.
Changes:
- Add hermes:ipfs/event export and on-topic handler to doc-sync module
- Add detailed PubSub logging throughout publish/receive path
- Add init-bootstrap command to sync docker-compose.yml after volume deletion
- Document PR#694 persistent keypairs (already working, just needed docs)
- Clarify when init-bootstrap is needed (after clean, not after restart)
…g in CI
Three critical fixes to P2P testing infrastructure:
1. init-bootstrap: Detect both Generated and Loaded keypair messages
- Previously only matched Generated new keypair with Peer ID:
- Now handles Loaded keypair with Peer ID: for persistent volumes
- Fixes bootstrap discovery after first run when keypairs exist
2. test-pubsub-propagation: Fix bash arithmetic causing premature exit
- Changed ((RECEIVED_COUNT++)) to RECEIVED_COUNT=1
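- With set -euo pipefail, ((RECEIVED_COUNT++)) evaluates to the pre-increment value; when that value is 0 the arithmetic command returns exit status 1, which aborts the script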
- Test now correctly reports all 5 nodes receiving messages
3. CI workflow: Add bootstrap init and message propagation test
- start-ci now runs init-bootstrap to sync peer IDs
- test-ci now includes test-pubsub-propagation for end-to-end verification
- Ensures CI validates actual PubSub message delivery, not just infrastructure
These changes enable reliable CI testing of P2P features with proper
bootstrap configuration and functional validation of message propagation
across all 6 nodes.
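The init-bootstrap fix in item 1 amounts to matching either of two log prefixes. The recipe itself lives in the justfile; the following is only a sketch of the same matching logic, with a hypothetical `extract_peer_id` helper and the assumption that the peer ID is the next whitespace-delimited token after the prefix.

```rust
/// Log prefixes named in the commit message above.
const KEYPAIR_MARKERS: [&str; 2] = [
    "Generated new keypair with Peer ID:",
    "Loaded keypair with Peer ID:",
];

/// Returns the peer ID following either marker, if the line contains one.
/// Assumption: the ID is the next whitespace-delimited token.
fn extract_peer_id(line: &str) -> Option<&str> {
    KEYPAIR_MARKERS.iter().find_map(|marker| {
        let (_, rest) = line.split_once(*marker)?;
        rest.split_whitespace().next()
    })
}
```

Matching only the first prefix would return `None` for every node that starts with an existing keypair, which is the failure the commit describes for runs after the first.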
Add automatic cleanup to start-ci command to guarantee reproducible CI environment. The command now checks for running nodes and executes docker compose down -v before building and starting.
This ensures CI runs are always deterministic by:
- Removing stale containers and volumes
- Forcing fresh peer identity generation
- Preventing state pollution between runs
Simplifies CI workflow to single command: just start-ci && just test-ci
…ation
- Add visual HTTP POST flow diagram explaining why Node 1 is publisher
- Enhance test-pubsub-propagation with 4 insight sections:
* Propagation timeline with receive timestamps
* Network statistics (message size, success rate, latency)
* Live log preview showing PubSub activity
* Peer connection matrix showing full mesh
- Add educational explanations of Gossipsub, PubSub topics, and CIDs
- Add workflow diagram to README showing daily command flow
- Clarify interactive mode (start) vs CI mode (start-ci) throughout docs
- Emphasize peer ID lifecycle: preserved with 'stop', regenerated after 'clean'
- Update help command with important notes section
Makes P2P testing infrastructure more accessible and educational for new users.
- Add 7 reusable helper functions to eliminate code duplication
- Refactor test-pubsub-core using helpers (65 → 35 lines)
- Standardize color code format across 8 recipes
- Add curl command explanation to PubSub test visualization
- Fix documentation references for test-pubsub-propagation
No functional changes. Improves maintainability and readability.
…commands
Remove dashboard and health-check commands (~195 lines) since test-pubsub-propagation already validates all necessary functionality (nodes running status, mesh connectivity via message propagation, gossipsub activity via successful delivery, end-to-end P2P).
Fix HTTP endpoint health check to include required Host header for Hermes gateway.
…second timing
Update test-pubsub to publish messages and verify reception on all nodes (minimal output for CI), matching test-pubsub-propagation functionality. Remove redundant test-pubsub-core command.
Improve timing precision from seconds to milliseconds using date +%s%3N, showing actual sub-second propagation delays (0.234s instead of 0s) in visualization and final status.
Restructure README with TL;DR and fast restart workflows. Fix outdated command references (dashboard, test-pubsub-core) and clarify prerequisites (Rust toolchain, not Earthly). Remove duplicate entries and confusing peer ID examples. Add TODOs for GitHub Actions integration. Next task: integrate CI workflow into GitHub Actions runners for automated PR testing.
…ection
- Restructure README with TL;DR for minimal cognitive load
- Add comprehensive comments to docker-compose.yml and Dockerfile
- Auto-detect platform in build commands (Linux uses cargo, Mac/Windows uses Earthly)
- Fix outdated references and broken commands
- Add fast restart workflow documentation
- Add GitHub Actions integration TODOs
- Add TL;DR quickstart and fast restart workflows
- Auto-detect platform for builds (Linux cargo, Mac/Windows Earthly)
- Fix non-existent command references and outdated line numbers
- Add comprehensive inline comments to docker-compose
- Expand bootstrap peer discovery docs for production deployment
…esting-infrastructure-for-p2p-features
- Extract MESSAGE_PREVIEW_MAX_LEN constant and create format_message_preview() helper
- Make IPFS listen port configurable via IPFS_LISTEN_PORT env var (default: 4001)
- Make retry settings configurable via IPFS_RETRY_INTERVAL_SECS and IPFS_MAX_RETRIES
- Document exponential backoff strategy for DHT provider queries
Improves maintainability and configurability without changing behavior.
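A preview helper along these lines might look as follows. The constant and function names come from the commit message, but the value 64, the byte-slice parameter, the lossy UTF-8 decoding, and the "..." suffix are all assumptions for illustration, since the PR page does not show the implementation.

```rust
/// Assumed value; the PR names the constant but not its size.
const MESSAGE_PREVIEW_MAX_LEN: usize = 64;

/// Returns at most MESSAGE_PREVIEW_MAX_LEN characters of the payload,
/// followed by "..." when the payload was longer than that.
fn format_message_preview(payload: &[u8]) -> String {
    // Invalid UTF-8 is replaced rather than rejected, so binary payloads
    // still produce a loggable preview.
    let text = String::from_utf8_lossy(payload);
    let mut preview: String = text.chars().take(MESSAGE_PREVIEW_MAX_LEN).collect();
    if text.chars().count() > MESSAGE_PREVIEW_MAX_LEN {
        preview.push_str("...");
    }
    preview
}
```

Truncating by characters rather than bytes avoids cutting a multi-byte character in half, which matters if the preview is written to logs.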
- Add env_var_or() helper to eliminate duplicate env var parsing
- Extract configure_listening_address() and connect_to_bootstrap_peers() from init()
- Extract constants: IPFS_DATA_DIR, KEYPAIR_FILENAME, DEFAULT_APP_NAME, DEFAULT_MESH_TOPIC
- Simplify retry_bootstrap_connections() with early returns and cleaner logic
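One plausible shape for an env_var_or() helper is sketched below. The name comes from the commit message; the generic signature and the `parse_or` split are assumptions, chosen so the parsing logic can be exercised without touching the process environment.

```rust
use std::env;
use std::str::FromStr;

/// Parses `raw` as `T`, falling back to `default` when the value is
/// missing or does not parse. Hypothetical helper for illustration.
fn parse_or<T: FromStr>(raw: Option<&str>, default: T) -> T {
    raw.and_then(|value| value.trim().parse().ok())
        .unwrap_or(default)
}

/// Reads an environment variable and parses it, or returns `default`.
/// Assumed signature; the PR only names the helper.
fn env_var_or<T: FromStr>(name: &str, default: T) -> T {
    parse_or(env::var(name).ok().as_deref(), default)
}
```

With this shape, the settings named in the previous commit could be read as `env_var_or("IPFS_LISTEN_PORT", 4001u16)` or `env_var_or("IPFS_MAX_RETRIES", 10u32)`, using the defaults stated in the commit messages.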
…onstraints
- Explain what bootstrap peers are and their role in network formation
- Document retry strategy for bootstrap connections
- Add critical warning: public IPFS nodes cannot be used for Hermes PubSub
- Explain why DHT server mode is required for Gossipsub mesh formation
- Document auto-subscription strategy to avoid cold start problem
Emphasizes the key constraint that Gossipsub requires custom Hermes bootstrap peers since public IPFS nodes don't subscribe to application-specific topics.
- Add Writing Custom Tests section with examples (connectivity, resilience, throughput)
- Include template and best practices for extending justfile tests
- Fix ANSI escape codes not rendering (add -e flag to echo commands)
- Note: proper API testing framework planned when endpoints finalized
- Explain mesh_n parameter and its effect on Gossipsub operations
- Add TL;DR sections to bootstrap discovery and PubSub warnings
- Clarify cross-platform builds (cargo vs Earthly) and GLIBC compatibility
- Add DNS analogy for bootstrap peers concept
- List explicit 5-step post document workflow
- Explain logging levels (debug = P2P diagnostics, info = app events)
- Replace confusing "need 2+" / "1+" messages with clear explanations
- Old: "got 1 provider(s), need 2+"
- New: "found 1 provider(s), waiting for ourselves to appear in DHT query results"
- The 2+ / 1+ thresholds did not match actual success conditions
- Remove redundant build-all/build-images steps (quickstart handles this)
- Clarify difference between nuclear option (prunes all Docker resources)
vs just clean (only removes p2p-testing volumes)
- Explain mesh_n, bootstrap peers (DNS analogy), and 5-step post workflow
- Fix confusing DHT log: "need 2+" now reads "waiting for ourselves to appear"
- Add backticks around PubSub and function names (clippy doc-markdown)
- Simplify TROUBLESHOOTING reset (quickstart handles rebuilds)
- Clarify quickstart uses existing binaries (run just build-all after code changes)
Summary
Adds a multi-node P2P testing infrastructure to validate Gossipsub message propagation across a 6-node Hermes mesh.
This establishes a CI-ready foundation for automated P2P regression testing.
What’s New
P2P Testing Infrastructure
6-Node Test Environment
Full-mesh topology (15 bidirectional connections) with persistent peer IDs
PubSub Propagation Testing
Automated test suite validating message propagation from node 1 → all peers
Cross-Platform Support
OS-aware build system: Linux builds with cargo, Mac/Windows builds with Earthly
CI-Ready Test Suite
Deterministic, non-interactive workflows:
just start-ci && just test-ci
Quick Start
Testing
CI / Automated Testing (Clean State)
Developer / Interactive Testing
Custom Tests
See Writing Custom Tests for examples, including connectivity, resilience, and throughput tests.
Architecture
Next Steps
This PR establishes the testing foundation.
The immediate follow-up is to integrate
just start-ci && just test-ci
into GitHub Actions, enabling: