GPU-level LLM inference profiler that analyzes token-level performance and provides AI-powered explanations.
Try it now: https://siddhant-k-code--llmtracefx-web-app.modal.run (may not be available at all times due to Modal's free-tier limitations 🙈)
Quick API test:
```bash
curl -X POST "https://siddhant-k-code--llmtracefx-web-app.modal.run/analyze-trace" \
  -H "Content-Type: application/json" \
  -d '{"trace_data": {"tokens": [{"id": 0, "text": "Hello", "operations": [{"name": "matmul", "start_time": 0, "duration": 15.3}]}]}, "gpu_type": "A10G", "enable_claude": false}'
```
Upload your trace file:
```bash
curl -X POST "https://siddhant-k-code--llmtracefx-web-app.modal.run/upload-trace" \
  -F "file=@your_trace.json" -F "gpu_type=A10G" -F "enable_claude=true"
```
- Token-level profiling of LLM inference with kernel timing analysis
- GPU bottleneck detection (stall %, launch delays, memory issues)
- AI explanations using Claude API for performance insights
- Interactive visualizations with flame graphs and dashboards
- Modal.com deployment with GPU acceleration
- Multiple input formats (vLLM, generic trace logs)
```bash
git clone https://github.com/Siddhant-K-code/LLMTraceFX.git
cd LLMTraceFX

# Install uv if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create virtual environment and install dependencies
uv sync

# Or install in development mode with optional dependencies
uv sync --extra dev --extra test
```
```bash
git clone https://github.com/Siddhant-K-code/LLMTraceFX.git
cd LLMTraceFX
pip install -r llmtracefx/requirements.txt

# Or install as editable package
pip install -e .
```
```bash
# With uv
uv run llmtracefx --trace sample
uv run llmtracefx --trace your_trace.json --gpu-type A10G
uv run llmtracefx --trace sample --no-claude

# Or activate the virtual environment first
uv sync
source .venv/bin/activate  # or .venv\Scripts\activate on Windows
llmtracefx --trace sample

# With pip/python
python -m llmtracefx.main --trace sample
python -m llmtracefx.main --trace your_trace.json --gpu-type A10G
python -m llmtracefx.main --trace sample --no-claude
```
```bash
# With uv
uv run llmtracefx-serve

# Or with python
python -m llmtracefx.api.serve

# Access at http://localhost:8000
```
```bash
# Set up Modal secrets
uv run modal secret create claude-api-key CLAUDE_API_KEY=your_api_key

# Deploy to Modal
uv run modal deploy llmtracefx/modal_app.py

# Test with sample data
uv run modal run llmtracefx/modal_app.py
```
Once deployed, your app is available at:
https://siddhant-k-code--llmtracefx-web-app.modal.run
```bash
# Test the deployed API
curl -X POST "https://siddhant-k-code--llmtracefx-web-app.modal.run/analyze-trace" \
  -H "Content-Type: application/json" \
  -d '{
    "trace_data": {
      "tokens": [
        {
          "id": 0,
          "text": "Hello",
          "operations": [
            {"name": "matmul", "start_time": 0, "duration": 15.3}
          ]
        }
      ]
    },
    "gpu_type": "A10G",
    "enable_claude": false
  }'
```
```bash
export CLAUDE_API_KEY="your_claude_api_key"
export DEFAULT_GPU_TYPE="A10G"  # or H100, A100
export ENABLE_CLAUDE="true"
export DASHBOARD_PORT="8000"
```
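For reference, a minimal sketch of reading these settings in Python (variable names mirror the exports above; the repo's actual config loader may differ):

```python
import os

# Hedged sketch: read the settings listed above from the environment.
CLAUDE_API_KEY = os.environ.get("CLAUDE_API_KEY", "")
DEFAULT_GPU_TYPE = os.environ.get("DEFAULT_GPU_TYPE", "A10G")  # or H100, A100
ENABLE_CLAUDE = os.environ.get("ENABLE_CLAUDE", "true").lower() == "true"
DASHBOARD_PORT = int(os.environ.get("DASHBOARD_PORT", "8000"))
```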
- Get an API key from Anthropic
- Set the environment variable: `export CLAUDE_API_KEY="your_key"`
- Or create a Modal secret: `modal secret create claude-api-key CLAUDE_API_KEY=your_key`
```
🔍 Analyzing trace: sample
📊 Using sample trace data
🔧 Analyzing GPU performance (GPU: A10G)
📈 Analysis complete:
   Total tokens: 5
   Total latency: 120.5ms
   Avg latency per token: 24.1ms
   Avg performance score: 67.3/100
```
- Flame Graph: Token vs operations timeline
- Bottleneck Distribution: Types of performance issues
- Performance Trends: Latency and score over time
- Heatmap: Operation duration patterns
- GPU Metrics: Radar charts for detailed analysis
Base URL: https://siddhant-k-code--llmtracefx-web-app.modal.run
```
POST /upload-trace           # Upload trace file
POST /analyze-trace          # Analyze trace data
GET  /analysis/{id}          # Get analysis summary
GET  /token/{id}/{token}     # Get token details
GET  /explain/{id}/{token}   # Get Claude explanation
GET  /dashboard/{id}         # Get HTML dashboard
GET  /export/{id}            # Export JSON data
```
For local development: http://localhost:8000
```python
import requests

# Analyze trace data directly
response = requests.post(
    'https://siddhant-k-code--llmtracefx-web-app.modal.run/analyze-trace',
    json={
        "trace_data": {
            "tokens": [
                {
                    "id": 0,
                    "text": "Hello",
                    "operations": [
                        {"name": "matmul", "start_time": 0, "duration": 15.3}
                    ]
                }
            ]
        },
        "gpu_type": "A10G",
        "enable_claude": True
    }
)
analysis_id = response.json()['analysis_id']

# Get dashboard
dashboard = requests.get(f'https://siddhant-k-code--llmtracefx-web-app.modal.run/dashboard/{analysis_id}')
with open('dashboard.html', 'w') as f:
    f.write(dashboard.text)

print(f"Performance score: {response.json()['avg_performance_score']:.1f}/100")
```
```bash
# Upload your vLLM trace file
curl -X POST "https://siddhant-k-code--llmtracefx-web-app.modal.run/upload-trace" \
  -F "file=@your_trace.json" \
  -F "gpu_type=A10G" \
  -F "enable_claude=true"
```
```python
import requests

# Upload trace (local server)
with open('trace.json', 'rb') as f:
    response = requests.post('http://localhost:8000/upload-trace', files={'file': f})
analysis_id = response.json()['analysis_id']

# Get dashboard (local server)
dashboard = requests.get(f'http://localhost:8000/dashboard/{analysis_id}')
```
vLLM-style token format:

```json
{
  "tokens": [
    {
      "id": 0,
      "text": "Hello",
      "operations": [
        {"name": "embedding", "start_time": 0, "duration": 2.1},
        {"name": "rmsnorm", "start_time": 2.1, "duration": 1.8},
        {"name": "matmul", "start_time": 3.9, "duration": 15.3}
      ]
    }
  ]
}
```
Generic event-based format:

```json
{
  "events": [
    {
      "token_id": 0,
      "token_text": "Hello",
      "op_name": "matmul",
      "timestamp": 12.1,
      "duration": 15.3,
      "metadata": {}
    }
  ]
}
```
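Both formats carry the same information, so a small adapter can bridge them. A hedged sketch (the `events_to_tokens` helper is illustrative, not part of the LLMTraceFX API):

```python
from collections import defaultdict

def events_to_tokens(events: list[dict]) -> dict:
    """Group flat events (generic format) into the token-based format above."""
    ops_by_token = defaultdict(list)
    texts = {}
    for ev in events:
        texts[ev["token_id"]] = ev["token_text"]
        ops_by_token[ev["token_id"]].append({
            "name": ev["op_name"],
            "start_time": ev["timestamp"],
            "duration": ev["duration"],
        })
    return {
        "tokens": [
            {"id": tid, "text": texts[tid], "operations": ops}
            for tid, ops in sorted(ops_by_token.items())
        ]
    }

# Per-token latency is simply the sum of its operation durations
trace = events_to_tokens([
    {"token_id": 0, "token_text": "Hello", "op_name": "matmul",
     "timestamp": 12.1, "duration": 15.3, "metadata": {}},
])
for tok in trace["tokens"]:
    print(tok["id"], sum(op["duration"] for op in tok["operations"]), "ms")
```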
Supported operations:

- `rmsnorm`/`layernorm` - Normalization layers
- `linear`/`matmul` - Matrix operations
- `softmax` - Attention computations
- `kvload`/`kvstore` - Key-Value cache operations
- `attention` - Attention mechanisms
- `embedding` - Token embeddings
- Stall Percentage: Memory-bound bottlenecks
- Launch Delay: Kernel launch overhead
- SM Occupancy: Streaming multiprocessor utilization
- Cache Hit Rate: Memory access efficiency
- Compute Utilization: GPU computational usage
- A10G: 24GB VRAM, 600 GB/s bandwidth
- H100: 80GB VRAM, 3350 GB/s bandwidth
- A100: 80GB VRAM, 1935 GB/s bandwidth
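These bandwidth figures put a hard floor under memory-bound kernels: a kernel that must move B bytes cannot finish faster than B divided by bandwidth. A back-of-the-envelope sketch (the 2 GB figure is a made-up example, and real kernels overlap compute with memory traffic):

```python
# Peak memory bandwidth in GB/s, from the list above
BANDWIDTH_GBPS = {"A10G": 600, "H100": 3350, "A100": 1935}

def memory_floor_ms(bytes_moved: float, gpu: str) -> float:
    """Lower bound on kernel time if it must move `bytes_moved` bytes."""
    gbps = BANDWIDTH_GBPS[gpu]
    return bytes_moved / (gbps * 1e9) * 1e3  # seconds -> ms

# Example: a matmul streaming ~2 GB of weights per decode step
for gpu in ("A10G", "A100", "H100"):
    print(f"{gpu}: >= {memory_floor_ms(2e9, gpu):.2f} ms")
```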
- Performance Summary: High-level bottleneck analysis
- Technical Details: GPU-specific explanations
- Optimization Suggestions: Actionable improvements
- Severity Assessment: Priority ranking
🔍 Token 42 Analysis
**Summary:** MatMul operation shows 33% memory stall due to poor coalescing
**Technical Details:** The matrix multiplication kernel is experiencing
significant memory bandwidth limitations due to non-coalesced memory access
patterns. This is causing the GPU to wait for memory operations.
**Optimization Recommendations:**
• Consider transposing matrices for better memory layout
• Implement tiling strategies to improve cache utilization
• Use tensor cores if available for better compute efficiency
**Severity:** HIGH
Bottleneck types:

- `memory_stall`: High memory latency
- `launch_overhead`: Kernel launch delays
- `low_occupancy`: Underutilized GPU cores
- `cache_miss`: Poor memory locality
- `compute_underutilization`: Low computational throughput
Optimization suggestions:

- `high_memory_stall`: Memory bandwidth issues
- `kernel_fusion_candidate`: Multiple small kernels
- `increase_occupancy`: Low SM utilization
- `improve_data_locality`: Cache optimization needed
- `norm_linear_fusion`: Specific fusion opportunity
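To make the metric-to-label mapping concrete, here is a minimal sketch of how such a rule-based classifier could look; the thresholds and the bottleneck-to-suggestion pairing are illustrative assumptions, not LLMTraceFX's actual rules:

```python
def classify(metrics: dict) -> list[str]:
    """Map GPU metrics to the bottleneck labels above (thresholds assumed)."""
    bottlenecks = []
    if metrics["stall_pct"] > 30:
        bottlenecks.append("memory_stall")
    if metrics["launch_delay_ms"] > 0.5:
        bottlenecks.append("launch_overhead")
    if metrics["sm_occupancy"] < 50:
        bottlenecks.append("low_occupancy")
    if metrics["cache_hit_rate"] < 70:
        bottlenecks.append("cache_miss")
    if metrics["compute_utilization"] < 40:
        bottlenecks.append("compute_underutilization")
    return bottlenecks

# Illustrative pairing of bottlenecks to suggestion labels
SUGGESTIONS = {
    "memory_stall": "high_memory_stall",
    "launch_overhead": "kernel_fusion_candidate",
    "low_occupancy": "increase_occupancy",
    "cache_miss": "improve_data_locality",
}

metrics = {"stall_pct": 33, "launch_delay_ms": 0.2, "sm_occupancy": 45,
           "cache_hit_rate": 80, "compute_utilization": 55}
for b in classify(metrics):
    print(b, "->", SUGGESTIONS.get(b, "n/a"))
```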
- Web API: https://siddhant-k-code--llmtracefx-web-app.modal.run
- Modal Dashboard: https://modal.com/apps/siddhant-k-code/main/deployed/llmtracefx
- GPU: A10G acceleration available
- Claude Integration: AI explanations ready
Modal functions:

- `analyze_trace_modal`: Full trace analysis with GPU
- `explain_token_modal`: Individual token explanations
- `web_app`: FastAPI web endpoint (deployed)
- `run_server`: FastAPI server for local development
- `create_sample_trace`: Generate test data
```bash
# Deploy app
uv run modal deploy llmtracefx/modal_app.py

# Run analysis
uv run modal run llmtracefx/modal_app.py

# Test deployed API
curl -X POST "https://siddhant-k-code--llmtracefx-web-app.modal.run/analyze-trace" \
  -H "Content-Type: application/json" \
  -d '{"trace_data": {"tokens": [{"id": 0, "text": "test", "operations": [{"name": "matmul", "start_time": 0, "duration": 10.0}]}]}, "gpu_type": "A10G", "enable_claude": false}'

# View deployment status
uv run modal app list

# Check function logs
uv run modal app logs llmtracefx

# Stop deployment
uv run modal app stop llmtracefx
```
```bash
# Generate sample trace data
python -m llmtracefx.main --create-sample

# Analyze the generated sample trace
python -m llmtracefx.main --trace test_traces/sample_vllm_trace.json
```
This project is licensed under the MIT License - see the LICENSE file for details.