
🚀 LLMTraceFX

GPU-level LLM inference profiler that analyzes token-level performance and provides AI-powered explanations.

🌐 Live Demo

Try it now: https://siddhant-k-code--llmtracefx-web-app.modal.run (may not be available at all times due to Modal's free-tier limits 🙈)

Quick API test:

curl -X POST "https://siddhant-k-code--llmtracefx-web-app.modal.run/analyze-trace" \
-H "Content-Type: application/json" \
-d '{"trace_data": {"tokens": [{"id": 0, "text": "Hello", "operations": [{"name": "matmul", "start_time": 0, "duration": 15.3}]}]}, "gpu_type": "A10G", "enable_claude": false}'

Upload your trace file:

curl -X POST "https://siddhant-k-code--llmtracefx-web-app.modal.run/upload-trace" \
     -F "file=@your_trace.json" -F "gpu_type=A10G" -F "enable_claude=true"

🎯 Features

  • Token-level profiling of LLM inference with kernel timing analysis
  • GPU bottleneck detection (stall %, launch delays, memory issues)
  • AI explanations using Claude API for performance insights
  • Interactive visualizations with flame graphs and dashboards
  • Modal.com deployment with GPU acceleration
  • Multiple input formats (vLLM, generic trace logs)

📦 Installation

Using uv (Recommended)

git clone https://github.com/Siddhant-K-code/LLMTraceFX.git
cd LLMTraceFX

# Install uv if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create virtual environment and install dependencies
uv sync

# Or install in development mode with optional dependencies
uv sync --extra dev --extra test

Using pip

git clone https://github.com/Siddhant-K-code/LLMTraceFX.git
cd LLMTraceFX
pip install -r llmtracefx/requirements.txt

# Or install as editable package
pip install -e .

🔧 Quick Start

1. CLI Usage

# With uv
uv run llmtracefx --trace sample
uv run llmtracefx --trace your_trace.json --gpu-type A10G
uv run llmtracefx --trace sample --no-claude

# Or activate virtual environment first
uv sync
source .venv/bin/activate  # or .venv\Scripts\activate on Windows
llmtracefx --trace sample

# With pip/python
python -m llmtracefx.main --trace sample
python -m llmtracefx.main --trace your_trace.json --gpu-type A10G
python -m llmtracefx.main --trace sample --no-claude

2. FastAPI Server

# With uv
uv run llmtracefx-serve

# Or with python
python -m llmtracefx.api.serve

# Access at http://localhost:8000

3. Modal Deployment

# Setup Modal secrets
uv run modal secret create claude-api-key CLAUDE_API_KEY=your_api_key

# Deploy to Modal
uv run modal deploy llmtracefx/modal_app.py

# Test with sample data
uv run modal run llmtracefx/modal_app.py

🌐 Live Web API

Once deployed, your app is available at:

https://siddhant-k-code--llmtracefx-web-app.modal.run

Quick API Test

# Test the deployed API
curl -X POST "https://siddhant-k-code--llmtracefx-web-app.modal.run/analyze-trace" \
-H "Content-Type: application/json" \
-d '{
  "trace_data": {
    "tokens": [
      {
        "id": 0,
        "text": "Hello",
        "operations": [
          {"name": "matmul", "start_time": 0, "duration": 15.3}
        ]
      }
    ]
  },
  "gpu_type": "A10G",
  "enable_claude": false
}'

🔑 Configuration

Environment Variables

export CLAUDE_API_KEY="your_claude_api_key"
export DEFAULT_GPU_TYPE="A10G"  # or H100, A100
export ENABLE_CLAUDE="true"
export DASHBOARD_PORT="8000"
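
For reference, a minimal sketch of reading these variables in Python; the variable names come from the list above, while the defaults and helper layout are illustrative, not part of the package:

import os

# Illustrative config loader; names match the README above,
# defaults are assumptions for this sketch.
CLAUDE_API_KEY = os.environ.get("CLAUDE_API_KEY", "")
DEFAULT_GPU_TYPE = os.environ.get("DEFAULT_GPU_TYPE", "A10G")
ENABLE_CLAUDE = os.environ.get("ENABLE_CLAUDE", "true").lower() == "true"
DASHBOARD_PORT = int(os.environ.get("DASHBOARD_PORT", "8000"))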

Claude API Setup

  1. Get an API key from Anthropic
  2. Set environment variable: export CLAUDE_API_KEY="your_key"
  3. Or create Modal secret: modal secret create claude-api-key CLAUDE_API_KEY=your_key

📊 Output Examples

CLI Output

🔍 Analyzing trace: sample
📊 Using sample trace data
🔧 Analyzing GPU performance (GPU: A10G)
📈 Analysis complete:
   Total tokens: 5
   Total latency: 120.5ms
   Avg latency per token: 24.1ms
   Avg performance score: 67.3/100

Dashboard Features

  • Flame Graph: Token vs operations timeline
  • Bottleneck Distribution: Types of performance issues
  • Performance Trends: Latency and score over time
  • Heatmap: Operation duration patterns
  • GPU Metrics: Radar charts for detailed analysis

🎮 API Endpoints

🌐 Deployed API (Modal)

Base URL: https://siddhant-k-code--llmtracefx-web-app.modal.run

POST /upload-trace          # Upload trace file
POST /analyze-trace         # Analyze trace data
GET  /analysis/{id}         # Get analysis summary
GET  /token/{id}/{token}    # Get token details
GET  /explain/{id}/{token}  # Get Claude explanation
GET  /dashboard/{id}        # Get HTML dashboard
GET  /export/{id}           # Export JSON data

🏠 Local FastAPI Server

For local development: http://localhost:8000

Example Usage

Production API (Deployed)

import requests

# Analyze trace data directly
response = requests.post(
    'https://siddhant-k-code--llmtracefx-web-app.modal.run/analyze-trace',
    json={
        "trace_data": {
            "tokens": [
                {
                    "id": 0,
                    "text": "Hello",
                    "operations": [
                        {"name": "matmul", "start_time": 0, "duration": 15.3}
                    ]
                }
            ]
        },
        "gpu_type": "A10G",
        "enable_claude": True
    }
)

analysis_id = response.json()['analysis_id']

# Get dashboard
dashboard = requests.get(f'https://siddhant-k-code--llmtracefx-web-app.modal.run/dashboard/{analysis_id}')
with open('dashboard.html', 'w') as f:
    f.write(dashboard.text)

print(f"Performance score: {response.json()['avg_performance_score']:.1f}/100")

Upload Trace File

# Upload your vLLM trace file
curl -X POST "https://siddhant-k-code--llmtracefx-web-app.modal.run/upload-trace" \
     -F "file=@your_trace.json" \
     -F "gpu_type=A10G" \
     -F "enable_claude=true"

Local Development

import requests

# Upload trace (local server)
with open('trace.json', 'rb') as f:
    response = requests.post('http://localhost:8000/upload-trace', files={'file': f})

analysis_id = response.json()['analysis_id']

# Get dashboard (local server)
dashboard = requests.get(f'http://localhost:8000/dashboard/{analysis_id}')

🔬 Trace Format

vLLM Format

{
  "tokens": [
    {
      "id": 0,
      "text": "Hello",
      "operations": [
        {"name": "embedding", "start_time": 0, "duration": 2.1},
        {"name": "rmsnorm", "start_time": 2.1, "duration": 1.8},
        {"name": "matmul", "start_time": 3.9, "duration": 15.3}
      ]
    }
  ]
}
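
As a quick sanity check, a small sketch that loads a trace in this format and sums operation durations per token; the field names follow the example above:

import json

# Load a vLLM-format trace and report per-token latency.
# Field names ("tokens", "operations", "duration") follow the example above.
with open("trace.json") as f:
    trace = json.load(f)

for token in trace["tokens"]:
    total = sum(op["duration"] for op in token["operations"])
    print(f'token {token["id"]} ({token["text"]!r}): {total:.1f} ms total')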

Event Format

{
  "events": [
    {
      "token_id": 0,
      "token_text": "Hello",
      "op_name": "matmul",
      "timestamp": 12.1,
      "duration": 15.3,
      "metadata": {}
    }
  ]
}
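
If your logs arrive in the event format, conversion to the token format is straightforward. A minimal sketch assuming the field names shown in the two examples; this is not necessarily how LLMTraceFX normalizes traces internally:

from collections import defaultdict

# Group flat events into the token-format structure shown above.
# Field names follow the two examples; the grouping logic is an assumption.
def events_to_tokens(events):
    grouped = defaultdict(lambda: {"text": "", "operations": []})
    for ev in events:
        tok = grouped[ev["token_id"]]
        tok["text"] = ev["token_text"]
        tok["operations"].append({
            "name": ev["op_name"],
            "start_time": ev["timestamp"],
            "duration": ev["duration"],
        })
    return {"tokens": [{"id": tid, **tok} for tid, tok in sorted(grouped.items())]}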

🎯 GPU Analysis

Supported Operations

  • rmsnorm / layernorm - Normalization layers
  • linear / matmul - Matrix operations
  • softmax - Attention computations
  • kvload / kvstore - Key-Value cache operations
  • attention - Attention mechanisms
  • embedding - Token embeddings

GPU Metrics

  • Stall Percentage: Memory-bound bottlenecks
  • Launch Delay: Kernel launch overhead
  • SM Occupancy: Streaming multiprocessor utilization
  • Cache Hit Rate: Memory access efficiency
  • Compute Utilization: GPU computational usage
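
To make the metrics concrete, a hypothetical record plus a naive 0-100 score; the weighting is an illustration only, not the scoring LLMTraceFX actually uses:

from dataclasses import dataclass

@dataclass
class GpuMetrics:
    stall_pct: float          # % of cycles stalled on memory
    launch_delay_us: float    # kernel launch overhead
    sm_occupancy: float       # 0-1 streaming multiprocessor utilization
    cache_hit_rate: float     # 0-1 memory access efficiency
    compute_util: float       # 0-1 computational usage

def naive_score(m: GpuMetrics) -> float:
    # Illustrative weighting only; the real scoring is internal to the tool.
    good = (m.sm_occupancy + m.cache_hit_rate + m.compute_util) / 3
    bad = m.stall_pct / 100
    return max(0.0, min(100.0, 100 * good * (1 - bad)))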

Supported GPUs

  • A10G: 24GB VRAM, 600 GB/s bandwidth
  • H100: 80GB VRAM, 3350 GB/s bandwidth
  • A100: 80GB VRAM, 1935 GB/s bandwidth
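
These specs are enough for quick back-of-envelope checks. A sketch of a roofline-style lower bound on a memory-bound kernel's time, using the bandwidth figures above; the estimate is a simple bound, not a measurement:

# Bandwidth figures from the list above.
GPU_SPECS = {
    "A10G": {"vram_gb": 24, "bandwidth_gbps": 600},
    "H100": {"vram_gb": 80, "bandwidth_gbps": 3350},
    "A100": {"vram_gb": 80, "bandwidth_gbps": 1935},
}

def min_transfer_ms(bytes_moved: float, gpu: str = "A10G") -> float:
    # Time (ms) just to move the data at peak bandwidth.
    bw = GPU_SPECS[gpu]["bandwidth_gbps"] * 1e9  # bytes/s
    return bytes_moved / bw * 1e3

# e.g. streaming 7B fp16 weights (~14 GB) once on an A10G:
print(f"{min_transfer_ms(14e9, 'A10G'):.1f} ms")  # ≈ 23.3 ms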

🤖 Claude Integration

Explanation Types

  1. Performance Summary: High-level bottleneck analysis
  2. Technical Details: GPU-specific explanations
  3. Optimization Suggestions: Actionable improvements
  4. Severity Assessment: Priority ranking

Example Claude Output

🔍 Token 42 Analysis

**Summary:** MatMul operation shows 33% memory stall due to poor coalescing

**Technical Details:** The matrix multiplication kernel is experiencing
significant memory bandwidth limitations due to non-coalesced memory access
patterns. This is causing the GPU to wait for memory operations.

**Optimization Recommendations:**
• Consider transposing matrices for better memory layout
• Implement tiling strategies to improve cache utilization
• Use tensor cores if available for better compute efficiency

**Severity:** HIGH

📈 Performance Optimization

Bottleneck Types

  • memory_stall: High memory latency
  • launch_overhead: Kernel launch delays
  • low_occupancy: Underutilized GPU cores
  • cache_miss: Poor memory locality
  • compute_underutilization: Low computational throughput

Optimization Flags

  • high_memory_stall: Memory bandwidth issues
  • kernel_fusion_candidate: Multiple small kernels
  • increase_occupancy: Low SM utilization
  • improve_data_locality: Cache optimization needed
  • norm_linear_fusion: Specific fusion opportunity
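
A toy classifier tying the two lists together; the thresholds and rules are invented for illustration and are not the tool's actual heuristics:

def assign_flags(metrics: dict) -> list[str]:
    # Keys mirror the GPU Metrics section; thresholds are assumptions.
    flags = []
    if metrics["stall_pct"] > 30:
        flags.append("high_memory_stall")
    if metrics["sm_occupancy"] < 0.5:
        flags.append("increase_occupancy")
    if metrics["cache_hit_rate"] < 0.7:
        flags.append("improve_data_locality")
    return flags

print(assign_flags({"stall_pct": 42.0, "sm_occupancy": 0.35, "cache_hit_rate": 0.9}))
# ['high_memory_stall', 'increase_occupancy']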

🚀 Modal Deployment

🌐 Live Deployment

Functions

  • analyze_trace_modal: Full trace analysis with GPU
  • explain_token_modal: Individual token explanations
  • web_app: FastAPI web endpoint (deployed)
  • run_server: FastAPI server for local development
  • create_sample_trace: Generate test data

Deployment Commands

# Deploy app
uv run modal deploy llmtracefx/modal_app.py

# Run analysis
uv run modal run llmtracefx/modal_app.py

# Test deployed API
curl -X POST "https://siddhant-k-code--llmtracefx-web-app.modal.run/analyze-trace" \
-H "Content-Type: application/json" \
-d '{"trace_data": {"tokens": [{"id": 0, "text": "test", "operations": [{"name": "matmul", "start_time": 0, "duration": 10.0}]}]}, "gpu_type": "A10G", "enable_claude": false}'

Management

# View deployment status
uv run modal app list

# Check function logs
uv run modal app logs llmtracefx

# Stop deployment
uv run modal app stop llmtracefx

🧪 Testing

Create Sample Data

python -m llmtracefx.main --create-sample

Run Tests

python -m llmtracefx.main --trace test_traces/sample_vllm_trace.json

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.
