# Offline Inference Blueprint - Infra (SGLang + vLLM)

#### Run offline LLM inference benchmarks using SGLang or vLLM backends with automated performance tracking and MLflow logging.

This blueprint provides a configurable framework for running **offline LLM inference benchmarks** using either the SGLang or vLLM backend. It is designed for cloud GPU environments and supports automated performance benchmarking with MLflow logging.

This blueprint enables you to:

- Run inference locally on GPU nodes using pre-loaded models
- Benchmark token throughput, latency, and request performance
- Push results to MLflow for comparison and analysis

---

## Pre-Filled Samples

| Feature Showcase                                                                                                | Title                                 | Description                                                                      | Blueprint File                                                    |
| --------------------------------------------------------------------------------------------------------------- | -------------------------------------- | --------------------------------------------------------------------------------- | ------------------------------------------------------------------ |
| Benchmark LLM performance using the SGLang backend with offline inference for accurate performance measurement | Offline inference with LLaMA 3        | Benchmarks the Meta-Llama-3.1-8B model using SGLang on VM.GPU.A10.2 with 2 GPUs. | [offline_deployment_sglang.json](offline_deployment_sglang.json) |
| Benchmark LLM performance using the vLLM backend with offline inference for token throughput analysis          | Offline inference with LLaMA 3 - vLLM | Benchmarks the Meta-Llama-3.1-8B model using vLLM on VM.GPU.A10.2 with 2 GPUs.   | [offline_deployment_vllm.json](offline_deployment_vllm.json)     |

You can access these pre-filled samples from the OCI AI Blueprint portal.

---

## When to Use Offline Inference

Offline inference is ideal for:

- Accurate performance benchmarking (no API or network bottlenecks)
- Comparing GPU hardware performance (A10, A100, H100, MI300X)
- Evaluating backend frameworks like vLLM and SGLang

---

## Supported Backends

| Backend | Description                                                         |
| ------- | ------------------------------------------------------------------- |
| sglang  | Fast multi-modal LLM backend with optimized throughput              |
| vllm    | Token streaming inference engine for LLMs with speculative decoding |

---

## Running the Benchmark

To run the benchmark you need:

- Model checkpoints pre-downloaded and stored in an Object Storage bucket.
  - Make sure to get a Pre-Authenticated Request (PAR) for the bucket where the models are saved, with listing, read, and write permissions (see the example command after this list).
- A bucket to save the outputs. This does not take a PAR, so it should be a bucket in the same tenancy as your OCI AI Blueprints stack.
- A config `.yaml` file with all the parameters required to run the benchmark, such as `input_len`, `output_len`, and the GPU utilization value.
- A deployment `.json` file to deploy your blueprint.
- Sample deployment and config files are linked below.
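
A PAR with these permissions can be created from the OCI Console or with the OCI CLI. The command below is only a sketch: the bucket name and expiry timestamp are placeholders, and you should verify the options against your OCI CLI version.

```bash
# Sketch only: create a bucket-level PAR that allows reading and writing
# all objects, plus listing them. Bucket name and expiry are placeholders.
oci os preauth-request create \
  --bucket-name <your-model-bucket> \
  --name offline-benchmark-par \
  --access-type AnyObjectReadWrite \
  --bucket-listing-action ListObjects \
  --time-expires "2026-01-01T00:00:00Z"
```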

This blueprint supports benchmark execution via a job-mode recipe driven by a YAML config file. The recipe mounts the model and config file from Object Storage, runs offline inference, and logs metrics.

**Note:** Make sure your output Object Storage bucket is in the same tenancy as your stack.

## Sample Blueprints

- [Sample Blueprint (Job Mode for Offline SGLang Inference)](offline_deployment_sglang.json)
- [Sample Blueprint (Job Mode for Offline vLLM Inference)](offline_deployment_vllm.json)
- [Sample Config File - SGLang](offline_sglang_example.yaml)
- [Sample Config File - vLLM](offline_vllm_example.yaml)

---

## Metrics Logged

- `requests_per_second`
- `input_tokens_per_second`
- `output_tokens_per_second`
- `total_tokens_per_second`
- `elapsed_time`
- `total_input_tokens`
- `total_output_tokens`

If a dataset is provided:

- `accuracy`
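
These metrics are pushed to MLflow and also saved as JSON to the path given by `save_metrics_path` (see the config reference below). As a rough illustration (field names mirror the list above; all numbers are made up), the saved file might look like:

```json
{
  "requests_per_second": 2.4,
  "input_tokens_per_second": 307.2,
  "output_tokens_per_second": 307.2,
  "total_tokens_per_second": 614.4,
  "elapsed_time": 26.7,
  "total_input_tokens": 8192,
  "total_output_tokens": 8192
}
```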

The following sections describe the keys used in the deployment `.json` blueprint files.

### Top-level Deployment Keys

| Key                 | Description                                                                  |
| ------------------- | ---------------------------------------------------------------------------- |
| `recipe_id`         | Identifier of the recipe to run; here, it's an offline SGLang benchmark job. |
| `recipe_mode`       | Specifies this is a `job`, meaning it runs to completion and exits.          |
| `deployment_name`   | Human-readable name for the job.                                             |
| `recipe_image_uri`  | Docker image containing the benchmark code and dependencies.                 |
| `recipe_node_shape` | Shape of the VM or GPU node to run the job (e.g., VM.GPU.A10.2).             |
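
Assembled from the keys above, the top level of a deployment file looks roughly like the sketch below. All values are illustrative placeholders, not tested settings; the linked sample blueprints are the authoritative reference.

```json
{
  "recipe_id": "offline_inference_sglang",
  "recipe_mode": "job",
  "deployment_name": "offline-sglang-benchmark",
  "recipe_image_uri": "<region>.ocir.io/<tenancy-namespace>/<benchmark-image>:<tag>",
  "recipe_node_shape": "VM.GPU.A10.2"
}
```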

### Input Object Storage

| Key                    | Description                                                                  |
| ---------------------- | ----------------------------------------------------------------------------- |
| `input_object_storage` | List of inputs to mount from Object Storage.                                 |
| `par`                  | Pre-Authenticated Request (PAR) link to a bucket/folder.                     |
| `mount_location`       | Files are mounted to this path inside the container.                         |
| `volume_size_in_gbs`   | Size of the mount volume.                                                    |
| `include`              | Only these files/folders from the bucket are mounted (e.g., model + config). |
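
An illustrative excerpt (the PAR URL, paths, and size are placeholders; shown with enclosing braces so it parses as JSON):

```json
{
  "input_object_storage": [
    {
      "par": "https://objectstorage.<region>.oraclecloud.com/p/<par-token>/n/<namespace>/b/<bucket>/o/",
      "mount_location": "/models",
      "volume_size_in_gbs": 500,
      "include": ["<model-folder>", "<benchmark-config>.yaml"]
    }
  ]
}
```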

### Output Object Storage

| Key                     | Description                                             |
| ----------------------- | -------------------------------------------------------- |
| `output_object_storage` | Where to store outputs like benchmark logs or results.  |
| `bucket_name`           | Name of the output bucket in OCI Object Storage.        |
| `mount_location`        | Mount point inside container where outputs are written. |
| `volume_size_in_gbs`    | Size of this volume in GBs.                             |
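
An illustrative excerpt (bucket name and size are placeholders; the bucket must be in the same tenancy as your stack, as noted above):

```json
{
  "output_object_storage": [
    {
      "bucket_name": "<results-bucket>",
      "mount_location": "/results",
      "volume_size_in_gbs": 50
    }
  ]
}
```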

### Runtime & Infra Settings

| Key                                            | Description                                                   |
| ---------------------------------------------- | -------------------------------------------------------------- |
| `recipe_container_command_args`                | Path to the YAML config that defines benchmark parameters.    |
| `recipe_replica_count`                         | Number of job replicas to run (usually 1 for inference).      |
| `recipe_container_port`                        | Port (optional for offline mode; required if API is exposed). |
| `recipe_nvidia_gpu_count`                      | Number of GPUs allocated to this job.                         |
| `recipe_node_pool_size`                        | Number of nodes in the pool (1 means 1 VM).                   |
| `recipe_node_boot_volume_size_in_gbs`          | Disk size for OS + dependencies.                              |
| `recipe_ephemeral_storage_size`                | Local scratch space in GBs.                                   |
| `recipe_shared_memory_volume_size_limit_in_mb` | Shared memory (used by some inference engines).               |
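
A sketch of how these keys might be filled in for a 2-GPU A10 run. The numbers are illustrative, not tuned recommendations, and the exact argument format (here assumed to be a string array pointing at the mounted config) should be checked against the sample blueprints; `recipe_container_port` is omitted since it is optional in offline mode.

```json
{
  "recipe_container_command_args": ["/models/<benchmark-config>.yaml"],
  "recipe_replica_count": 1,
  "recipe_nvidia_gpu_count": 2,
  "recipe_node_pool_size": 1,
  "recipe_node_boot_volume_size_in_gbs": 200,
  "recipe_ephemeral_storage_size": 100,
  "recipe_shared_memory_volume_size_limit_in_mb": 10240
}
```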

---

## Sample Config File (`example_sglang.yaml`)

This file is consumed by the container during execution to configure the benchmark run.

### Inference Setup

| Key                 | Description                                                       |
| ------------------- | ------------------------------------------------------------------ |
| `benchmark_type`    | Set to `offline` to indicate local execution with no HTTP server. |
| `offline_backend`   | Backend engine to use (`sglang` or `vllm`).                       |
| `model_path`        | Path to the model directory (already mounted via Object Storage). |
| `tokenizer_path`    | Path to the tokenizer (usually same as model path).               |
| `trust_remote_code` | Enables loading models that require custom code (Hugging Face).   |
| `conv_template`     | Prompt formatting template to use (e.g., `llama-2`).              |
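
In YAML, this section might look like the sketch below, assuming the model was mounted under `/models` as in the input example above. Paths and the template name are placeholders; see the linked sample config files for working values.

```yaml
benchmark_type: offline            # local execution, no HTTP server
offline_backend: sglang            # or: vllm
model_path: /models/<model-folder>
tokenizer_path: /models/<model-folder>
trust_remote_code: true
conv_template: llama-2
```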

### Benchmark Parameters

| Key              | Description                                                            |
| ---------------- | ----------------------------------------------------------------------- |
| `input_len`      | Number of tokens in the input prompt.                                  |
| `output_len`     | Number of tokens to generate.                                          |
| `num_prompts`    | Number of total prompts to run (e.g., 64 prompts x 128 output tokens). |
| `max_seq_len`    | Max sequence length supported by the model (e.g., 4096).               |
| `max_batch_size` | Max batch size per inference run (depends on GPU memory).              |
| `dtype`          | Precision (e.g., float16, bfloat16, auto).                             |
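
An illustrative sketch of these parameters (the numbers are examples, not recommendations; `max_batch_size` in particular depends on your GPU memory):

```yaml
input_len: 128        # tokens per input prompt
output_len: 128       # tokens generated per prompt
num_prompts: 64
max_seq_len: 4096
max_batch_size: 8     # illustrative; size to your GPU memory
dtype: auto
```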

### Sampling Settings

| Key           | Description                                                     |
| ------------- | ---------------------------------------------------------------- |
| `temperature` | Controls randomness in generation (lower = more deterministic). |
| `top_p`       | Top-p sampling for diversity (0.9 keeps most probable tokens).  |
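
For example (illustrative values):

```yaml
temperature: 0.7      # lower = more deterministic
top_p: 0.9
```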

### MLflow Logging

| Key               | Description                                  |
| ----------------- | --------------------------------------------- |
| `mlflow_uri`      | MLflow server to log performance metrics.    |
| `experiment_name` | Experiment name to group runs in MLflow UI.  |
| `run_name`        | Custom name to identify this particular run. |
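
A sketch, assuming an MLflow tracking server is reachable from the job (the host and names below are placeholders):

```yaml
mlflow_uri: http://<mlflow-host>:5000
experiment_name: offline-inference-benchmarks
run_name: llama3-8b-sglang-a10
```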

### Output

| Key                 | Description                                                    |
| ------------------- | --------------------------------------------------------------- |
| `save_metrics_path` | Path inside the container where metrics will be saved as JSON. |
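
For example, pointing the metrics file at the output mount so it lands in your output bucket (the path is illustrative and matches the `/results` mount used in the sketches above):

```yaml
save_metrics_path: /results/metrics.json
```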