Commit edb14d8
Add offline and online inference blueprints and configuration files
- Introduced new JSON and YAML files for offline inference benchmarks using SGLang and vLLM backends.
- Added README documentation for both offline and online inference blueprints, detailing usage, supported scenarios, and sample configurations.
- Removed outdated README files for offline and online inference to streamline documentation.
1 parent a7fe7fd commit edb14d8

File tree

12 files changed (+268 additions, -410 deletions)
Lines changed: 171 additions & 0 deletions
@@ -0,0 +1,171 @@
# Offline Inference Blueprint - Infra (SGLang + vLLM)

#### Run offline LLM inference benchmarks using SGLang or vLLM backends with automated performance tracking and MLflow logging.

This blueprint provides a configurable framework to run **offline LLM inference benchmarks** using either the SGLang or vLLM backend. It is designed for cloud GPU environments and supports automated performance benchmarking with MLflow logging.

This blueprint enables you to:

- Run inference locally on GPU nodes using pre-loaded models
- Benchmark token throughput, latency, and request performance
- Push results to MLflow for comparison and analysis

---

## Pre-Filled Samples

| Feature Showcase | Title | Description | Blueprint File |
| ---------------- | ----- | ----------- | -------------- |
| Benchmark LLM performance using the SGLang backend with offline inference for accurate performance measurement | Offline inference with LLaMA 3 | Benchmarks the Meta-Llama-3.1-8B model using SGLang on VM.GPU.A10.2 with 2 GPUs. | [offline_deployment_sglang.json](offline_deployment_sglang.json) |
| Benchmark LLM performance using the vLLM backend with offline inference for token throughput analysis | Offline inference with LLaMA 3 - vLLM | Benchmarks the Meta-Llama-3.1-8B model using vLLM on VM.GPU.A10.2 with 2 GPUs. | [offline_deployment_vllm.json](offline_deployment_vllm.json) |

You can access these pre-filled samples from the OCI AI Blueprint portal.

---

## When to Use Offline Inference

Offline inference is ideal for:

- Accurate performance benchmarking (no API or network bottlenecks)
- Comparing GPU hardware performance (A10, A100, H100, MI300X)
- Evaluating backend frameworks like vLLM and SGLang

---

## Supported Backends

| Backend | Description |
| ------- | ----------- |
| sglang  | Fast multi-modal LLM backend with optimized throughput |
| vllm    | Token streaming inference engine for LLMs with speculative decoding |

---

## Running the Benchmark

What you need to run the benchmark:

- Model checkpoints pre-downloaded and stored in an Object Storage bucket.
- A Pre-Authenticated Request (PAR) for the Object Storage bucket where the models are saved, with list, read, and write permissions.
- A bucket to save the outputs. This does not take a PAR, so it should be a bucket in the same tenancy as your OCI AI Blueprints stack.
- A config `.yaml` file with all the parameters required to run the benchmark (for example `input_len`, `output_len`, and the GPU utilization value).
- A deployment `.json` file to deploy your blueprint.
- Sample deployment and config files are linked below; the fragment at the end of this section sketches how they fit together.

This blueprint supports benchmark execution via a job-mode recipe that takes a YAML config file. The recipe mounts the model and config file from Object Storage, runs offline inference, and logs metrics.

### Note: Make sure your output Object Storage bucket is in the same tenancy as your stack.
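
As a minimal sketch of how these pieces connect, the deployment `.json` mounts the bucket behind the PAR and points the container at the mounted config file. Angle-bracket placeholders are illustrative, not real values:

```json
{
  "input_object_storage": [
    {
      "par": "https://objectstorage.<region>.oraclecloud.com/p/<PAR_TOKEN>/n/<namespace>/b/<models_bucket>/o/",
      "mount_location": "/models",
      "volume_size_in_gbs": 100,
      "include": ["<model_folder>", "offline_sglang_example.yaml"]
    }
  ],
  "recipe_container_command_args": ["/models/offline_sglang_example.yaml"]
}
```

The config file itself (documented in the key tables further down) controls what the benchmark actually runs.
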

## Sample Blueprints

- [Sample Blueprint (Job Mode for Offline SGLang Inference)](offline_deployment_sglang.json)
- [Sample Blueprint (Job Mode for Offline vLLM Inference)](offline_deployment_vllm.json)
- [Sample Config File - SGLang](offline_sglang_example.yaml)
- [Sample Config File - vLLM](offline_vllm_example.yaml)

---

## Metrics Logged

- `requests_per_second`
- `input_tokens_per_second`
- `output_tokens_per_second`
- `total_tokens_per_second`
- `elapsed_time`
- `total_input_tokens`
- `total_output_tokens`

If a dataset is provided:

- `accuracy`
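
As a rough sketch of the output shape (assuming a flat JSON layout; the values below are purely illustrative), the metrics file written to `save_metrics_path` would contain entries like:

```json
{
  "requests_per_second": 12.4,
  "input_tokens_per_second": 1590.3,
  "output_tokens_per_second": 1601.8,
  "total_tokens_per_second": 3192.1,
  "elapsed_time": 41.3,
  "total_input_tokens": 65536,
  "total_output_tokens": 66112
}
```
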

### Top-level Deployment Keys

These keys appear at the top level of the sample deployment (`.json`) blueprints; the sections that follow cover the nested object-storage and runtime keys.

| Key | Description |
| --- | ----------- |
| `recipe_id` | Identifier of the recipe to run; here, it's an offline SGLang benchmark job. |
| `recipe_mode` | Specifies this is a `job`, meaning it runs to completion and exits. |
| `deployment_name` | Human-readable name for the job. |
| `recipe_image_uri` | Docker image containing the benchmark code and dependencies. |
| `recipe_node_shape` | Shape of the VM or GPU node to run the job (e.g., VM.GPU.A10.2). |

### Input Object Storage

| Key | Description |
| --- | ----------- |
| `input_object_storage` | List of inputs to mount from Object Storage. |
| `par` | Pre-Authenticated Request (PAR) link to a bucket/folder. |
| `mount_location` | Files are mounted to this path inside the container. |
| `volume_size_in_gbs` | Size of the mount volume. |
| `include` | Only these files/folders from the bucket are mounted (e.g., model + config). |

### Output Object Storage

| Key | Description |
| --- | ----------- |
| `output_object_storage` | Where to store outputs like benchmark logs or results. |
| `bucket_name` | Name of the output bucket in OCI Object Storage. |
| `mount_location` | Mount point inside the container where outputs are written. |
| `volume_size_in_gbs` | Size of this volume in GBs. |
### Runtime & Infra Settings
112+
113+
| Key | Description |
114+
| ---------------------------------------------- | ------------------------------------------------------------- |
115+
| `recipe_container_command_args` | Path to the YAML config that defines benchmark parameters. |
116+
| `recipe_replica_count` | Number of job replicas to run (usually 1 for inference). |
117+
| `recipe_container_port` | Port (optional for offline mode; required if API is exposed). |
118+
| `recipe_nvidia_gpu_count` | Number of GPUs allocated to this job. |
119+
| `recipe_node_pool_size` | Number of nodes in the pool (1 means 1 VM). |
120+
| `recipe_node_boot_volume_size_in_gbs` | Disk size for OS + dependencies. |
121+
| `recipe_ephemeral_storage_size` | Local scratch space in GBs. |
122+
| `recipe_shared_memory_volume_size_limit_in_mb` | Shared memory (used by some inference engines). |
123+
124+
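
Putting these keys together, a deployment blueprint looks roughly like the sketch below. This is a hedged example, not the shipped sample: the `recipe_id`, image URI, PAR, bucket names, and sizes are illustrative placeholders, and the list layout of `output_object_storage` mirrors the input block as an assumption. Refer to [offline_deployment_sglang.json](offline_deployment_sglang.json) for the actual values.

```json
{
  "recipe_id": "<offline_benchmark_recipe_id>",
  "recipe_mode": "job",
  "deployment_name": "Offline Inference Benchmark - SGLang",
  "recipe_image_uri": "<region>.ocir.io/<tenancy_namespace>/<repository>:<tag>",
  "recipe_node_shape": "VM.GPU.A10.2",
  "input_object_storage": [
    {
      "par": "https://objectstorage.<region>.oraclecloud.com/p/<PAR_TOKEN>/n/<namespace>/b/<models_bucket>/o/",
      "mount_location": "/models",
      "volume_size_in_gbs": 100,
      "include": ["<model_folder>", "offline_sglang_example.yaml"]
    }
  ],
  "output_object_storage": [
    {
      "bucket_name": "<results_bucket>",
      "mount_location": "/results",
      "volume_size_in_gbs": 50
    }
  ],
  "recipe_container_command_args": ["/models/offline_sglang_example.yaml"],
  "recipe_replica_count": 1,
  "recipe_nvidia_gpu_count": 2,
  "recipe_node_pool_size": 1,
  "recipe_node_boot_volume_size_in_gbs": 200,
  "recipe_ephemeral_storage_size": 100,
  "recipe_shared_memory_volume_size_limit_in_mb": 10240
}
```
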
---

## Sample Config File (`offline_sglang_example.yaml`)

This file is consumed by the container during execution to configure the benchmark run.

### Inference Setup

| Key | Description |
| --- | ----------- |
| `benchmark_type` | Set to `offline` to indicate local execution with no HTTP server. |
| `offline_backend` | Backend engine to use (`sglang` or `vllm`). |
| `model_path` | Path to the model directory (already mounted via Object Storage). |
| `tokenizer_path` | Path to the tokenizer (usually same as model path). |
| `trust_remote_code` | Enables loading models that require custom code (Hugging Face). |
| `conv_template` | Prompt formatting template to use (e.g., `llama-2`). |

### Benchmark Parameters

| Key | Description |
| --- | ----------- |
| `input_len` | Number of tokens in the input prompt. |
| `output_len` | Number of tokens to generate. |
| `num_prompts` | Number of total prompts to run (e.g., 64 prompts x 128 output tokens). |
| `max_seq_len` | Max sequence length supported by the model (e.g., 4096). |
| `max_batch_size` | Max batch size per inference run (depends on GPU memory). |
| `dtype` | Precision (e.g., float16, bfloat16, auto). |

### Sampling Settings

| Key | Description |
| --- | ----------- |
| `temperature` | Controls randomness in generation (lower = more deterministic). |
| `top_p` | Top-p sampling for diversity (0.9 keeps most probable tokens). |

### MLflow Logging

| Key | Description |
| --- | ----------- |
| `mlflow_uri` | MLflow server to log performance metrics. |
| `experiment_name` | Experiment name to group runs in the MLflow UI. |
| `run_name` | Custom name to identify this particular run. |

### Output

| Key | Description |
| --- | ----------- |
| `save_metrics_path` | Path inside the container where metrics will be saved as JSON. |
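
Putting the keys above together, a config file might look roughly like the following. This is a hedged sketch with illustrative values (the model paths, lengths, template, and MLflow URI are placeholders); see [offline_sglang_example.yaml](offline_sglang_example.yaml) for the actual sample.

```yaml
benchmark_type: offline
offline_backend: sglang                  # or "vllm"

model_path: /models/<model_folder>       # mounted from Object Storage
tokenizer_path: /models/<model_folder>   # usually the same as model_path
trust_remote_code: true
conv_template: llama-2                   # prompt template (table example)

input_len: 128
output_len: 128
num_prompts: 64
max_seq_len: 4096
max_batch_size: 8
dtype: auto

temperature: 0.7
top_p: 0.9

mlflow_uri: http://<mlflow-host>:5000
experiment_name: offline-sglang-benchmark
run_name: llama3-8b-a10-2gpu

save_metrics_path: /results/metrics.json
```
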
Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
# Online Inference Blueprint (LLMPerf)

#### Benchmark online inference performance of large language models using LLMPerf, a standardized benchmarking tool.

This blueprint benchmarks **online inference performance** of large language models using **LLMPerf**, a standardized benchmarking tool. It is designed to evaluate LLM APIs served via platforms such as OpenAI-compatible interfaces, including self-hosted LLM inference endpoints.

This blueprint helps you:

- Simulate real-time request load on a running model server
- Measure end-to-end latency, throughput, and completion performance
- Push results to MLflow for visibility and tracking

---

## Pre-Filled Samples

| Feature Showcase | Title | Description | Blueprint File |
| ---------------- | ----- | ----------- | -------------- |
| Benchmark live LLM API endpoints using LLMPerf to measure real-time performance and latency metrics | Online inference on LLaMA 3 using LLMPerf | Benchmark of meta/llama3-8b-instruct via a local OpenAI-compatible endpoint | [online_deployment.json](online_deployment.json) |

These can be accessed directly from the OCI AI Blueprint portal.

---

## Prerequisites

Before running this blueprint:

- You **must have an inference server already running** that is compatible with the OpenAI API format.
- Ensure the endpoint and model name match what's defined in the config; see the fragment below.
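
For example, these keys in the config (values shown are from the sample `example_online.yaml` included in this commit) are the ones that must line up with your running server:

```yaml
model: meta/llama3-8b-instruct          # must match the model name your endpoint serves
llm_api: openai                         # OpenAI-compatible API format
llm_api_base: http://localhost:8001/v1  # base URL of the running inference server
llm_api_key: dummy-key                  # dummy value used by the sample config
```
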
---

## Supported Scenarios

| Use Case | Description |
| -------- | ----------- |
| Local LLM APIs | Benchmark your own self-hosted models (e.g., vLLM) |
| Remote OpenAI API | Benchmark OpenAI deployments for throughput analysis |
| Multi-model endpoints | Test latency/throughput across different configurations |

---

## Sample Blueprints

- [Sample Blueprint (Job Mode for Online Benchmarking)](online_inference_job.json)
- [Sample Config File](example_online.yaml)

---

## Metrics Logged

- `output_tokens_per_second`
- `requests_per_minute`
- `overall_output_throughput`
- All raw metrics from the `_summary.json` output of LLMPerf
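
As a rough illustration (the values are invented, and the exact field set comes from LLMPerf's `_summary.json`), the headline metrics pushed to MLflow for a run might look like:

```json
{
  "output_tokens_per_second": 145.2,
  "requests_per_minute": 38.5,
  "overall_output_throughput": 152.7
}
```
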
---
Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
benchmark_type: online

model: meta/llama3-8b-instruct
input_len: 64
output_len: 32
max_requests: 5
timeout: 300
num_concurrent: 1
results_dir: /workspace/results_on
llm_api: openai
llm_api_key: dummy-key
llm_api_base: http://localhost:8001/v1

experiment_name: local-bench
run_name: llama3-test
mlflow_uri: http://mlflow-benchmarking.corrino-oci.com:5000
llmperf_path: /opt/llmperf-src
metadata: test=localhost
Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
{
  "recipe_id": "online_inference_benchmark",
  "recipe_mode": "job",
  "deployment_name": "Online Inference Benchmark",
  "recipe_image_uri": "iad.ocir.io/iduyx1qnmway/corrino-devops-repository:llm-benchmark-0409-v2",
  "recipe_node_shape": "VM.GPU.A10.2",
  "input_object_storage": [
    {
      "par": "https://objectstorage.ap-melbourne-1.oraclecloud.com/p/Z2q73uuLCAxCbGXJ99CIeTxnCTNipsE-1xHE9HYfCz0RBYPTcCbqi9KHViUEH-Wq/n/iduyx1qnmway/b/mymodels/o/",
      "mount_location": "/models",
      "volume_size_in_gbs": 100,
      "include": ["example_online.yaml"]
    }
  ],
  "recipe_container_command_args": ["/models/example_online.yaml"],
  "recipe_replica_count": 1,
  "recipe_container_port": "8000",
  "recipe_node_pool_size": 1,
  "recipe_node_boot_volume_size_in_gbs": 200,
  "recipe_ephemeral_storage_size": 100
}
