LLM API Performance Evaluation Tool Guide

This article provides a quick start guide for Linux and Windows platforms, including command examples for downloading, configuring, and running the tool.

This tool benchmarks LLM API performance, including prefill speed, decoding speed, Time-to-First-Token (TTFT), and overall latency. The code is open source on GitHub. This article primarily introduces how to use the tool.

Project Repository: https://github.com/Yoosu-L/llmapibenchmark
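
As a rough mental model (these are the conventional definitions of the metrics, not taken verbatim from the tool's source), the numbers in the reports below correspond to:

Prompt Throughput     ≈ input tokens processed per second during prefill, summed across concurrent requests
Generation Throughput ≈ output tokens produced per second during decoding, summed across concurrent requests
TTFT                  = time from sending a request until its first output token arrives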

🚀 Quick Start

🐧 Linux

Download and extract the package

wget https://github.com/Yoosu-L/llmapibenchmark/releases/latest/download/llmapibenchmark_linux_amd64.tar.gz
tar -zxvf llmapibenchmark_linux_amd64.tar.gz

The following command primarily tests the maximum achievable decoding speed (generation throughput) in a short-input, long-output scenario. Concurrency defaults to the levels 1,2,4,8,16,32,64,128.

# Replace `base-url` with your API service URL (ending with /v1).
./llmapibenchmark_linux_amd64 --base-url https://your-api-endpoint.com/v1

Next, test the maximum prefill speed in a long-input, long-output scenario. You can adjust --num-words and the --concurrency levels as needed.

# Replace `base-url` with your API service URL (ending with /v1).
./llmapibenchmark_linux_amd64 --base-url https://your-api-endpoint.com/v1 --num-words 6000 --concurrency 1,2,4,8,16,32

🪟 Windows

Download the latest version from the releases page (https://github.com/Yoosu-L/llmapibenchmark/releases).
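
Alternatively, from a terminal on Windows 10 or later, something like the following should work (the asset name llmapibenchmark_windows_amd64.zip is an assumption; check the releases page for the actual file name):

# Asset name assumed; verify on the releases page. curl and tar ship with Windows 10+.
curl -LO https://github.com/Yoosu-L/llmapibenchmark/releases/latest/download/llmapibenchmark_windows_amd64.zip
tar -xf llmapibenchmark_windows_amd64.zip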

After extracting, you will get llmapibenchmark_windows_amd64.exe.

The following command primarily tests the maximum achievable decoding speed (generation throughput) in a short-input, long-output scenario. Concurrency defaults to the levels 1,2,4,8,16,32,64,128.

# Replace `base-url` with your API service URL (ending with /v1).
llmapibenchmark_windows_amd64.exe --base-url https://your-api-endpoint.com/v1

Next, test the maximum prefill speed in a long-input, long-output scenario. You can adjust --num-words and the --concurrency levels as needed.

# Replace `base-url` with your API service URL (ending with /v1).
llmapibenchmark_windows_amd64.exe --base-url https://your-api-endpoint.com/v1 --num-words 6000 --concurrency 1,2,4,8,16,32

📊 Example Output

⌨️ Real-time Terminal Output

################################################################################################################
                                          LLM API Throughput Benchmark
                                    https://github.com/Yoosu-L/llmapibenchmark
                                         Time:2024-12-03 03:11:48 UTC+0
################################################################################################################
Input Tokens: 45
Output Tokens: 512
Test Model: qwen2.5:0.5b
Latency: 0.00 ms

| Concurrency | Generation Throughput (tokens/s) |  Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|-------------|----------------------------------|-------------------------------|--------------|--------------|
|           1 |                            31.88 |                        976.60 |         0.05 |         0.05 |
|           2 |                            30.57 |                        565.40 |         0.07 |         0.16 |
|           4 |                            31.00 |                        717.96 |         0.11 |         0.25 |

📄 Markdown File

Input Tokens: 45
Output Tokens: 512
Test Model: Qwen2.5-7B-Instruct-AWQ
Latency: 2.20 ms
| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|-------------|----------------------------------|------------------------------|--------------|--------------|
|           1 |                            58.49 |                       846.81 |         0.05 |         0.05 |
|           2 |                           114.09 |                       989.94 |         0.08 |         0.09 |
|           4 |                           222.62 |                      1193.99 |         0.11 |         0.15 |
|           8 |                           414.35 |                      1479.76 |         0.11 |         0.24 |
|          16 |                           752.26 |                      1543.29 |         0.13 |         0.47 |
|          32 |                           653.94 |                      1625.07 |         0.14 |         0.89 |

📝 JSON (enable with --format json)

{
    "model_name": "Qwen2.5-7B-Instruct-AWQ",
    "input_tokens": 32,
    "output_tokens": 512,
    "latency": 1,
    "results": [
        {
            "concurrency": 1,
            "generation_speed": 118.18,
            "prompt_throughput": 42.69,
            "max_ttft": 0.61,
            "min_ttft": 0.61
        },
        {
            "concurrency": 2,
            "generation_speed": 214.37,
            "prompt_throughput": 48.64,
            "max_ttft": 1.07,
            "min_ttft": 0.42
        }
    ]
}
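
The JSON report is handy for scripting. For example, assuming it has been saved to results.json (a file name chosen for this example; jq is not part of the tool), jq can pull out the peak generation speed:

# Highest generation throughput across all tested concurrency levels.
jq '[.results[].generation_speed] | max' results.json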

🗒️ YAML (enable with --format yaml)

model-name: Qwen2.5-7B-Instruct-AWQ
input-tokens: 32
output-tokens: 512
latency: 1.6
results:
    - concurrency: 1
      generation-speed: 134.28
      prompt-throughput: 59.31
      max-ttft: 0.44
      min-ttft: 0.44
    - concurrency: 2
      generation-speed: 221.23
      prompt-throughput: 51.06
      max-ttft: 1.02
      min-ttft: 0.47
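
The YAML report can be queried the same way, e.g. with mikefarah's yq v4 (like jq above, yq and the file name results.yaml are illustrative assumptions, not part of the tool):

# Generation speed of the first result; the hyphenated key must be quoted.
yq '.results[0]."generation-speed"' results.yaml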

⚙️ Advanced Parameters

Linux:

./llmapibenchmark_linux_amd64 \
  --base-url https://your-api-endpoint.com/v1 \
  --api-key YOUR_API_KEY \
  --model gpt-3.5-turbo \
  --concurrency 1,2,4,8,16 \
  --max-tokens 512 \
  --num-words 513 \
  --prompt "Your custom prompt here" \
  --format json

Windows:

llmapibenchmark_windows_amd64.exe ^
  --base-url https://your-api-endpoint.com/v1 ^
  --api-key YOUR_API_KEY ^
  --model gpt-3.5-turbo ^
  --concurrency 1,2,4,8,16 ^
  --max-tokens 512 ^
  --num-words 513 ^
  --prompt "Your custom prompt here" ^
  --format json
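
All parameters except --base-url are optional. To keep the key out of your shell history, you can substitute it from an environment variable (LLM_API_KEY is just a name chosen for this example; the tool itself is not documented to read any environment variable):

# Pass the API key from an environment variable instead of typing it inline.
./llmapibenchmark_linux_amd64 --base-url https://your-api-endpoint.com/v1 --api-key "$LLM_API_KEY"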

📋 Parameter Description

| Parameter     | Short | Description                                | Default                                       | Required |
|---------------|-------|--------------------------------------------|-----------------------------------------------|----------|
| --base-url    | -u    | Base URL for LLM API endpoint              | Empty (must be specified)                     | Yes      |
| --api-key     | -k    | API authentication key                     | None                                          | No       |
| --model       | -m    | Specific AI model to test                  | Automatically discovers first available model | No       |
| --concurrency | -c    | Comma-separated concurrency levels to test | 1,2,4,8,16,32,64,128                          | No       |
| --max-tokens  | -t    | Maximum tokens to generate per request     | 512                                           | No       |
| --num-words   | -n    | Number of words for random input prompt    | 0                                             | No       |
| --prompt      | -p    | Text prompt for generating responses       | A long story                                  | No       |
| --format      | -f    | Output format (json, yaml)                 | ""                                            | No       |
| --help        | -h    | Show help message                          | false                                         | No       |
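
Every flag also has a short form, so a targeted run can be written compactly. For example, to benchmark one model at a single concurrency level and emit JSON (the model name is taken from the sample output above):

./llmapibenchmark_linux_amd64 -u https://your-api-endpoint.com/v1 -m qwen2.5:0.5b -c 8 -f json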