LLM API Performance Benchmark Tool: Usage Guide

Loouis, filed under AI

2025-02-13

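This article provides quick-start guides for Linux and Windows, with example commands for downloading, configuring, and running the tool.

The tool measures LLM API performance: prefill speed, decode speed, time to first token (TTFT), and latency. The code is open source on GitHub; this post covers how to use it.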

Project repository: https://github.com/Yoosu-L/llmapibenchmark
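
To make the metrics above concrete, here is a minimal sketch of how they could be derived from the timestamps of a single streaming request. It uses common definitions and is not taken from the tool's source code: the `RequestTiming` record and its field names are hypothetical, prefill time is approximated by TTFT, and decode speed is measured over the period after the first token arrives.

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    """Timestamps (in seconds) and token counts for one streaming request.

    Hypothetical record for illustration; not a structure from the tool.
    """
    sent_at: float          # when the request was sent
    first_token_at: float   # when the first streamed token arrived
    finished_at: float      # when the last streamed token arrived
    prompt_tokens: int      # number of input tokens
    completion_tokens: int  # number of generated tokens


def ttft(t: RequestTiming) -> float:
    """Time to first token, in seconds."""
    return t.first_token_at - t.sent_at


def prefill_speed(t: RequestTiming) -> float:
    """Prompt tokens per second, assuming prefill takes roughly the TTFT."""
    return t.prompt_tokens / ttft(t)


def decode_speed(t: RequestTiming) -> float:
    """Generated tokens per second during the decode phase (after the first token)."""
    return t.completion_tokens / (t.finished_at - t.first_token_at)
```

Under these assumptions, a request with 45 prompt tokens and a TTFT of 0.05 s has a prefill speed of about 900 tokens/s, which is the same order of magnitude as the prompt throughput shown in the example output later in this post.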

Quick Start

Linux

# Download the binary and make it executable
wget https://github.com/Yoosu-L/llmapibenchmark/releases/download/v1.0.2/llmapibenchmark_linux_amd64
chmod +x ./llmapibenchmark_linux_amd64

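# Replace base_url with the URL of your API service, ending in v1
# This run mainly measures the maximum total decode speed (generation throughput): short input, long output; by default concurrency steps from 1 up to 128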
./llmapibenchmark_linux_amd64 -base_url=https://your-api-endpoint.com/v1

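# Replace base_url with the URL of your API service, ending in v1
# This run mainly measures the maximum total prefill speed: long input, long output; adjust numWords and the concurrency levels as needed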
./llmapibenchmark_linux_amd64 -base_url=https://your-api-endpoint.com/v1 -numWords=6000 -concurrency=1,2,4,8,16,32

Windows

Download the latest version from the releases page.

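# Replace base_url with the URL of your API service, ending in v1
# This run mainly measures the maximum total decode speed (generation throughput): short input, long output; by default concurrency steps from 1 up to 128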
llmapibenchmark_windows_amd64.exe -base_url=https://your-api-endpoint.com/v1

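# Replace base_url with the URL of your API service, ending in v1
# This run mainly measures the maximum total prefill speed: long input, long output; adjust numWords and the concurrency levels as needed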
llmapibenchmark_windows_amd64.exe -base_url=https://your-api-endpoint.com/v1 -numWords=6000 -concurrency=1,2,4,8,16,32

Example Output

Real-time terminal output

################################################################################################################
                                          LLM API Throughput Benchmark
                                    https://github.com/Yoosu-L/llmapibenchmark
                                         Time：2024-12-03 03:11:48 UTC+0
################################################################################################################
Input Tokens: 45
Output Tokens: 512
Test Model: qwen2.5:0.5b
Latency: 0.00 ms

| Concurrency | Generation Throughput (tokens/s) |  Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|-------------|----------------------------------|-------------------------------|--------------|--------------|
|           1 |                            31.88 |                        976.60 |         0.05 |         0.05 |
|           2 |                            30.57 |                        565.40 |         0.07 |         0.16 |
|           4 |                            31.00 |                        717.96 |         0.11 |         0.25 |

Markdown file

Input Tokens: 45
Output Tokens: 512
Test Model: Qwen2.5-7B-Instruct-AWQ
Latency: 2.20 ms

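| Concurrency | Generation Throughput (tokens/s) | Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|-------------|----------------------------------|------------------------------|--------------|--------------|
|           1 |                            58.49 |                       846.81 |         0.05 |         0.05 |
|           2 |                           114.09 |                       989.94 |         0.08 |         0.09 |
|           4 |                           222.62 |                      1193.99 |         0.11 |         0.15 |
|           8 |                           414.35 |                      1479.76 |         0.11 |         0.24 |
|          16 |                           752.26 |                      1543.29 |         0.13 |         0.47 |
|          32 |                           653.94 |                      1625.07 |         0.14 |         0.89 |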
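
One way to read a table like this is to look at where total throughput peaks and how much each individual request gets at each concurrency level. The sketch below does this for the generation throughput column. The numbers are copied from the table above; the function names are hypothetical helpers and not part of the tool.

```python
# Generation throughput (tokens/s) per concurrency level, copied from the table above.
generation_throughput = {1: 58.49, 2: 114.09, 4: 222.62, 8: 414.35, 16: 752.26, 32: 653.94}


def per_request_throughput(results):
    """Average tokens/s available to each concurrent request."""
    return {c: total / c for c, total in results.items()}


def peak_concurrency(results):
    """Concurrency level with the highest total throughput."""
    return max(results, key=results.get)


def scaling_efficiency(results):
    """Total throughput relative to ideal linear scaling from the lowest level."""
    base_level = min(results)
    base_rate = results[base_level] / base_level
    return {c: total / (base_rate * c) for c, total in results.items()}


if __name__ == "__main__":
    print("peak at concurrency", peak_concurrency(generation_throughput))
    for level, rate in per_request_throughput(generation_throughput).items():
        print(f"{level:>3} concurrent: {rate:.2f} tokens/s per request")
```

For this data, total generation throughput is highest at a concurrency of 16 and falls at 32, so in this particular run the server appears to saturate somewhere between those two levels.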

Advanced Parameters

Linux:

./llmapibenchmark_linux_amd64 \
  -base_url=https://your-api-endpoint.com/v1 \
  -apikey=YOUR_API_KEY \
  -model=gpt-3.5-turbo \
  -concurrency=1,2,4,8,16 \
  -max_tokens=512 \
  -numWords=513 \
  -prompt="Your custom prompt here"

Windows:

llmapibenchmark_windows_amd64.exe ^
  -base_url=https://your-api-endpoint.com/v1 ^
  -apikey=YOUR_API_KEY ^
  -model=gpt-3.5-turbo ^
  -concurrency=1,2,4,8,16 ^
  -max_tokens=512 ^
  -numWords=513 ^
  -prompt="Your custom prompt here"

Parameter Reference

| Parameter | Description | Default | Required |
|-----------|-------------|---------|----------|
| `-base_url` | Base URL for LLM API endpoint | Empty (MUST be specified) | Yes |
| `-apikey` | API authentication key | None | No |
| `-model` | Specific AI model to test | Automatically discovers first available model | No |
| `-concurrency` | Comma-separated concurrency levels to test | `1,2,4,8,16,32,64,128` | No |
| `-max_tokens` | Maximum tokens to generate per request | `512` | No |
| `-numWords` | Number of words for input prompt | Not set (optional) | No |
| `-prompt` | Text prompt for generating responses | `"Write a long story, no less than 10,000 words, starting from a long, long time ago."` | No |
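
If you run the benchmark from a script, for example to sweep several endpoints or prompt lengths, it can help to assemble the command line from the parameters above. The following is a minimal sketch: `build_command` and its Python argument names are hypothetical, while the flags themselves are the ones listed in the table. Options left as `None` are omitted so the tool's defaults apply.

```python
def build_command(binary, base_url, apikey=None, model=None, concurrency=None,
                  max_tokens=None, num_words=None, prompt=None):
    """Build an argument list for the benchmark binary.

    Only -base_url is required; every other option is skipped when None.
    """
    if not base_url:
        raise ValueError("-base_url must be specified")
    args = [binary, f"-base_url={base_url}"]
    if apikey is not None:
        args.append(f"-apikey={apikey}")
    if model is not None:
        args.append(f"-model={model}")
    if concurrency:
        # The tool expects comma-separated levels, e.g. 1,2,4,8
        args.append("-concurrency=" + ",".join(str(c) for c in concurrency))
    if max_tokens is not None:
        args.append(f"-max_tokens={max_tokens}")
    if num_words is not None:
        args.append(f"-numWords={num_words}")
    if prompt is not None:
        args.append(f"-prompt={prompt}")
    return args
```

The resulting list can be passed to `subprocess.run(args, check=True)`. Because the arguments are passed as a list rather than a shell string, a custom prompt that contains spaces needs no extra quoting.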
