在 vps 上运行大语言模型

2024-12-03 约 844 字预计阅读 2 分钟

本文记录如何在 1C1G 的 VPS 上，使用 Ollama 运行 LLM 。

黑五购买了一台 1C1G 的 AMD Ryzen 9 7950X VPS，勉强可以玩玩 LLM 。记录下如何在这样的 VPS 上快速安装和运行 LLM。

配置

硬件配置

---------------------基础信息查询--感谢所有开源项目---------------------
 CPU 型号          : AMD Ryzen 9 7950X 16-Core Processor
 CPU 核心数        : 1
 CPU 频率          : 4491.540 MHz
 CPU 缓存          : L1: 64.00 KB / L2: 512.00 KB / L3: 16.00 MB
 AES-NI指令集      : ✔ Enabled
 VM-x/AMD-V支持    : ✔ Enabled
 内存              : 90.74 MiB / 960.70 MiB
 Swap              : 0 KiB / 2.00 MiB
 硬盘空间          : 1.12 GiB / 14.66 GiB
----------------------CPU测试--通过sysbench测试-------------------------
 -> CPU 测试中 (Fast Mode, 1-Pass @ 5sec)
 1 线程测试(单核)得分:          6402 Scores
---------------------内存测试--感谢lemonbench开源-----------------------
 -> 内存测试 Test (Fast Mode, 1-Pass @ 5sec)
 单线程读测试:          75694.60 MB/s
 单线程写测试:          42458.49 MB/s

软件配置

选择推理引擎：纯cpu推理，选择使用 Ollama 作为推理引擎。
选择模型： Qwen2.5-0.5b 模型 Q4 量化版本，模型大小不到 400MB，适合 1GB 内存。

Ollama

安装和运行模型

curl -fsSL https://ollama.com/install.sh | sh

ollama run qwen2.5:0.5b

进行对话

>>> hello, who are you?
I am Qwen, an AI language model developed by Alibaba Cloud. I was trained using millions of natural language processing (NLP) examples from the internet and my responses are generated through advanced neural network algorithms. My primary goal is to assist with tasks such as text generation, summarization, answering questions, and more. If you have any questions or need further clarification on a topic, feel free to ask!

要退出对话，请输入 /bye。

>>> /bye

性能测试

下载测试脚本

wget https://github.com/Yoosu-L/llmapibenchmark/releases/download/v1.0.1/llmapibenchmark_linux_amd64

设置脚本权限
```
chmod +x ./llmapibenchmark_linux_amd64
```

运行性能测试

./llmapibenchmark_linux_amd64 -base_url="http://127.0.0.1:11434/v1" -concurrency=1,2,4 #optional

输出示例

################################################################################################################
                                          LLM API Throughput Benchmark
                                    https://github.com/Yoosu-L/llmapibenchmark
                                         Time：2024-12-03 03:11:48 UTC+0
################################################################################################################
Input Tokens: 45
Output Tokens: 512
Test Model: qwen2.5:0.5b
Latency: 0.00 ms

| Concurrency | Generation Throughput (tokens/s) |  Prompt Throughput (tokens/s) | Min TTFT (s) | Max TTFT (s) |
|-------------|----------------------------------|-------------------------------|--------------|--------------|
|           1 |                            31.88 |                        976.60 |         0.05 |         0.05 |
|           2 |                            30.57 |                        565.40 |         0.07 |         0.16 |
|           4 |                            31.00 |                        717.96 |         0.11 |         0.25 |

卸载Ollama（如果你不需要了）

# 停止Ollama服务：
sudo systemctl stop ollama

# 禁用Ollama服务：
sudo systemctl disable ollama

# 删除Ollama服务文件：
sudo rm /etc/systemd/system/ollama.service

# 删除Ollama二进制文件：
sudo rm /usr/local/bin/ollama
# sudo rm /usr/bin/ollama
# sudo rm /bin/ollama

声明

本教程仅供娱乐，0.5b 的 LLM 难以达到生产要求，并且在推理时会造成大量 CPU 及内存带宽占用，影响邻居体验~~被删鸡~~。

目录