OMLX Interview Question Bank

Category: LLM
Questions: 30

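1 What is OMLX's core positioning? How does its design reflect compatibility with native Ollama?

Answer:

OMLX is a high-performance, Ollama-compatible inference platform positioned as an enterprise-grade alternative to native Ollama. Its core design idea is to stay fully compatible with the Ollama client API while adding multi-node distributed deployment, multiple inference backends, GPU resource pooling, and enterprise operations capabilities.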

兼容性设计原则

设计原则	实现方式	价值
API 透传兼容	Server 组件完整实现 Ollama REST API 所有端点	现有 Ollama 客户端无需任何修改即可接入
Modelfile 兼容	完整支持 Ollama Modelfile 语法和模型构建流程	已有 Modelfile 可直接复用
模型格式兼容	支持 GGUF 格式，兼容 Ollama 模型仓库	已有 GGUF 模型无需转换
客户端 SDK 兼容	兼容 Ollama Python / JavaScript SDK	应用层代码零改动迁移

OMLX 与原生 Ollama 的分层差异

graph LR
    subgraph OLLAMA["原生 Ollama（单节点）"]
        OC["Ollama Client"]
        OAPI["Ollama API Server<br/>+ Runner<br/>+ Model Manager"]
        OBACK["llama.cpp Backend"]
        OGPU["单 GPU / CPU"]
        OC --> OAPI --> OBACK --> OGPU
    end

    subgraph OMLX["OMLX（企业级平台）"]
        MC["Ollama Client / SDK"]
        MS["OMLX Server（API 兼容层）"]
        MG["OMLX Gateway（路由 / 负载均衡）"]
        MCTRL["Controller（调度 / 编排）"]
        MW["Worker x N（多后端推理节点）"]
        MBACK1["llama.cpp"]
        MBACK2["vLLM"]
        MBACK3["SGLang"]
        MBACK4["TensorRT-LLM"]
        MGPU["多 GPU / 多节点 / K8s 集群"]
        MC --> MS --> MG --> MCTRL --> MW
        MW --> MBACK1
        MW --> MBACK2
        MW --> MBACK3
        MW --> MBACK4
        MW --> MGPU
    end

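OMLX's main strength is that it extends Ollama's ease of use from single-node desktop setups to multi-node enterprise production, while keeping the API fully compatible and avoiding lock-in to any particular inference backend.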

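2 What are the architectural components of OMLX? What are the responsibilities of the Server, Gateway, Worker, and Controller?

Answer:

OMLX uses a four-layer architecture with separated responsibilities. The components communicate over gRPC or HTTP and can be scaled independently.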

架构组件职责矩阵

组件	职责	暴露端口	依赖组件
Server	暴露 Ollama 兼容 REST API，处理客户端请求接入、响应流式输出	11434（HTTP）	Gateway
Gateway	请求路由、负载均衡、会话保持、速率限制、模型到 Worker 的映射	11435（gRPC）	Controller
Controller	集群状态管理、Worker 注册与健康检查、模型调度决策、配置下发	11436（gRPC）	—
Worker	模型加载、推理执行、GPU 资源管理、显存管理	11437（gRPC）	—

组件交互流程

graph TD
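    Client["Client Request<br/>(POST /api/generate)"]
    Server["Server (API compatibility layer)<br/>Parse request → extract model / prompt / parameters<br/>Validate API key → look up the Gateway routing table"]
    Gateway["Gateway (routing layer)<br/>Find the Workers available for the model name<br/>Pick one Worker by load balancing<br/>Session affinity: route the same session to the same Worker"]
    Controller["Controller (scheduling layer)<br/>Track Worker health<br/>Decide model load / unload<br/>Push routing decisions to the Gateway"]
    Worker["Worker (inference layer)<br/>Check whether the model is already in VRAM<br/>If not: pull it from the model registry and load it<br/>Run inference → return the token stream<br/>Manage VRAM: Keep-Alive timer / Idle Unload"]
    Response["Worker → Gateway → Server → Client<br/>(streamed response; native Ollama streams newline-delimited JSON)"]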

    Client --> Server
    Server --> Gateway
    Gateway --> Controller
    Controller --> Worker
    Worker --> Response

部署模式

模式	适用场景	组件分布
All-in-One	单节点开发 / 测试	所有组件打包在同一进程
Distributed	生产环境多节点	各组件独立部署，通过配置发现对方
K8s Native	Kubernetes 集群	每组组件作为独立 Deployment / StatefulSet

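3 How compatible is OMLX with the native Ollama API? What are the compatibility details for Generate / Chat / Embeddings / Pull / Push / List?

Answer:

OMLX implements the Ollama API at the HTTP level and covers the core endpoints. Request and response formats match native Ollama, except where an endpoint is marked as extended.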

API 兼容性清单

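Endpoint	HTTP Method	Native Ollama	OMLX	Notes
`/api/generate`	POST	Supported	Fully compatible	Text generation (completion), streaming and non-streaming
`/api/chat`	POST	Supported	Fully compatible	Chat completion over a `messages` array (Ollama's OpenAI-format route is `/v1/chat/completions`)
`/api/embeddings`	POST	Supported	Fully compatible	Text embeddings
`/api/pull`	POST	Supported	Fully compatible	Pull a model; OCI registry compatible
`/api/push`	POST	Supported	Fully compatible	Push a model to a registry
`/api/list`	GET	Not a native endpoint	Supported	Lists local models; native Ollama serves the model list from `GET /api/tags`
`/api/show`	POST	Supported	Fully compatible	Show model details (Modelfile / parameters)
`/api/copy`	POST	Supported	Fully compatible	Copy a model
`/api/delete`	DELETE	Supported	Fully compatible	Delete a model
`/api/tags`	GET	Supported	Fully compatible	Lists local models and their tags
`/api/version`	GET	Supported	Fully compatible	Returns version information
`/api/ps`	GET	Supported	Extended	OMLX additionally returns Worker placement information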

请求 / 响应兼容性示例

# Generate 请求（完全兼容 Ollama 客户端）
curl http://omlx-server:11434/api/generate -d '{
  "model": "qwen2.5:7b",
  "prompt": "Why is the sky blue?",
  "stream": false,
  "options": {
    "temperature": 0.7,
    "num_predict": 256
  }
}'

# Chat 请求
curl http://omlx-server:11434/api/chat -d '{
  "model": "qwen2.5:7b",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"}
  ],
  "stream": true
}'

# Embedding 请求
curl http://omlx-server:11434/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": "Hello world"
}'

OMLX 扩展参数（兼容前提下增强）

OMLX adds the following optional parameters to the options field:

扩展参数	类型	默认值	说明
`omlx_worker_label`	string	自动	指定 Worker 标签路由
`omlx_backend`	string	llama.cpp	指定推理后端（llama.cpp / vllm / sglang）
`omlx_priority`	int	0	请求优先级（0-10）
`omlx_timeout`	int	300	推理超时（秒）

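These extension parameters do not affect compatibility with native Ollama clients: a request that omits them is handled with Ollama's default behavior.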
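The optional nature of these parameters can be sketched as follows. This is a minimal illustration, not OMLX client code: the helper `build_generate_payload` is hypothetical, and only the four parameter names come from the table above.

```python
import json

# Parameter names taken from the extension table; everything else is illustrative.
OMLX_KEYS = {"omlx_worker_label", "omlx_backend", "omlx_priority", "omlx_timeout"}


def build_generate_payload(model, prompt, options=None, **omlx_hints):
    """Build a /api/generate body, merging optional OMLX hints into "options"."""
    unknown = set(omlx_hints) - OMLX_KEYS
    if unknown:
        raise ValueError(f"unknown OMLX parameters: {sorted(unknown)}")
    merged = dict(options or {})
    merged.update(omlx_hints)
    payload = {"model": model, "prompt": prompt, "stream": False}
    if merged:
        payload["options"] = merged
    return payload


# Without hints, the body is what an unmodified Ollama client would send.
plain = build_generate_payload("qwen2.5:7b", "Why is the sky blue?", {"temperature": 0.7})

# With hints, only the "options" object grows.
routed = build_generate_payload(
    "qwen2.5:7b",
    "Why is the sky blue?",
    {"temperature": 0.7},
    omlx_backend="vllm",
    omlx_priority=5,
)
print(json.dumps(routed, indent=2))
```

Keeping the hints inside `options` means the top-level request shape never changes, which is what lets existing clients keep working.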

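4 How does OMLX implement concurrent serving of multiple models?

Answer:

OMLX serves multiple models concurrently through a per-Worker model pool. Each Worker node maintains its own set of loaded models, the Controller coordinates the global model-to-Worker mapping, and the Gateway routes traffic to the right Worker based on the model parameter in the request.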

多模型并发架构

graph TD
    GW["Gateway<br/>Request router<br/>Model -> Worker mapping table"]
    subgraph WA["Worker A"]
        QW1["qwen2.5:7b (6 GB VRAM)"]
        NE1["nomic-embed (2 GB VRAM)"]
    end
    subgraph WB["Worker B"]
        LL["llama3:8b (5 GB VRAM)"]
        CL["codellama:7b (5 GB VRAM)"]
    end
    GW --> WA
    GW --> WB

并发控制策略

策略维度	机制	说明
模型加载	Worker 启动时预加载指定模型列表，或按需懒加载	首次请求路由至对应 Worker 时触发模型加载
并发推理	单个 Worker 内支持多请求共享同一模型实例	利用批处理（Dynamic Batching）提升 GPU 利用率
显存管理	当 Worker 显存不足以加载新模型时，自动卸载最近最少使用（LRU）的模型	Keep-Alive 参数控制卸载延迟
请求排队	Gateway 维护每个模型的请求队列，超过 Worker 并发上限时排队等待	可配置队列最大长度和超时
亲和性路由	同一 Session 的请求路由至同一 Worker	复用 KV Cache，减少重复计算

模型路由决策流程

graph TD
    Gateway["Gateway receives /api/chat model=qwen2.5:7b"]
    Query["Look up the Model → Worker mapping table"]
    List["qwen2.5:7b → [Worker-A, Worker-B]"]
    LB["Load balancer selects Worker-A"]
    Check["Worker-A checks whether qwen2.5:7b is already in VRAM"]
    Loaded["Loaded: run inference directly and reset the Keep-Alive timer"]
    NotLoaded{"Enough free VRAM?"}
    VRAM_OK["Load model → run inference"]
    VRAM_FULL["Unload the least recently used model (LRU) → load → run inference"]

    Gateway --> Query
    Query --> List
    List --> LB
    LB --> Check
    Check -->|"loaded"| Loaded
    Check -->|"not loaded"| NotLoaded
    NotLoaded -->|"yes"| VRAM_OK
    NotLoaded -->|"no"| VRAM_FULL
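The load-or-evict step at the end of this flow can be sketched as a small LRU pool. This is a simplified illustration under stated assumptions: `ModelPool` and its methods are hypothetical names, sizes are the figures from the diagram above, and real VRAM accounting would also include KV cache.

```python
from collections import OrderedDict


class ModelPool:
    """Toy per-Worker model pool with least-recently-used eviction."""

    def __init__(self, vram_gb):
        self.vram_gb = vram_gb
        self.loaded = OrderedDict()  # model name -> size in GB, oldest first

    def used_gb(self):
        return sum(self.loaded.values())

    def acquire(self, name, size_gb):
        """Ensure `name` is loaded; return the models evicted to make room."""
        if name in self.loaded:
            self.loaded.move_to_end(name)  # mark as most recently used
            return []
        if size_gb > self.vram_gb:
            raise MemoryError(f"{name} does not fit in {self.vram_gb} GB")
        evicted = []
        while self.used_gb() + size_gb > self.vram_gb:
            victim, _ = self.loaded.popitem(last=False)  # drop the oldest entry
            evicted.append(victim)
        self.loaded[name] = size_gb
        return evicted


pool = ModelPool(vram_gb=12)
pool.acquire("qwen2.5:7b", 6)
pool.acquire("nomic-embed", 2)
pool.acquire("qwen2.5:7b", 6)           # a new request refreshes qwen2.5:7b
evicted = pool.acquire("llama3:8b", 5)  # does not fit until nomic-embed is evicted
print(evicted, list(pool.loaded))
```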

生产部署建议

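Load only a small number of frequently used models on each Worker node (for example, two to four 7B-class models), so that frequent unloading does not inflate time to first token (TTFT). For low-traffic models, reduce cold starts by preloading them and extending their Keep-Alive.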

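5 How does OMLX manage models? How do Pull / Push / List / Copy / Delete relate to OCI compatibility?

Answer:

OMLX model management is compatible with Ollama's OCI (Open Container Initiative) distribution scheme. Models are stored as OCI artifacts in an OCI-compatible registry, and the full lifecycle is supported: Pull, Push, List, Copy, and Delete.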

模型管理命令对照

操作	Ollama CLI	OMLX API	OCI 兼容
拉取模型	`ollama pull qwen2.5:7b`	`POST /api/pull`	兼容 OCI Distribution Spec
推送模型	`ollama push`	`POST /api/push`	兼容 OCI Push 流程
列出模型	`ollama list`	`GET /api/list`	本地索引查询
复制模型	`ollama cp src dst`	`POST /api/copy`	仅操作本地索引
删除模型	`ollama rm qwen2.5:7b`	`DELETE /api/delete`	删除本地文件 + 索引

OCI 兼容分发架构

OMLX 模型管理
├── 模型存储层
│   ├── 本地存储：/var/lib/omlx/models/
│   │   ├── blobs/         # 模型层文件（GGUF / safetensors）
│   │   └── manifests/     # OCI 清单
│   └── 远程存储：OCI Registry（Harbor / Docker Hub / GHCR）
│       ├── 模型层（Layer）
│       └── 模型配置（Config）
├── 分发协议
│   ├── OCI Distribution Spec v1.1
│   ├── 支持 Registry 认证（Bearer Token / Basic Auth）
│   └── 支持分层拉取（断点续传 / 并行下载）
└── 模型索引
    ├── 模型名 → 标签 → Digest 映射
    └── GGUF 文件 → 参数映射（量化级别 / 上下文长度）

Pull 操作流程

graph TD
    Client["Client → POST /api/pull model: qwen2.5:7b"]
    Server["OMLX Server → 解析模型名 → 查询 Registry"]
    Manifest["Registry 返回 Manifest（Layer Digest 列表）"]
    Cache{"OMLX Server → 检查本地缓存（blobs 目录）"}
    Cached["已缓存：跳过下载"]
    NotCached["未缓存：并行下载 Layer → 校验 SHA256 Digest"]
    Extract["解压 / 放置 GGUF 文件到模型目录"]
    Done["更新本地模型索引 → 返回成功"]

    Client --> Server
    Server --> Manifest
    Manifest --> Cache
    Cache -->|"已缓存"| Cached
    Cache -->|"未缓存"| NotCached
    Cached --> Extract
    NotCached --> Extract
    Extract --> Done
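The cache check and digest verification in the Pull flow can be sketched as below. This is an in-memory illustration only: the function names are hypothetical, the "registry" is just a pair of byte strings, and a real implementation would stream layers to the blobs directory.

```python
import hashlib


def digest_of(data: bytes) -> str:
    """Content digest in the 'sha256:<hex>' form used by OCI manifests."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def layers_to_download(manifest_layers, blobs):
    """Return the layer digests from the manifest that are not cached yet."""
    return [digest for digest in manifest_layers if digest not in blobs]


def store_layer(blobs, expected_digest, data):
    """Verify a downloaded layer against its digest before caching it."""
    actual = digest_of(data)
    if actual != expected_digest:
        raise ValueError(f"digest mismatch: expected {expected_digest}, got {actual}")
    blobs[expected_digest] = data


# Stand-ins for a model's layers and the local blob cache.
weights = b"fake gguf weights"
template = b"fake chat template"
manifest = [digest_of(weights), digest_of(template)]
blobs = {digest_of(template): template}  # the template layer is already cached

missing = layers_to_download(manifest, blobs)
for digest in missing:
    store_layer(blobs, digest, weights)
print(f"downloaded {len(missing)} layer(s), cache now holds {len(blobs)}")
```

Because layers are addressed by content digest, two models that share a layer store it once, and a corrupted download is rejected before it reaches the cache.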

模型版本管理

{registry}/{namespace}/{model}:{tag}
├── library/qwen2.5:7b          → 默认 7B-Q4_K_M 量化
├── library/qwen2.5:7b-q8_0     → Q8_0 量化版本
├── library/qwen2.5:7b-fp16     → FP16 全精度
└── library/qwen2.5:7b-instruct → 指令微调版本

OMLX extends Ollama's model management: a Pull request can name the registry explicitly (POST /api/pull {"model": "harbor.example.com/models/qwen2.5:7b"}), and credentials can be passed when pulling from a private registry.

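6 What GGUF format support and quantization options does OMLX offer?

Answer:

OMLX natively supports GGUF (GPT-Generated Unified Format) as its standard format for model storage and distribution, covering the full precision range from FP16 down to very low-bit quantization.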

GGUF 格式特性

特性	说明
单文件分发	模型权重 + 分词器 + 元数据打包为单一 .gguf 文件
惰性加载	支持 mmap 内存映射，无需将全部权重读入内存即可推理
类型感知	内置张量类型信息，自动选择合适的计算精度
元数据丰富	包含架构名称、上下文长度、分词器类型、对话模板等关键信息

量化选项矩阵

量化级别	每参数位宽	7B 模型大小（约）	质量影响	适用场景
FP16	16 bit	~14 GB	无损失	基准测试 / 精度敏感任务
Q8_0	8 bit	~7 GB	极低	高精度推理 / 企业应用
Q6_K	6 bit	~5.5 GB	极低	GPU 显存受限场景
Q5_K_M	5 bit	~5 GB	低	平衡质量与速度
Q4_K_M	4 bit	~4.5 GB	中低	单 GPU 部署优选
Q4_0	4 bit	~4.2 GB	中	消费级 GPU / 批量推理
Q3_K_M	3 bit	~3.5 GB	中高	边缘设备
Q2_K	2 bit	~2.8 GB	高	仅测试用 / CPU 推理
IQ4_XS	4 bit（重要性矩阵）	~4.5 GB	低于普通 Q4	追求极致质量的下位替代

K-quant 变体说明

变体	含义	特点
_L	Large	更多 key-value 权重保留较高精度
_M	Medium	均衡方案
_S	Small	更激进压缩

OMLX 中指定量化版本

# Pull 指定量化版本
curl http://omlx:11434/api/pull -d '{"model": "qwen2.5:7b-q4_k_m"}'

# 或在 Modelfile 中指定
FROM qwen2.5:7b-q8_0

# OMLX 扩展：列出模型的可用量化版本
curl http://omlx:11434/api/show -d '{"model": "qwen2.5:7b", "verbose": true}'

OMLX 量化选项自动推荐

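OMLX can select a quantization level automatically from the VRAM available on a Worker. When a model is pulled without a quantization tag, the Controller recommends a level based on the target Worker's GPU model and free VRAM.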
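One way to picture this recommendation is a first-fit search over the approximate 7B sizes from the quantization table. The sketch below is an assumption about how such a rule could look, not OMLX's actual algorithm; `recommend_quant` and the 1.5 GB headroom for KV cache and runtime overhead are illustrative choices.

```python
# Approximate 7B model sizes in GB, ordered from highest to lowest precision.
QUANT_SIZES_GB = [
    ("FP16", 14.0),
    ("Q8_0", 7.0),
    ("Q6_K", 5.5),
    ("Q5_K_M", 5.0),
    ("Q4_K_M", 4.5),
    ("Q3_K_M", 3.5),
    ("Q2_K", 2.8),
]


def recommend_quant(free_vram_gb, headroom_gb=1.5):
    """Return the highest-precision level whose weights plus headroom fit."""
    for name, size_gb in QUANT_SIZES_GB:
        if size_gb + headroom_gb <= free_vram_gb:
            return name
    return None  # nothing fits; the caller has to pick another Worker


for free in (24, 8, 6, 3):
    print(f"{free} GB free -> {recommend_quant(free)}")
```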

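7 How do model caching and storage management work in OMLX?

Answer:

OMLX manages model caching and storage with a tiered cache. Models move through their lifecycle across the local file system and, optionally, distributed storage. The main goals are to cut model pull latency and to avoid wasting VRAM.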

存储分层架构

graph TD
    Remote["远程层（OCI Registry）<br/>模型层文件（GGUF / safetensors）<br/>OCI Manifest"]
    Shared["共享缓存层（可选，如 MinIO / S3 / CephFS）<br/>集群内共享模型文件<br/>避免每个 Worker 重复下载相同模型"]
    Local["本地缓存层（Worker 节点本地）<br/>blobs：OCI Layer 缓存<br/>models：已解压模型<br/>metadata：模型元数据"]
    VRAM["显存层（GPU VRAM）<br/>已加载模型（活跃服务中）<br/>Keep-Alive 计时器中"]

    Remote --> Shared
    Shared --> Local
    Local --> VRAM

缓存策略

策略	配置参数	默认值	说明
本地缓存上限	`cache_disk_limit`	100GB	超过上限时自动清理最久未使用文件
OCI Layer 保留	`cache_ttl`	7d	OCI 层文件过期后删除
预缓存模型	`precache_models`	[]	启动时主动拉取的模型列表
共享存储路径	`shared_cache_path`	—	分布式文件系统挂载点
缓存预热	`cache_warmup`	true	Worker 启动时预加载常用模型到显存

存储清理命令

# 列出模型占用空间
ollama list --omlx-stats
# 输出示例：
# qwen2.5:7b-q4_k_m    4.5 GB  (磁盘)  +  5.8 GB  (显存)  Worker: node-gpu-01

# 清理未使用模型（仅删除磁盘缓存，不影响显存中的活跃模型）
curl -X DELETE http://omlx:11434/api/delete -d '{"model": "unused-model"}'

# OMLX 扩展：批量清理过期缓存
curl -X POST http://omlx:11434/api/omlx/cache/prune -d '{"older_than": "7d"}'

多 Worker 缓存同步

Worker-A 首次加载 qwen2.5:7b
  → 检查本地缓存：未命中
  → 检查共享缓存（MinIO）：命中
  → 从共享缓存读取 GGUF 文件
  → 加载到显存
  → 注册到 Controller

Worker-B 请求相同模型
  → 检查本地缓存：未命中
  → 检查共享缓存（MinIO）：命中
  → 无需重复 Pull，直接加载
  → 加载到显存

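8 How does OMLX implement GPU acceleration and multi-GPU inference?

Answer:

OMLX supports multi-GPU inference within a node (data parallel / tensor parallel) as well as distributed inference across nodes. The available multi-GPU strategies depend on the inference backend.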

多 GPU 策略对比

策略	原理	适用后端	GPU 间通信	适用场景
Data Parallel（数据并行）	每张 GPU 加载完整模型副本，请求分发至不同 GPU	llama.cpp / vLLM	无需通信	吞吐优先，多并发请求
Tensor Parallel（张量并行）	模型层按张量维度切分到多张 GPU	vLLM / SGLang / TensorRT-LLM	NVLink / PCIe	单模型超出单卡显存
Pipeline Parallel（流水线并行）	模型按层分段，不同 GPU 负责不同层	vLLM / SGLang	NVLink / PCIe	超大规模模型
Expert Parallel（专家并行）	MoE 模型中不同 Expert 分布到不同 GPU	vLLM / SGLang	NVLink	MoE 模型（如 Mixtral）

OMLX 多 GPU 配置

# Worker 配置
gpu:
  devices: [0, 1, 2, 3]         # 使用的 GPU 设备列表
  tensor_parallel_size: 2        # 张量并行度（每 2 张 GPU 一组）
  pipeline_parallel_size: 1      # 流水线并行度
  data_parallel_size: 2          # 数据并行副本数（2 组张量并行组）
  # With 4 GPUs: TP=2 × PP=1 × DP=2 → 2 independent model replicas, each spanning 2 GPUs
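The arithmetic behind this configuration is that TP × PP × DP must equal the number of devices. A small sketch of that check follows; `replica_groups` is a hypothetical helper, and assigning consecutive device IDs to one replica is an assumption made for illustration.

```python
def replica_groups(devices, tp, pp, dp):
    """Split GPU IDs into `dp` replicas, each using tp * pp GPUs."""
    per_replica = tp * pp
    if per_replica * dp != len(devices):
        raise ValueError(
            f"tp*pp*dp = {per_replica * dp}, but {len(devices)} devices were given"
        )
    return [devices[i * per_replica:(i + 1) * per_replica] for i in range(dp)]


# Values from the Worker configuration above.
groups = replica_groups([0, 1, 2, 3], tp=2, pp=1, dp=2)
print(groups)
```

Each inner list is one model replica; requests are spread across replicas, while the GPUs inside a replica cooperate on every request.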

数据并行：高并发吞吐

Worker（4 × GPU）
├── GPU 0：qwen2.5:7b 完整副本 → 处理 Req-1, Req-4
├── GPU 1：qwen2.5:7b 完整副本 → 处理 Req-2, Req-5
├── GPU 2：qwen2.5:7b 完整副本 → 处理 Req-3, Req-6
└── GPU 3：qwen2.5:7b 完整副本 → 处理 Req-7

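Tensor parallel: for models that do not fit in a single GPU's memory

Worker (2 × GPU, TP=2)
├── GPU 0: qwen2.5:72b, all 80 layers, first half of each layer's weight shards
└── GPU 1: qwen2.5:72b, all 80 layers, second half of each layer's weight shards

A single request uses GPU 0 and GPU 1 at the same time
Per layer: both GPUs compute partial results → all-reduce over NVLink/PCIe → next layer
(Assigning whole layers to different GPUs, e.g. layers 0-39 and 40-79, is pipeline parallel, not tensor parallel.)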

GPU 亲和性配置

worker:
  gpu_selector:
    strategy: "topology"         # 就近 NUMA 节点匹配
    prefer_nvlink: true          # 优先选择 NVLink 互联的 GPU 对
    max_gpu_per_model: 4         # 单模型最大使用 GPU 数量

多 GPU 推理性能基准参考

模型	GPU 配置	策略	Throughput（tokens/s）	Latency（TTFT）
qwen2.5:7b	1×A10	单卡	~120	~200ms
qwen2.5:7b	4×A10	DP=4	~450	~220ms
qwen2.5:72b	2×A100-80G	TP=2	~45	~800ms
llama3:70b	4×A100-80G	TP=4	~55	~600ms

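9 How do VRAM management and model unloading (Keep-Alive / Idle Unload) work in OMLX?

Answer:

OMLX manages the lifecycle of models in GPU memory automatically with a Keep-Alive timer and an Idle Unload policy. This balances response speed (no repeated loading) against VRAM utilization (idle models are released).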

显存生命周期

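Model loaded into VRAM
  │
  ├── Active: at least one request is being served
  │     └── the Keep-Alive timer is reset whenever a request completes
  │
  ├── Keep-Alive: the timer runs from the completion of the most recent request
  │     ├── timer not expired: the model stays in VRAM and can serve new requests immediately
  │     └── timer expired → Idle
  │
  └── Idle: Keep-Alive has expired, but the model has not been unloaded yet
        ├── a new request arrives → no reload needed, so there is no cold-start delay
        └── VRAM is needed for another model → this model is unloaded and its VRAM freed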

Keep-Alive 配置

配置项	默认值	说明
`keep_alive`	5m	模型在最后一次请求完成后在显存中保留的时长
`keep_alive_per_model`	—	按模型粒度覆盖全局 Keep-Alive
`max_loaded_models`	0（不限制）	单 Worker 最大同时加载模型数
`vram_limit`	GPU 全部显存	Worker 使用的显存上限
`idle_unload_threshold`	0.85	显存占用超过此比例时主动卸载 Idle 模型

显存管理决策树

graph TD
    Request["新模型加载请求到达 Worker"]
    Calc["计算所需显存 = 模型权重 + KV Cache 预分配"]
    Check{"当前可用显存 >= 所需显存？"}
    Load["直接加载"]
    NeedFree{"需要释放显存"}
    HasIdle{"有 Idle 模型？"}
    LRU["LRU 卸载 Idle 模型 → 加载新模型"]
    OOM["返回错误（GPU OOM）"]
    HasKA{"有 Keep-Alive 模型且紧急（priority=10）？"}
    UnloadKA["卸载 Keep-Alive 模型 → 加载新模型"]

    Request --> Calc
    Calc --> Check
    Check -->|"是"| Load
    Check -->|"否"| NeedFree
    NeedFree --> HasIdle
    HasIdle -->|"是"| LRU
    HasIdle -->|"否"| HasKA
    HasKA -->|"是"| UnloadKA
    HasKA -->|"否"| OOM
    LRU --> Load
    UnloadKA --> Load
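The decision tree can be sketched as a single function. This is an illustrative reading of the tree, not OMLX source: `plan_load` is a hypothetical name, models are plain dicts, and the sketch evicts as many models as needed in least-recently-used order, which the diagram does not spell out.

```python
def plan_load(required_gb, free_gb, loaded, priority=0):
    """Return ("load", evicted_names) or ("oom", []) for a new model.

    `loaded` is a list of dicts with keys: name, vram_gb, state, last_used.
    Active models are never evicted; keep-alive models only for priority 10.
    """
    if free_gb >= required_gb:
        return "load", []
    evictable_states = ["idle"]
    if priority == 10:
        evictable_states.append("keep-alive")
    evicted = []
    for state in evictable_states:
        # Oldest last_used first, i.e. least recently used.
        candidates = sorted(
            (m for m in loaded if m["state"] == state),
            key=lambda m: m["last_used"],
        )
        for model in candidates:
            free_gb += model["vram_gb"]
            evicted.append(model["name"])
            if free_gb >= required_gb:
                return "load", evicted
    return "oom", []


loaded = [
    {"name": "qwen2.5:72b", "vram_gb": 45.5, "state": "active", "last_used": 30},
    {"name": "qwen2.5:7b", "vram_gb": 5.8, "state": "keep-alive", "last_used": 20},
    {"name": "nomic-embed", "vram_gb": 1.5, "state": "idle", "last_used": 10},
]
print(plan_load(4.0, 3.0, loaded))
print(plan_load(8.0, 3.0, loaded))
print(plan_load(8.0, 3.0, loaded, priority=10))
```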

API 参数控制

# Set Keep-Alive per request (keep the model loaded for 30 minutes after the request completes)
curl http://omlx:11434/api/generate -d '{
  "model": "qwen2.5:7b",
  "prompt": "Hello",
  "keep_alive": "30m"
}'

# OMLX 扩展：手动卸载模型
curl -X POST http://omlx:11434/api/omlx/unload -d '{"model": "qwen2.5:7b"}'

# OMLX 扩展：查询显存使用状态
curl http://omlx:11434/api/omlx/vram/status
# 响应：
# {
#   "worker": "node-gpu-01",
#   "total_vram_gb": 80,
#   "used_vram_gb": 54.2,
#   "loaded_models": [
#     {"name": "qwen2.5:72b", "vram_gb": 45.5, "state": "active"},
#     {"name": "qwen2.5:7b", "vram_gb": 5.8, "state": "keep-alive", "ttl": "3m"},
#     {"name": "nomic-embed", "vram_gb": 1.5, "state": "idle"}
#   ]
# }

生产环境建议

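Give high-traffic models a long Keep-Alive (for example 30m to 1h) and low-traffic models 1m to 5m, or load them on demand. Combine this with a scheduled unload (a CronJob) that frees VRAM during off-peak hours at night.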

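10 Which inference backends does OMLX support? What characterizes llama.cpp, vLLM, SGLang, and TensorRT-LLM?

Answer:

OMLX uses a backend abstraction layer, so several inference engines can serve as a Worker's underlying runtime. The backend is specified in the Modelfile or in an API parameter, and OMLX selects a matching Worker node automatically.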

推理后端对比

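Feature	llama.cpp	vLLM	SGLang	TensorRT-LLM
Core strengths	Runs on CPU and GPU, native GGUF support, wide range of quantization schemes	PagedAttention, continuous batching, high throughput	Structured generation, RadixAttention, efficient prefix caching	Aggressive GPU optimization, low latency, TensorRT compilation
Model format	GGUF	HuggingFace safetensors	HuggingFace safetensors	TensorRT Engine
Quantization	Q2_K to Q8_0, FP16	FP16/BF16, AWQ/GPTQ/FP8	FP16/BF16, AWQ/GPTQ	FP16/BF16, INT8/INT4/FP8
PagedAttention	Not supported	Native	Via RadixAttention	Supported
Prefix caching	Supported (llama.cpp prompt cache)	Automatic Prefix Caching	Native via RadixAttention	Supported
Continuous batching	Limited	Native	Native	Native
Multi-GPU	Limited (layer or row split across GPUs)	TP/PP/DP	TP/PP/DP	TP/PP
Typical use	General inference, consumer GPUs, edge devices, quantized models	High-throughput online inference, API services	Structured output, long context, multi-turn chat	Low-latency real-time inference, embedded deployment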

后端选择指南

后端选择决策树：

模型是否为 GGUF 格式？
├── 是 → llama.cpp（天然兼容）
└── 否（HuggingFace safetensors）
      ├── 追求最高吞吐？ → vLLM
      ├── 需要结构化生成（JSON / Regex）？ → SGLang
      ├── 追求最低延迟？ → TensorRT-LLM
      ├── 长上下文多轮对话？ → SGLang（RadixAttention 优势）
      └── 通用在线推理？ → vLLM（生态最成熟）

OMLX 后端配置

# Modelfile 中指定后端
FROM qwen2.5:7b
BACKEND vllm                  # 指定推理后端
PARAMETER tensor_parallel_size 2

# API 请求中覆盖后端
curl http://omlx:11434/api/chat -d '{
  "model": "qwen2.5:7b",
  "messages": [{"role": "user", "content": "Hello"}],
  "options": {"omlx_backend": "sglang"}
}'

# Worker 配置中声明支持的后端
worker:
  backends:
    - name: vllm
      enabled: true
      models_path: /models/huggingface/
    - name: llama.cpp
      enabled: true
      models_path: /models/gguf/
    - name: sglang
      enabled: false

后端抽象层设计

OMLX Worker 推理抽象层
  │
  ├── Backend Interface（统一接口）
  │   ├── load(model_path, config) → ModelHandle
  │   ├── generate(handle, prompt, params) → Response
  │   ├── chat(handle, messages, params) → Response
  │   ├── embeddings(handle, texts) → [][]float32
  │   └── unload(handle)
  │
  └── Backend Implementation（各后端实现）
      ├── LlamaCppBackend
      ├── VLLMBackend
      ├── SGLangBackend
      └── TensorRTLLMBackend

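11 What does the OMLX deployment architecture on Kubernetes look like?

Answer:

On Kubernetes, OMLX is deployed as a set of microservices. Each component uses the workload type that fits its responsibilities.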

K8s 部署架构全景

graph TD
    INGRESS["Ingress / Gateway<br/>(Nginx / Traefik / Istio)<br/>:11434 (HTTP)"]

    subgraph CONTROL["控制平面"]
        S1["Server Deployment<br/>(Replicas: 2-3)<br/>HPA: CPU > 70%"]
        S2["Server Deployment<br/>(多副本水平扩展)"]
    end

    GW["Gateway Deployment<br/>(Replicas: 2-3)<br/>HPA: RPS > 100"]

    CTRL["Controller Deployment<br/>(Replicas: 1)<br/>单实例主节点"]

    subgraph WORKERS["Worker StatefulSet (Replicas: N)"]
        W0["Worker-0<br/>GPU: A100"]
        W1["Worker-1<br/>GPU: A10"]
        W2["Worker-2<br/>GPU: A10"]
        W3["Worker-3<br/>GPU: A10"]
    end

    subgraph STORAGE["Shared Storage"]
        MINIO["MinIO / S3 / CephFS (模型缓存)"]
        HARBOR["Harbor (模型 Registry)"]
    end

    INGRESS --> CONTROL
    CONTROL --> GW
    GW --> CTRL
    GW --> WORKERS
    WORKERS --> STORAGE

组件 Kubernetes 资源选型

组件	工作负载类型	副本数	扩缩策略	关键配置
Server	Deployment	2+	HPA（CPU/RPS）	Service（ClusterIP）+ Ingress
Gateway	Deployment	2+	HPA（RPS）	无状态，水平扩展
Controller	Deployment	1	不扩缩	单实例 + Leader Election，故障自动重建
Worker	StatefulSet	N	手动 / KEDA	固定标识 + GPU nodeSelector，每个 Pod 对应特定 GPU

Worker StatefulSet 优势

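Reasons for running Workers as a StatefulSet rather than a Deployment:

Stable Pod identity: identifiers such as Worker-0 and Worker-1 survive restarts, so the Controller's model → Worker mapping stays valid.
Persistent storage: each Worker binds its own PVC for the local cache, so the cache survives Pod re-creation.
Ordered management: rolling updates replace Pods one at a time by ordinal (highest ordinal first by default), so some Workers stay available throughout.
GPU affinity: nodeSelector plus topologySpreadConstraints place Workers on the right GPU nodes.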

GPU 节点要求

# GPU Worker Pod 配置示例
apiVersion: v1
kind: Pod
spec:
  nodeSelector:
    node.kubernetes.io/instance-type: gpu-worker
  tolerations:
    - key: "nvidia.com/gpu"
      operator: "Exists"
      effect: "NoSchedule"
  containers:
    - name: worker
      resources:
        limits:
          nvidia.com/gpu: 1

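12 How is OMLX deployed and configured with its Helm Chart?

Answer:

OMLX provides an official Helm Chart. A single values.yaml manages deployment parameters, resource sizing, high-availability settings, and monitoring integration for all components.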

安装流程

# 添加 OMLX Helm 仓库
helm repo add omlx https://charts.omlx.dev/stable
helm repo update

# 查看可用版本
helm search repo omlx

# 安装 OMLX Stack
helm install omlx omlx/omlx-stack \
  --namespace omlx-system \
  --create-namespace \
  --values custom-values.yaml

# 升级
helm upgrade omlx omlx/omlx-stack \
  --namespace omlx-system \
  --values custom-values.yaml

核心 values.yaml 配置项

# 全局配置
global:
  imageRegistry: "docker.io/omlx"
  imageTag: "v0.4.0"
  storageClass: "gp3"
  modelRegistry: "harbor.example.com/models"

# Server 组件
server:
  replicas: 3
  resources:
    requests:
      cpu: "2"
      memory: "4Gi"
    limits:
      cpu: "4"
      memory: "8Gi"
  hpa:
    enabled: true
    minReplicas: 2
    maxReplicas: 10
    targetCPUUtilizationPercentage: 70
  ingress:
    enabled: true
    className: "nginx"
    hosts:
      - omlx.example.com
    tls:
      - secretName: omlx-tls
        hosts:
          - omlx.example.com

# Gateway 组件
gateway:
  replicas: 3
  resources:
    requests:
      cpu: "1"
      memory: "2Gi"
  config:
    routeStrategy: "least-loaded"      # round-robin / least-loaded / random
    rateLimit:
      enabled: true
      requestsPerSecond: 100
    queueMaxSize: 1000

# Controller 组件
controller:
  replicas: 1
  config:
    leaderElection: true
    healthCheckInterval: 10s
    modelLoadTimeout: 300s

# Worker 组件
worker:
  replicas: 4
  resources:
    limits:
      nvidia.com/gpu: 1
    requests:
      cpu: "8"
      memory: "32Gi"
  config:
    keepAlive: "15m"
    maxLoadedModels: 4
    vramLimit: "38Gi"                # 单 Worker 显存上限
    backends:
      - name: "llama.cpp"
        enabled: true
      - name: "vllm"
        enabled: true
  nodeSelector:
    node.kubernetes.io/instance-type: "gpu-worker"
  persistence:
    enabled: true
    size: "200Gi"
    storageClass: "gp3"

# 模型预加载
models:
  preload:
    - "qwen2.5:7b-q4_k_m"
    - "nomic-embed-text"
    - "llama3.1:8b-q4_k_m"

# 监控集成
monitoring:
  enabled: true
  serviceMonitor:
    enabled: true
    interval: "30s"
  grafanaDashboard:
    enabled: true
    labels:
      grafana_dashboard: "1"

# 认证配置
auth:
  enabled: true
  apiKeys:
    - name: "admin-key"
      key: "omlx-admin-xxxxx"
      role: "admin"
  oidc:
    enabled: false

多环境 values 管理

# 目录结构
omlx-deploy/
├── base-values.yaml          # 公共基础配置
├── dev-values.yaml           # 开发环境覆盖
├── staging-values.yaml       # 预发环境覆盖
└── prod-values.yaml          # 生产环境覆盖

# 部署命令（Helm 合并多 values 文件）
helm install omlx omlx/omlx-stack \
  -f base-values.yaml \
  -f prod-values.yaml \
  --namespace omlx-system

Chart 依赖

OMLX Helm Chart 可选依赖：

依赖 Chart	用途	是否必须
NVIDIA GPU Operator	GPU 驱动与设备插件管理	推荐（若集群未安装）
MinIO	共享模型缓存存储	可选（多节点部署推荐）
Prometheus Stack	监控指标采集	可选（监控集成推荐）
cert-manager	TLS 证书管理	可选（生产环境推荐）

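13 What is the workflow for building a custom model with an OMLX Modelfile?

Answer:

OMLX is fully compatible with the Ollama Modelfile format. Directives such as FROM, PARAMETER, SYSTEM, TEMPLATE, and ADAPTER define the model configuration, and custom models can be built from a GGUF file or a HuggingFace model.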

Modelfile 完整指令

指令	说明	示例
`FROM`	基础模型来源（GGUF 文件 / Registry 路径 / HuggingFace）	`FROM qwen2.5:7b` / `FROM ./model.gguf` / `FROM hf://Qwen/Qwen2.5-7B-Instruct`
`PARAMETER`	推理参数（temperature / top_p / num_predict 等）	`PARAMETER temperature 0.7`
`SYSTEM`	System Prompt	`SYSTEM "You are a helpful code assistant."`
`TEMPLATE`	Chat Template（Go template 语法）	`TEMPLATE """{{ .System }}..."""`
`ADAPTER`	LoRA Adapter 文件	`ADAPTER ./lora-adapter.bin`
`LICENSE`	模型许可证	`LICENSE "Apache 2.0"`
`MESSAGE`	对话示例	`MESSAGE user "Hello"` / `MESSAGE assistant "Hi!"`
`BACKEND`	OMLX 扩展：推理后端指定	`BACKEND vllm`
`GPU_LAYERS`	OMLX 扩展：指定 GPU 层数	`GPU_LAYERS 33`

Modelfile 示例

# OMLX Modelfile - Qwen2.5 代码助手
FROM qwen2.5:7b-instruct-q8_0

# 推理参数
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER num_predict 2048
PARAMETER repeat_penalty 1.1
PARAMETER stop "</s>"
PARAMETER stop "User:"
PARAMETER stop "Assistant:"

# OMLX 扩展：指定推理后端
BACKEND vllm
PARAMETER tensor_parallel_size 1

# OMLX 扩展：显存管理
PARAMETER omlx_keep_alive "30m"

# 系统提示词
SYSTEM """You are an expert software engineer assistant specializing in Go and Python.
Follow these rules:
1. Write clean, idiomatic code
2. Include error handling
3. Add brief comments for complex logic
4. Prefer standard library over third-party packages"""

# 对话模板
TEMPLATE """{{ if .System }}<|system|>
{{ .System }}<|end|>
{{ end }}{{ if .Prompt }}<|user|>
{{ .Prompt }}<|end|>
{{ end }}<|assistant|>
{{ .Response }}<|end|>"""

# 对话示例
MESSAGE user "Write a Go function to read a YAML config file"
MESSAGE assistant "```go\npackage config\n\nimport (\n    \"fmt\"\n    \"os\"\n\n    \"gopkg.in/yaml.v3\"\n)\n\ntype Config struct {\n    Server   ServerConfig   `yaml:\"server\"`\n    Database DatabaseConfig `yaml:\"database\"`\n}\n\nfunc Load(path string) (*Config, error) {\n    data, err := os.ReadFile(path)\n    if err != nil {\n        return nil, fmt.Errorf(\"read config: %w\", err)\n    }\n    var cfg Config\n    if err := yaml.Unmarshal(data, &cfg); err != nil {\n        return nil, fmt.Errorf(\"parse config: %w\", err)\n    }\n    return &cfg, nil\n}\n```"

创建、测试与发布流程

# 1. 创建 Modelfile
cat > Modelfile.qwen-code << 'EOF'
FROM qwen2.5:7b-instruct-q8_0
SYSTEM "You are a code assistant."
PARAMETER temperature 0.3
EOF

# 2. 构建模型
curl http://omlx:11434/api/create -d '{
  "name": "qwen-code-assistant",
  "modelfile": "FROM qwen2.5:7b-instruct-q8_0\nSYSTEM \"You are a code assistant.\"\nPARAMETER temperature 0.3"
}'

# 或者用 ollama CLI
ollama create qwen-code-assistant -f Modelfile.qwen-code

# 3. 本地测试
ollama run qwen-code-assistant "Write a hello world in Rust"

# 4. 推送到 Registry
ollama push qwen-code-assistant

# 5. 在其他节点拉取
ollama pull qwen-code-assistant

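14 How does OMLX implement multi-node distributed inference?

Answer:

Multi-node distributed inference in OMLX spreads a model's computation across several GPU nodes, which cooperate over the network to serve a single inference request. Its main purpose is to run very large models (100B+) that exceed the GPU capacity of one node.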

分布式推理拓扑

graph TD
    CLIENT["Client Request"] --> S["Server"] --> GW["Gateway"]
    GW --> CTRL["Controller<br/>调度分布式推理任务"]
    CTRL --> WA["Worker-A<br/>Node-1<br/>GPU: x8<br/>Layers 0-26"]
    CTRL --> WB["Worker-B<br/>Node-2<br/>GPU: x8<br/>Layers 27-53"]
    CTRL --> WC["Worker-C<br/>Node-3<br/>GPU: x8<br/>Layers 54-79"]
    WA <-->|"RDMA"| WB
    WB <-->|"RDMA"| WC
    WA & WB & WC --> NET["高速网络（InfiniBand / RoCE）<br/>Pipeline Parallel 跨节点"]

分布式并行策略

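Strategy	Cross-node communication pattern	Communication volume	Best for
Cross-node Tensor Parallel	AllReduce on every layer	Very high	Not recommended (communication bottleneck)
Cross-node Pipeline Parallel	Activations passed between layer stages	Medium	Very large models (100B+)
Cross-node Expert Parallel	Token routing to experts	Low	MoE models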

Pipeline Parallel 跨节点流程

Step 1: Worker-A (Node-1) 计算 Layers 0-26
  → 输出激活值 → RDMA 传输 → Worker-B (Node-2)
Step 2: Worker-B (Node-2) 计算 Layers 27-53
  → 输出激活值 → RDMA 传输 → Worker-C (Node-3)
Step 3: Worker-C (Node-3) 计算 Layers 54-79
  → 输出 Logits → 采样 → Token → 返回 Gateway

Micro-Batch 调度：

gantt
    title Micro-Batch 流水线调度
    dateFormat X
    axisFormat %s
    section Node-1
    B1_L0-26    : 0, 1
    B2_L0-26    : 1, 2
    B3_L0-26    : 2, 3
    B4_L0-26    : 3, 4
    section Node-2
    B1_L27-53   : 1, 2
    B2_L27-53   : 2, 3
    B3_L27-53   : 3, 4
    B4_L27-53   : 4, 5
    section Node-3
    B1_L54-79   : 2, 3
    B2_L54-79   : 3, 4
    B3_L54-79   : 4, 5
    B4_L54-79   : 5, 6

Filling the pipeline with micro-batches reduces the time GPUs spend idle waiting (the pipeline bubble).
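The schedule in the chart follows a simple rule: with equal-length steps, stage s starts micro-batch b at time step s + b. The sketch below reproduces that rule and the resulting bubble fraction; it is an idealized model that ignores transfer time and unequal stage costs, and the function names are illustrative.

```python
def pipeline_schedule(num_stages, num_microbatches):
    """Map (stage, micro-batch) to its start step, plus the total step count."""
    schedule = {
        (stage, mb): stage + mb
        for stage in range(num_stages)
        for mb in range(num_microbatches)
    }
    total_steps = num_stages + num_microbatches - 1
    return schedule, total_steps


def bubble_fraction(num_stages, num_microbatches):
    """Share of each stage's time spent idle: (p - 1) / (m + p - 1)."""
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


# Three nodes and four micro-batches, as in the chart above.
schedule, total_steps = pipeline_schedule(num_stages=3, num_microbatches=4)
print(f"total steps: {total_steps}")
print(f"bubble with 4 micro-batches: {bubble_fraction(3, 4):.2f}")
print(f"bubble with 16 micro-batches: {bubble_fraction(3, 16):.2f}")
```

Raising the number of micro-batches shrinks the bubble, at the cost of smaller batches per step.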


OMLX distributed inference configuration

# Distributed inference Worker configuration
worker:
  distributed:
    enabled: true
    mode: "pipeline_parallel"       # pipeline_parallel / expert_parallel
    world_size: 8                   # total number of GPUs (across nodes)
    pp_size: 4                      # pipeline-parallel size (4 nodes)
    node_rank: 0                    # rank of this node in the pipeline
    master_addr: "worker-0.omlx-headless.omlx-system.svc.cluster.local"
    master_port: 29500
    transport: "nccl"               # NCCL / Gloo
    network:
      interface: "eth0"
      protocol: "RoCE"              # InfiniBand / RoCE / TCP

  # Network requirements
  # - InfiniBand HDR (200 Gbps) or RoCE v2 (100 Gbps)
  # - Inter-node latency < 10µs
  # - GPU Direct RDMA (GDR) enabled
关键网络要求

网络类型	带宽	延迟	适用节点数
InfiniBand HDR	200 Gbps	< 2µs	4+
InfiniBand EDR	100 Gbps	< 3µs	2-4
RoCE v2	100 Gbps	< 5µs	2-4
TCP/IP	25-100 Gbps	> 20µs	不推荐

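OMLX recommends cross-node distributed inference only for very large models (100B+ parameters) or when a single node cannot provide enough GPU capacity. For the same total compute, prefer scaling up (adding GPUs to one node) to avoid cross-node communication overhead.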

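15 What are OMLX's load balancing and request routing strategies?

Answer:

Load balancing and request routing are handled centrally by the Gateway. It supports several routing strategies and decides based on model placement, Worker load, session affinity, and similar factors.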

路由决策流程

Gateway 收到请求
  │
  ├── 1. 解析 model 名称 → 查 Model-Worker 映射表（从 Controller 同步）
  │
  ├── 2. 过滤候选 Worker 列表
  │     ├── 模型已加载到显存 → 优先级最高
  │     ├── 模型已在本地缓存（未加载到显存）→ 优先级中
  │     └── 模型未缓存（需 Pull）→ 优先级低
  │
  ├── 3. 应用会话亲和性（Session Affinity）
  │     └── 同一 Session ID 的请求路由至同一 Worker（复用 KV Cache）
  │
  ├── 4. 应用标签过滤（Label Selector）
  │     └── 按请求指定的 omlx_worker_label 过滤 Worker
  │
  ├── 5. 负载均衡算法选择目标 Worker
  │
  ├── 6. 检查目标 Worker 并发上限
  │     ├── 未达上限 → 转发请求
  │     └── 已达上限 → 排队或回退至次优 Worker
  │
  └── 7. 转发请求 → 等待 Worker 响应

负载均衡算法

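Algorithm	Config value	How it works	Best for
Least loaded	`least-loaded`	Picks the Worker with the fewest active requests	General use; the default strategy
Round robin	`round-robin`	Cycles through all candidate Workers in turn	Workers with equal compute capacity
Weighted round robin	`weighted-round-robin`	Distributes requests in proportion to Worker weights	Workers with unequal compute capacity
Least latency	`least-latency`	Picks the Worker with the lowest historical average latency	Latency-sensitive workloads
Random	`random`	Picks a Worker at random	Simple testing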
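Two of these algorithms can be sketched in a few lines. This is an illustration rather than Gateway code: the names are hypothetical, and the weighted variant uses the "smooth" weighted round-robin scheme, which is one common way to realize proportional distribution.

```python
def least_loaded(active_requests):
    """Pick the Worker with the fewest active requests (ties broken by name)."""
    return min(active_requests, key=lambda name: (active_requests[name], name))


class WeightedRoundRobin:
    """Smooth weighted round robin: picks are proportional to weight and interleaved."""

    def __init__(self, weights):
        self.weights = dict(weights)
        self.current = {name: 0 for name in self.weights}

    def pick(self):
        total = sum(self.weights.values())
        for name, weight in self.weights.items():
            self.current[name] += weight
        chosen = max(self.current, key=self.current.get)
        self.current[chosen] -= total
        return chosen


print(least_loaded({"worker-1": 4, "worker-2": 1, "worker-3": 1}))

# worker-a has twice the weight of worker-b, so it gets two of every three picks.
wrr = WeightedRoundRobin({"worker-a": 2, "worker-b": 1})
picks = [wrr.pick() for _ in range(6)]
print(picks)
```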

会话亲和性配置

gateway:
  sessionAffinity:
    enabled: true
    mode: "header"                  # header / cookie
    headerName: "X-OMLX-Session-Id"
    ttl: "30m"                      # 亲和性保持时长
    fallbackOnUnavailable: true     # 目标 Worker 不可用时的回退策略

请求队列与过载保护

gateway:
  queue:
    enabled: true
    maxSize: 1000                   # 全局最大排队请求数
    perWorkerMaxSize: 100           # 单 Worker 最大排队请求数
    timeout: "60s"                  # 排队超时
    overflowPolicy: "reject"        # reject / drop-head / drop-tail

  rateLimit:
    enabled: true
    requestsPerSecond: 500          # 全局 QPS 限制
    burstSize: 100                  # 突发容量
    perModelRateLimit:              # 按模型粒度限制
      "qwen2.5:72b":
        requestsPerSecond: 50
      "qwen2.5:7b":
        requestsPerSecond: 200
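# A rate limit with a burst size is commonly implemented as a token bucket. The sketch below
# illustrates how requestsPerSecond and burstSize could interact; it is an assumption about
# the mechanism, not OMLX's implementation, and the class name is hypothetical.

```python
class TokenBucket:
    """Token bucket: refills at `rate` tokens per second, holds at most `burst`."""

    def __init__(self, rate, burst, now=0.0):
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)
        self.updated = now

    def allow(self, now):
        """Consume one token if available; `now` is a timestamp in seconds."""
        elapsed = now - self.updated
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Timestamps are passed in explicitly so the example is deterministic.
bucket = TokenBucket(rate=10, burst=2)
burst_results = [bucket.allow(0.0) for _ in range(3)]
too_soon = bucket.allow(0.05)  # only half a token has been refilled
later = bucket.allow(1.0)      # bucket has refilled up to the burst cap
print(burst_results, too_soon, later)
```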

路由表的动态更新

Controller 维护集群拓扑

Worker-0 (node-gpu-01)：[qwen2.5:72b(active), nomic-embed(idle)]
Worker-1 (node-gpu-02)：[qwen2.5:7b(active), codellama:7b(active)]
Worker-2 (node-gpu-03)：[qwen2.5:7b(active)]
Worker-3 (node-gpu-04)：[llama3:8b(keep-alive)]

↓ Controller 发生变更事件（模型加载 / 卸载 / Worker 上下线）→ 推送更新

Gateway 路由表更新：
  qwen2.5:72b → [Worker-0]
  qwen2.5:7b  → [Worker-1, Worker-2]  (least-loaded 选择)
  codellama:7b → [Worker-1]
  llama3:8b   → [Worker-3]
  nomic-embed → [Worker-0]

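16 How does OMLX implement the Embedding API and vector retrieval?

Answer:

OMLX is fully compatible with Ollama's Embedding API (POST /api/embeddings). It turns text into vector representations and can be combined with a vector database for semantic retrieval.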

Embedding API 使用

# 单文本向量化
curl http://omlx:11434/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": "The sky is blue because of Rayleigh scattering."
}'

# 响应
{
  "embedding": [0.0123, -0.0456, 0.0789, ...],  // 768/1024/4096 维向量
  "model": "nomic-embed-text"
}

# 批量向量化（OMLX 扩展）
curl http://omlx:11434/api/omlx/embeddings/batch -d '{
  "model": "nomic-embed-text",
  "input": [
    "What is Kubernetes?",
    "How to deploy a Pod?",
    "Explain container orchestration."
  ]
}'

常用 Embedding 模型

模型	向量维度	最大 Token	适用场景
nomic-embed-text	768	8192	通用英文/多语言文本向量化
mxbai-embed-large	1024	512	高精度语义检索
bge-m3	1024	8192	多语言检索、稠密+稀疏混合
all-minilm	384	512	轻量级快速检索
snowflake-arctic-embed	1024	8192	长文档向量化

OMLX Embedding 部署架构

graph LR
    subgraph EMB["Embedding 部署架构"]
        APP["Application<br/>(RAG / Search)"]
        OMLX["OMLX Server<br/>/api/embeddings<br/>nomic-embed<br/>(独立 Worker)"]
        VDB["向量数据库<br/>(Milvus/Qdrant/Weaviate)"]
        LOGIC["应用逻辑<br/>Query -> Embedding<br/>-> Search -> LLM"]
        APP --> OMLX
        APP --> VDB
        APP --> LOGIC
    end

Embedding 模型独立部署

Embedding 模型与生成模型在 OMLX 中可部署在不同的 Worker 节点上，避免竞争生成模型的 GPU 资源：

# 专用 Embedding Worker
worker:
  embedding:
    dedicated: true
    replicas: 2
    models:
      - "nomic-embed-text"
      - "bge-m3"
    resources:
      limits:
        nvidia.com/gpu: 1       # 仅需少量显存（1-2GB）
    config:
      keepAlive: "24h"           # Embedding 模型始终驻留显存

RAG Pipeline 示例

import ollama
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = ollama.Client(host="http://omlx:11434")
qdrant = QdrantClient(host="qdrant.example.com")
COLLECTION = "knowledge_base"

# 1. Embed documents
def embed_documents(docs: list[str]) -> list[list[float]]:
    embeddings = []
    for doc in docs:
        resp = client.embeddings(model="nomic-embed-text", prompt=doc)
        embeddings.append(resp["embedding"])
    return embeddings

# 2. Store documents and vectors in Qdrant
def index_documents(docs: list[str]) -> None:
    embeddings = embed_documents(docs)
    # Create the collection on first use, sized to the embedding dimension
    if not qdrant.collection_exists(COLLECTION):
        qdrant.create_collection(
            collection_name=COLLECTION,
            vectors_config=VectorParams(size=len(embeddings[0]), distance=Distance.COSINE),
        )
    qdrant.upsert(
        collection_name=COLLECTION,
        points=[
            PointStruct(id=i, vector=emb, payload={"text": doc})
            for i, (doc, emb) in enumerate(zip(docs, embeddings))
        ],
    )

# 3. Retrieval-augmented generation (RAG)
def rag_query(question: str) -> str:
    # Embed the question
    q_emb = client.embeddings(model="nomic-embed-text", prompt=question)["embedding"]
    # Vector search
    results = qdrant.query_points(collection_name=COLLECTION, query=q_emb, limit=3).points
    # Assemble the context
    context = "\n".join(r.payload["text"] for r in results)
    # Generate with the LLM
    resp = client.chat(model="qwen2.5:7b", messages=[
        {"role": "system", "content": f"Answer based on context:\n{context}"},
        {"role": "user", "content": question}
    ])
    return resp["message"]["content"]

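17 How does OMLX support Vision models (multimodal inference)?

Answer:

OMLX supports Vision models through compatibility with Ollama's multimodal API. Chat and Generate requests can carry Base64-encoded images, and the model interprets the text and the image content together.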

支持的多模态模型

模型	视觉编码器	支持的分辨率	GGUF 可用	HuggingFace 可用
LLaVA 1.6	CLIP-ViT-L	336×336 / 672×672	是	是
Qwen2-VL	ViT-bigG	任意分辨率	部分	是
InternVL2	InternViT	动态分辨率	是	是
MiniCPM-V	SigLIP	448×448	是	是
Phi-3.5-Vision	CLIP	336×336	是	是

Vision API 调用

# Single-image inference (with GNU coreutils, use `base64 -w0 image.jpg` so the output is not line-wrapped)
curl http://omlx:11434/api/chat -d '{
  "model": "llava:13b",
  "messages": [
    {
      "role": "user",
      "content": "Describe this image in detail.",
      "images": ["'$(base64 -i image.jpg)'"]
    }
  ],
  "stream": false
}'

# 多张图片推理
curl http://omlx:11434/api/generate -d '{
  "model": "minicpm-v:8b",
  "prompt": "Compare these two charts. Which one shows better performance?",
  "images": ["'$(base64 -i chart1.png)'", "'$(base64 -i chart2.png)'"]
}'

# OMLX 扩展：图片 URL 直接传入
curl http://omlx:11434/api/omlx/chat -d '{
  "model": "qwen2-vl:7b",
  "messages": [
    {
      "role": "user",
      "content": "What is in this image?",
      "image_urls": ["https://example.com/image.jpg"]
    }
  ]
}'

Vision 模型显存需求

Model	Language model VRAM (Q4_K_M)	Vision encoder VRAM	Total	Recommended GPU
LLaVA 7B	~4.5 GB	~1.5 GB	~6 GB	A10 / 3080
LLaVA 13B	~8 GB	~1.5 GB	~9.5 GB	A100-40G
Qwen2-VL 7B	~4.5 GB	~2 GB	~6.5 GB	A10 / 3090
InternVL2 8B	~5 GB	~3 GB	~8 GB	A100-40G
MiniCPM-V 8B	~5 GB	~1.5 GB	~6.5 GB	A10 / 3090

多模态推理流水线

graph TD
    Input["图片输入"]
    Preprocess["图片预处理<br/>Resize 到模型要求的输入尺寸<br/>Normalize 归一化<br/>转换为 Tensor"]
    VisionEncoder["视觉编码器<br/>（Vision Encoder / ViT）<br/>提取视觉特征（Visual Tokens / Image Embeddings）"]
    Projection["特征投影层<br/>（Projection Layer）<br/>将视觉特征映射到 LLM Embedding 空间"]
    LLM["大语言模型<br/>（LLM Backbone）<br/>文本 Token Embeddings + Image Embeddings<br/>Transformer 层处理多模态输入"]
    Output["文本解码输出"]

    Input --> Preprocess
    Preprocess --> VisionEncoder
    VisionEncoder --> Projection
    Projection --> LLM
    LLM --> Output

18 OMLX 的 Tool Calling / Function Calling 是如何实现的？

答案：

OMLX 兼容 Ollama 的 Tool Calling 功能，模型可输出结构化的函数调用请求，由调用方执行函数后将结果返回模型继续对话，实现 Agent 工作流。

Tool Calling 流程

graph TD
    Input["用户输入 + Tools 定义"]
    Analysis["OMLX → 模型分析 → 输出 Tool Call（JSON）"]
    Execute["调用方解析 Tool Call → 执行函数 → 获取结果"]
    Result["Tool Result 回传 OMLX"]
    Final["模型根据结果继续生成最终回复"]

    Input --> Analysis
    Analysis --> Execute
    Execute --> Result
    Result --> Final

API 调用示例

# 定义 Tools 并发起请求
curl http://omlx:11434/api/chat -d '{
  "model": "qwen2.5:14b",
  "messages": [
    {"role": "user", "content": "What is the weather in Beijing?"}
  ],
  "stream": false,
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get current weather for a given city",
        "parameters": {
          "type": "object",
          "properties": {
            "city": {
              "type": "string",
              "description": "City name, e.g. Beijing"
            },
            "unit": {
              "type": "string",
              "enum": ["celsius", "fahrenheit"],
              "default": "celsius"
            }
          },
          "required": ["city"]
        }
      }
    },
    {
      "type": "function",
      "function": {
        "name": "get_stock_price",
        "description": "Get current stock price for a ticker",
        "parameters": {
          "type": "object",
          "properties": {
            "ticker": {
              "type": "string",
              "description": "Stock ticker symbol"
            }
          },
          "required": ["ticker"]
        }
      }
    }
  ]
}'

# 模型返回 Tool Call
{
  "message": {
    "role": "assistant",
    "content": "",
    "tool_calls": [
      {
        "function": {
          "name": "get_weather",
          "arguments": {
            "city": "Beijing",
            "unit": "celsius"
          }
        }
      }
    ]
  }
}

# 调用方执行函数后，将结果回传
curl http://omlx:11434/api/chat -d '{
  "model": "qwen2.5:14b",
  "messages": [
    {"role": "user", "content": "What is the weather in Beijing?"},
    {
      "role": "assistant",
      "content": "",
      "tool_calls": [{"function": {"name": "get_weather", "arguments": {"city": "Beijing"}}}]
    },
    {"role": "tool", "content": "Beijing: 25°C, sunny, humidity 45%"}
  ],
  "stream": false
}'
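On the caller's side, the middle step of this round trip (parse the tool call, run the function, build the `tool` message) might look like the sketch below. The `get_weather` implementation and the `TOOLS` registry are hypothetical stand-ins; only the message shapes follow the examples above.

```python
import json


def get_weather(city: str, unit: str = "celsius") -> str:
    """Hypothetical local tool; a real one would query a weather service."""
    return f"{city}: 25 degrees {unit}, sunny"


# Registry mapping tool names (as declared in the request's "tools") to callables.
TOOLS = {"get_weather": get_weather}


def run_tool_calls(assistant_message: dict) -> list[dict]:
    """Execute every tool call in an assistant message and return tool-role messages."""
    results = []
    for call in assistant_message.get("tool_calls", []):
        fn = call["function"]
        name, args = fn["name"], fn.get("arguments", {})
        if isinstance(args, str):  # tolerate arguments delivered as a JSON string
            args = json.loads(args)
        handler = TOOLS.get(name)
        if handler is None:
            content = f"error: unknown tool {name}"
        else:
            content = handler(**args)
        results.append({"role": "tool", "content": content})
    return results
```

The returned messages are appended after the assistant message in the next `/api/chat` request, exactly as in the final curl call above.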

Tool Calling 最佳模型

模型	Tool Calling 质量	备注
qwen2.5:14b-instruct	优秀	原生训练支持 Tool Use
qwen2.5:32b-instruct	优秀	多函数选择准确率高
llama3.1:8b	良好	需明确 System Prompt
mistral-nemo:12b	良好	函数调用格式稳定
command-r:35b	优秀	Cohere 原生的多步 Tool Use

OMLX 并行 Tool Call

支持一次返回多个 Tool Call，并行执行：

// 模型可能返回多个独立 Tool Call
{
  "tool_calls": [
    {"function": {"name": "get_weather", "arguments": {"city": "Beijing"}}},
    {"function": {"name": "get_weather", "arguments": {"city": "Shanghai"}}}
  ]
}
// 调用方可并行执行这两个函数，减少总延迟
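A caller could run independent tool calls concurrently with a thread pool, as in this sketch. The `get_weather` function is a hypothetical stand-in whose `sleep` simulates I/O latency; `execute_parallel` is an illustrative helper, not part of any SDK.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def get_weather(city: str) -> str:
    """Hypothetical I/O-bound tool; the sleep stands in for a network call."""
    time.sleep(0.05)
    return f"{city}: sunny"


def execute_parallel(tool_calls: list, registry: dict, max_workers: int = 4) -> list:
    """Run independent tool calls concurrently; results keep the order of tool_calls."""
    def run(call):
        fn = call["function"]
        return registry[fn["name"]](**fn["arguments"])

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, regardless of completion order.
        return list(pool.map(run, tool_calls))
```

Threads suit tools that wait on network or disk; total latency then approaches that of the slowest call rather than the sum of all calls.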

Tool Call 超时与容错

# OMLX Gateway 配置
gateway:
  toolCalling:
    maxIterations: 10              # 最大 Tool Call 轮次
    toolExecutionTimeout: "30s"    # 单次工具执行超时
    maxToolResultTokens: 4096      # 工具返回结果最大 Token

19 OMLX 的流式输出（Streaming Response / SSE）是如何实现的？

答案：

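OMLX streams generated tokens as they are produced when `stream: true` is set, which lowers time to first byte (TTFB) and makes interactive use feel responsive. This document shows the stream with SSE (Server-Sent Events) `data:` framing. Native Ollama's `/api/generate` and `/api/chat` instead stream newline-delimited JSON (one object per line, no `data:` prefix), so a client written against native Ollama should be checked against the framing OMLX actually emits.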

SSE 流式响应格式

# 请求（stream: true）
curl http://omlx:11434/api/generate -d '{
  "model": "qwen2.5:7b",
  "prompt": "Explain quantum computing in simple terms.",
  "stream": true
}'

# SSE 响应流
data: {"model":"qwen2.5:7b","created_at":"2025-01-01T00:00:00Z","response":"Quantum","done":false}
data: {"model":"qwen2.5:7b","created_at":"2025-01-01T00:00:00Z","response":" computing","done":false}
data: {"model":"qwen2.5:7b","created_at":"2025-01-01T00:00:00Z","response":" uses","done":false}
# ... 持续流式返回 ...
data: {"model":"qwen2.5:7b","created_at":"2025-01-01T00:00:00Z","response":"","done":true,"total_duration":3250000000,"load_duration":1200000000,"prompt_eval_count":12,"prompt_eval_duration":150000000,"eval_count":156,"eval_duration":2930000000}

Chat 流式响应

curl http://omlx:11434/api/chat -d '{
  "model": "qwen2.5:7b",
  "messages": [{"role": "user", "content": "Hello"}],
  "stream": true
}'

# Chat 流式响应格式
data: {"message":{"role":"assistant","content":"Hello"},"done":false}
data: {"message":{"role":"assistant","content":"!"},"done":false}
data: {"message":{"role":"assistant","content":" How"},"done":false}
data: {"message":{"role":"assistant","content":" can"},"done":false}
data: {"message":{"role":"assistant","content":" I"},"done":false}
data: {"message":{"role":"assistant","content":" help"},"done":false}
data: {"message":{"role":"assistant","content":" you"},"done":false}
data: {"message":{"role":"assistant","content":" today"},"done":false}
data: {"message":{"role":"assistant","content":"?"},"done":false}
data: {"done":true,"total_duration":1850000000,"prompt_eval_count":10,"eval_count":9}
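A client can reassemble the streamed text with a small parser. The sketch below is an illustration, not a client library: it accepts lines with or without a `data:` prefix, so it works whether chunks arrive as SSE events or as bare newline-delimited JSON. The function name `collect_stream` is hypothetical.

```python
import json
from typing import Iterable


def collect_stream(lines: Iterable[str]) -> tuple[str, dict]:
    """Concatenate streamed chunks; return (full_text, final_chunk_with_stats)."""
    parts = []
    final = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(":"):  # blank separators and SSE comments
            continue
        if line.startswith("data:"):
            line = line[len("data:"):].strip()
        chunk = json.loads(line)
        if "message" in chunk:      # /api/chat chunk
            parts.append(chunk["message"].get("content", ""))
        elif "response" in chunk:   # /api/generate chunk
            parts.append(chunk["response"])
        if chunk.get("done"):
            final = chunk
            break
    return "".join(parts), final
```

In practice `lines` would be the line iterator of a streaming HTTP response; the final chunk carries the timing and token-count fields shown above.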

流式输出架构

graph TD
    Client["Client"]
    SSE["SSE Connection<br/>（long-lived HTTP response stream）"]
    Server["OMLX Server"]
    GRPC1["gRPC Stream"]
    Gateway["OMLX Gateway"]
    GRPC2["gRPC Stream"]
    Worker["OMLX Worker"]
    Backend["推理后端（vLLM / llama.cpp）<br/>逐 Token 生成"]

    Client --> SSE
    SSE --> Server
    Server --> GRPC1
    GRPC1 --> Gateway
    Gateway --> GRPC2
    GRPC2 --> Worker
    Worker --> Backend

OMLX 的流式增强

增强特性	说明
断流重连	Worker 故障时自动切换到备用 Worker，从中断前的 Token 位置继续（需 KV Cache 转移）
背压控制	Gateway 检测 Client 消费速率低于生成速率时，暂缓 Worker 的 Token 生成
多路复用	同一 HTTP/2 连接承载多个 SSE 流，减少连接数
流式指标	每个 SSE 流的 TTFT / TPOT / Total Duration 记录在 Prometheus 指标中
流式超时	`stream_timeout` 配置项控制单个 SSE 连接的最大持续时间，默认 600s

非流式 vs 流式对比

维度	非流式（stream: false）	流式（stream: true）
首字节时间（TTFB）	等待全部 Token 生成完成	首个 Token 即可返回
客户端感知延迟	高（需等待全部生成）	低（逐 Token 展示）
网络传输	单次 JSON 响应	持续 SSE 事件流
连接占用	短连接	长连接（需配置超时）
适用场景	批处理 / API 调用 / 非交互	对话 UI / 实时交互

20 OMLX 的并发请求与吞吐量管理是如何实现的？

答案：

OMLX 通过 Worker 级别的并发控制、Dynamic Batching、请求排队和过载保护四层机制管理并发请求与系统吞吐量。

并发控制层级

graph TD
    L1["层级 1：Gateway 速率限制（Rate Limiting）<br/>全局 QPS 限制 + 单模型 QPS 限制<br/>超过限制 → HTTP 429 Too Many Requests"]
    L2["层级 2：Gateway 请求队列（Request Queuing）<br/>Worker 并发满时排队等待<br/>超过队列长度 / 超时 → HTTP 503 Service Unavailable"]
    L3["层级 3：Worker 并发上限（Concurrency Limit）<br/>单 Worker 最大并发请求数<br/>到达上限 → 回退至队列或拒绝"]
    L4["层级 4：推理后端批处理（Dynamic Batching）<br/>vLLM / SGLang 将并发请求合并为 Batch<br/>提升 GPU 利用率和吞吐量"]
    GPU["GPU 执行推理"]

    L1 --> L2
    L2 --> L3
    L3 --> L4
    L4 --> GPU

关键配置参数

gateway:
  concurrency:
    maxGlobalConcurrency: 500          # 全局最大并发请求
    defaultModelMaxConcurrency: 100    # 单模型默认最大并发
    modelOverrides:                    # 按模型粒度覆盖
      "qwen2.5:72b":
        maxConcurrency: 50
      "qwen2.5:7b":
        maxConcurrency: 200
    maxQueueSize: 2000
    queueTimeout: "120s"

worker:
  concurrency:
    maxRequestsPerWorker: 50           # 单 Worker 最大并发请求
    maxBatchSize: 32                   # 最大批处理大小
    maxWaitingRequests: 200            # 单 Worker 最大等待请求
    batchTimeout: "50ms"               # 批处理等待窗口（动态 batching 最大等待时间）

Dynamic Batching 机制

时间轴（每个槽 = 10ms）
────────────────────────────────────────────────────────────→

Request Arrival:
Req-1 ───[B1][B1][B1][B1][B1]────────────────────→
Req-2 ──────[B1][B1][B1][B1][B1]─────────────────→
Req-3 ──────────[B2][B2][B2][B2][B2]─────────────→
Req-4 ──────────[B2][B2][B2][B2][B2]─────────────→
Req-5 ───────────────────[B3][B3][B3][B3]────────→

Batch 合并：
Batch-1 = [Req-1, Req-2]    （在 batchTimeout 50ms 窗口内到达的请求合并）
Batch-2 = [Req-3, Req-4]    （下一批）
Batch-3 = [Req-5]            （窗口内只有一个请求）
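The windowing rule in the timeline can be simulated in a few lines. This sketch assumes a batch opens when its first request arrives and closes once `window_ms` has elapsed or `max_batch_size` is reached; it models only the window-based grouping drawn above, not the iteration-level scheduling that engines such as vLLM perform internally.

```python
def form_batches(arrivals_ms: list, window_ms: int = 50, max_batch_size: int = 32) -> list:
    """Group request arrival times (in ms) into batches using a fixed wait window."""
    batches = []
    current = []
    window_start = None
    for t in sorted(arrivals_ms):
        # Close the open batch if its window has expired or it is already full.
        if current and (t - window_start >= window_ms or len(current) >= max_batch_size):
            batches.append(current)
            current = []
        if not current:
            window_start = t  # the first request of a batch opens a new window
        current.append(t)
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    # Five requests, loosely following the timeline above.
    print(form_batches([0, 10, 60, 65, 150]))
```

A longer window produces larger batches and better GPU utilization at the cost of added queueing delay for the first request in each batch.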

吞吐量估算公式

Theoretical throughput (tokens/s) =
  min(
    decode speed per replica (tokens/s) × GPU count / TP_SIZE,
    Batch Size × Avg Output Tokens / (Prefill Time + Decode Time per Token × Avg Output Tokens)
  ) × GPU utilization

Example (vLLM, qwen2.5:7b, 4×A10, TP=1, batch size 32):
  Decode speed ≈ 150 tokens/s/GPU
  Throughput ≈ 150 × 4 × 0.85 (utilization) ≈ 510 tokens/s

Example (vLLM, qwen2.5:72b, 2×A100-80G, TP=2):
  Decode speed ≈ 45 tokens/s (the 2 GPUs form a single TP group)
  Throughput ≈ 45 × 0.85 ≈ 38 tokens/s
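The estimate can be written as two small functions. This is a sketch of the formula above under one stated assumption: the decode speed is measured per replica, where a replica is one tensor-parallel group of `tp_size` GPUs. The function names are illustrative.

```python
def compute_bound(tokens_per_s_per_replica: float, num_gpus: int, tp_size: int) -> float:
    """Aggregate decode speed across replicas; each replica spans tp_size GPUs."""
    replicas = num_gpus // tp_size
    return tokens_per_s_per_replica * replicas


def batch_bound(batch_size: int, avg_output_tokens: float,
                prefill_s: float, decode_s_per_token: float) -> float:
    """Tokens per second when one batch is processed end to end."""
    return batch_size * avg_output_tokens / (prefill_s + decode_s_per_token * avg_output_tokens)


def estimate_throughput(compute: float, batch: float = float("inf"),
                        utilization: float = 0.85) -> float:
    """Take the tighter of the two bounds and scale by expected GPU utilization."""
    return min(compute, batch) * utilization


if __name__ == "__main__":
    print(estimate_throughput(compute_bound(150, num_gpus=4, tp_size=1)))  # qwen2.5:7b example
    print(estimate_throughput(compute_bound(45, num_gpus=2, tp_size=2)))   # qwen2.5:72b example
```

The worked examples above use only the first bound; passing a `batch` value lets the second term of the `min` take effect when batches are small or prefill dominates.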

过载保护

保护机制	触发条件	行为
Gateway Rate Limit	QPS 超过限制	HTTP 429
Queue Overflow	队列长度超过 maxQueueSize	HTTP 503
Worker OOM Guard	显存使用 > 95%	拒绝新请求 + 触发模型卸载
Worker Timeout	单请求执行 > maxTimeout	终止推理 + HTTP 504
Shedding	Gateway CPU > 90%	随机舍弃低优先级请求

21 OMLX 的监控与指标（Prometheus / Grafana）是如何实现的？

答案：

OMLX 所有组件原生暴露 Prometheus Metrics 端点，通过 ServiceMonitor 自动接入 Prometheus Stack，内置 Grafana Dashboard 提供开箱即用的可视化。

核心 Prometheus 指标

Server 指标

指标名	类型	标签	说明
`omlx_server_requests_total`	Counter	model, endpoint, status	请求总数
`omlx_server_request_duration_seconds`	Histogram	model, endpoint	请求处理耗时
`omlx_server_active_connections`	Gauge	—	活跃 SSE 连接数
`omlx_server_requests_in_flight`	Gauge	model	处理中的请求数

Gateway 指标

指标名	类型	标签	说明
`omlx_gateway_route_decisions_total`	Counter	model, worker, strategy	路由决策统计
`omlx_gateway_queue_length`	Gauge	model	当前排队请求数
`omlx_gateway_queue_wait_seconds`	Histogram	model	排队等待时长
`omlx_gateway_rate_limited_total`	Counter	model, reason	被限流请求数

Worker 指标

指标名	类型	标签	说明
`omlx_worker_vram_used_bytes`	Gauge	worker, model	显存使用量
`omlx_worker_vram_total_bytes`	Gauge	worker	显存总容量
`omlx_worker_models_loaded`	Gauge	worker	已加载模型数
`omlx_worker_inference_tokens_total`	Counter	worker, model	生成 Token 总数
`omlx_worker_inference_duration_seconds`	Histogram	worker, model, phase(ttft/tpot)	推理耗时分布
`omlx_worker_request_queue_depth`	Gauge	worker	Worker 队列深度
`omlx_worker_gpu_utilization_percent`	Gauge	worker, gpu_id	GPU 利用率
`omlx_worker_gpu_memory_used_bytes`	Gauge	worker, gpu_id	GPU 显存使用量
`omlx_worker_model_load_duration_seconds`	Histogram	worker, model	模型加载耗时

Controller 指标

指标名	类型	标签	说明
`omlx_controller_workers_healthy`	Gauge	—	健康 Worker 数量
`omlx_controller_workers_total`	Gauge	—	注册 Worker 总数
`omlx_controller_rebalance_total`	Counter	reason	重平衡触发次数

Grafana Dashboard 面板

OMLX 内置 Grafana Dashboard 包含以下关键面板：

Panel	Metric	Visualization
Overall QPS	`rate(omlx_server_requests_total[5m])`	Time Series
P50/P95/P99 latency	`histogram_quantile(0.95, sum(rate(omlx_server_request_duration_seconds_bucket[5m])) by (le, model))`	Time Series
Error rate	`sum(rate(omlx_server_requests_total{status=~"4..|5.."}[5m])) / sum(rate(omlx_server_requests_total[5m]))`
QPS per model	`sum(rate(omlx_server_requests_total[5m])) by (model)`	Bar Gauge
GPU utilization	`avg(omlx_worker_gpu_utilization_percent) by (worker, gpu_id)`	Time Series
VRAM usage	`omlx_worker_vram_used_bytes / omlx_worker_vram_total_bytes`	Gauge
Token generation rate	`rate(omlx_worker_inference_tokens_total[5m])`	Time Series
Queue depth	`omlx_gateway_queue_length`	Time Series
Worker health	`omlx_controller_workers_healthy`	Stat
TTFT distribution	`histogram_quantile(0.50, rate(omlx_worker_inference_duration_seconds_bucket{phase="ttft"}[5m]))`	Time Series

监控集成配置

monitoring:
  enabled: true
  serviceMonitor:
    enabled: true
    interval: "30s"
    labels:
      release: "prometheus"
  grafanaDashboard:
    enabled: true
    labels:
      grafana_dashboard: "1"

  # Alert Rules
  alerting:
    enabled: true
    rules:
      - alert: OMLXHighErrorRate
        expr: |
          sum(rate(omlx_server_requests_total{status=~"5.."}[5m]))
          / sum(rate(omlx_server_requests_total[5m])) > 0.05          
        for: 5m
        severity: critical
        annotations:
          summary: "OMLX error rate exceeds 5%"

      - alert: OMLXWorkerDown
        expr: omlx_controller_workers_healthy < omlx_controller_workers_total
        for: 2m
        severity: critical
        annotations:
          summary: "OMLX Worker node is down"

      - alert: OMLXVRAMHigh
        expr: omlx_worker_vram_used_bytes / omlx_worker_vram_total_bytes > 0.95
        for: 5m
        severity: warning
        annotations:
          summary: "OMLX VRAM usage exceeds 95%"

      - alert: OMLXQueueHigh
        expr: omlx_gateway_queue_length > 100
        for: 5m
        severity: warning
        annotations:
          summary: "OMLX request queue depth exceeds 100"

22 OMLX 的认证与访问控制（API Key / RBAC）是如何实现的？

答案：

OMLX 的认证与访问控制通过 API Key 认证 + RBAC 授权模型实现，支持静态密钥、OIDC 集成和基于角色的权限管理。

认证架构

graph TD
    Client["Client Request<br/>携带认证凭证<br/>Header: Authorization: Bearer omlx-xxx<br/>Header: Authorization: Bearer jwt_token<br/>Header: X-OMLX-API-Key: omlx-xxx"]
    Server["OMLX Server（认证中间件）<br/>验证 API Key → 查询本地 Key Store<br/>验证 JWT Token → 调用 OIDC Provider（Keycloak / Auth0）"]
    RBAC["RBAC 鉴权<br/>提取用户角色（admin / developer / viewer）<br/>检查角色是否有权限执行当前操作"]
    Gateway["转发至 Gateway"]

    Client --> Server
    Server --> RBAC
    RBAC --> Gateway

API Key 管理

auth:
  enabled: true
  apiKeys:
    - name: "admin-key"
      key: "omlx-admin-xxxxxxxxxxxx"
      role: "admin"
      description: "Full access key for automation"
    - name: "developer-key"
      key: "omlx-dev-xxxxxxxxxxxx"
      role: "developer"
      description: "Developer team shared key"
    - name: "viewer-key"
      key: "omlx-viewer-xxxxxxxxxxxx"
      role: "viewer"
      description: "Read-only access for monitoring"

  oidc:
    enabled: false
    issuer: "https://keycloak.example.com/realms/omlx"
    clientId: "omlx-server"
    clientSecret: "xxxxxxxx"
    redirectUri: "https://omlx.example.com/oauth/callback"
    scopes: ["openid", "profile", "email"]
    groupsClaim: "groups"

RBAC 角色权限矩阵

操作	admin	developer	viewer
Generate / Chat（推理）	允许	允许	允许
Embeddings（向量化）	允许	允许	允许
Pull Model（拉取模型）	允许	允许	禁止
Push Model（推送模型）	允许	允许	禁止
Create Model（创建模型）	允许	允许	禁止
Delete Model（删除模型）	允许	允许	禁止
List Models（列出模型）	允许	允许	允许
Show Model（查看详情）	允许	允许	允许
Get Metrics（查看指标）	允许	允许	允许
Worker Management（Worker 管理）	允许	禁止	禁止
API Key Management（密钥管理）	允许	禁止	禁止
Config Management（配置管理）	允许	禁止	禁止

模型级别的访问控制（OMLX 扩展）

auth:
  modelACLs:
    - model: "qwen2.5:72b"
      allowedRoles: ["admin", "developer"]
      rateLimit: 50                    # 该模型的额外 QPS 限制
    - model: "internal-fine-tuned"
      allowedRoles: ["admin"]          # 仅管理员可访问
    - model: "public-model"
      allowedRoles: ["admin", "developer", "viewer"]
      rateLimit: 200
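The role matrix and the model ACLs combine into a two-step check, sketched below. The action names (`inference`, `model.pull`, and so on) are hypothetical labels for the rows of the matrix, and treating a model without an ACL entry as open to every role is an assumption, not documented behavior.

```python
# Actions every role may perform, per the RBAC matrix above.
READ_ACTIONS = {"inference", "embeddings", "model.list", "model.show", "metrics"}
MODEL_WRITE_ACTIONS = {"model.pull", "model.push", "model.create", "model.delete"}
ADMIN_ACTIONS = {"worker.manage", "apikey.manage", "config.manage"}

ROLE_PERMISSIONS = {
    "viewer": READ_ACTIONS,
    "developer": READ_ACTIONS | MODEL_WRITE_ACTIONS,
    "admin": READ_ACTIONS | MODEL_WRITE_ACTIONS | ADMIN_ACTIONS,
}

# Mirrors the modelACLs example above.
MODEL_ACLS = {
    "qwen2.5:72b": {"admin", "developer"},
    "internal-fine-tuned": {"admin"},
    "public-model": {"admin", "developer", "viewer"},
}


def is_allowed(role: str, action: str, model: str = None) -> bool:
    """Check the role matrix first, then the per-model ACL if a model is named."""
    if action not in ROLE_PERMISSIONS.get(role, set()):
        return False
    if model is not None and model in MODEL_ACLS:
        return role in MODEL_ACLS[model]
    return True  # assumption: models without an ACL entry are open to any permitted role
```

Unknown roles fall through to an empty permission set, so the check fails closed.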

JWT Token 验证流程

1. Client 通过 OIDC Provider 获取 JWT Token
2. 请求中携带 Authorization: Bearer <jwt_token>
3. OMLX Server 验证 JWT：
   ├── 验证签名（jwks_uri）
   ├── 验证 iss / aud / exp
   └── 提取 claims：sub, groups, email
4. 根据 groups 映射到 RBAC 角色
5. 检查请求操作权限

安全配置建议

配置项	建议值	说明
TLS	强制启用	生产环境必须加密传输，API Key 不通过明文 HTTP 发送
API Key 轮换	90 天	定期轮换 API Key，通过 Helm Values 更新
Rate Limit（未认证）	禁用	禁止未认证请求
API Key 最小化	按团队/服务分配独立 Key	便于审计和吊销
OIDC Token 有效期	< 24h	短有效期 Token + Refresh Token

23 OMLX 的日志与审计是如何实现的？

答案：

OMLX 的日志在组件层面以结构化 JSON 格式输出到 stdout，支持集成 Fluentd / Loki / Elasticsearch 等日志收集系统。审计功能记录所有管理操作（Pull / Push / Delete / Config 变更）形成审计事件流。

日志级别与类别

级别	用途	示例
DEBUG	开发调试、详细请求追踪	请求 payload、路由决策内部状态
INFO	正常运行状态	模型加载成功、Worker 心跳、请求计数
WARN	需关注但非紧急	模型卸载、显存接近上限、Worker 心跳延迟
ERROR	错误但系统可继续运行	推理失败、请求超时、单个 Worker 失联
FATAL	严重故障系统不可用	Controller Leader Election 失败、存储不可用

结构化日志格式

{
  "timestamp": "2025-06-15T10:30:00.123Z",
  "level": "INFO",
  "component": "worker",
  "worker_id": "worker-2",
  "trace_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "session_id": "sess-xyz",
  "message": "Inference completed",
  "fields": {
    "model": "qwen2.5:7b",
    "prompt_tokens": 128,
    "completion_tokens": 256,
    "duration_ms": 1250,
    "ttft_ms": 180,
    "tpot_ms": 45,
    "backend": "vllm",
    "gpu_id": 0
  }
}
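A component written in Python could emit this shape with the standard `logging` module. The sketch below covers a subset of the fields shown above; passing `component` and `fields` through `extra` is a convention chosen for this example, not a documented OMLX interface.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname,
            "component": getattr(record, "component", "unknown"),
            "message": record.getMessage(),
            "fields": getattr(record, "fields", {}),
        }
        return json.dumps(entry)


if __name__ == "__main__":
    handler = logging.StreamHandler()  # writes to stderr by default
    handler.setFormatter(JsonFormatter())
    log = logging.getLogger("omlx.worker")
    log.addHandler(handler)
    log.setLevel(logging.INFO)
    log.info("Inference completed",
             extra={"component": "worker", "fields": {"model": "qwen2.5:7b", "duration_ms": 1250}})
```

Keeping request-specific values under `fields` leaves the top-level keys stable, which simplifies parsing in the log pipeline.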

审计事件类型

事件	触发条件	审计字段
`model.pull`	拉取模型	model, registry, size_bytes, duration_ms, user
`model.push`	推送模型	model, registry, size_bytes, user
`model.delete`	删除模型	model, user
`model.create`	创建模型	model, from_model, user
`model.load`	Worker 加载模型到显存	worker, model, vram_used_mb, duration_ms
`model.unload`	Worker 卸载模型	worker, model, reason
`api_key.create`	创建 API Key	key_name, role, creator
`api_key.delete`	删除 API Key	key_name, creator
`config.change`	修改配置	changed_keys, user
`worker.join`	Worker 注册到集群	worker_id, gpu_model, vram_total, backends
`worker.leave`	Worker 离开集群	worker_id, reason

日志集成配置

logging:
  format: "json"                    # json / text
  level: "info"                     # debug / info / warn / error / fatal
  output: "stdout"                  # stdout / file
  filePath: "/var/log/omlx/"

  # 请求日志
  accessLog:
    enabled: true
    excludePaths: ["/health", "/metrics"]

  # 审计日志
  auditLog:
    enabled: true
    output: "stdout"                # 单独输出到不同位置便于收集
    filePath: "/var/log/omlx/audit/"

  # Trace 集成（OpenTelemetry）
  tracing:
    enabled: true
    endpoint: "http://tempo-distributor.tempo.svc:4317"
    protocol: "grpc"                # grpc / http
    sampleRate: 0.1

Loki 日志查询示例

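# Inference calls on one model that took longer than 5 s
# (the json parser flattens nested keys, so fields.duration_ms becomes fields_duration_ms)
{component="worker"}
  | json
  | fields_model="qwen2.5:72b"
  | fields_duration_ms > 5000

# Worker failures
{component="controller"} |= "worker.leave"

# Failed requests (assumes the server log carries a top-level status_code field)
{component="server"}
  | json
  | level="ERROR"
  | status_code >= 500

# QPS trend per model
sum by (fields_model) (
  rate({component="server"} | json | message="Inference completed" [5m])
)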

日志保留策略

日志类型	保留时长	存储位置
应用日志（DEBUG/INFO）	7 天	Loki / Elasticsearch（热存储）
应用日志（WARN/ERROR）	30 天	Loki / Elasticsearch
审计日志	365 天	S3 / MinIO 归档
请求日志	30 天	Loki

24 OMLX 的 Context Window 管理与 Truncation 机制是如何实现的？

答案：

OMLX 的 Context Window 管理涵盖模型原生上下文长度限制、请求级上下文截断策略和 KV Cache 管理三个层面。

Context Window 配置

# Modelfile 中设置
FROM qwen2.5:7b
PARAMETER num_ctx 8192           # 上下文窗口 Token 数

# API 请求中动态设置（不得超过模型原生上限）
curl http://omlx:11434/api/generate -d '{
  "model": "qwen2.5:7b",
  "prompt": "...",
  "options": {
    "num_ctx": 4096              # 此请求使用 4096 Token 上下文
  }
}'

常用模型默认 / 最大上下文长度

模型	默认 num_ctx	最大 num_ctx	备注
qwen2.5:7b	2048	131072	支持长上下文
qwen2.5:72b	2048	131072	支持长上下文
llama3.1:8b	2048	131072	支持长上下文
llama3.1:70b	2048	131072	支持长上下文
mistral:7b	2048	32768	—
deepseek-r1:7b	2048	131072	—

上下文超出时的处理策略

graph TD
    Calc["请求 Token 数 = system_prompt_tokens + messages_tokens + max_tokens"]
    Check{"检查：请求 Token 数 > num_ctx？"}
    Normal["正常推理"]
    Truncate{"触发截断策略"}
    S1["策略 1：Truncate Head（截断最早的消息）<br/>保留 System Prompt + 最近 N 条消息，丢弃较早消息<br/>适用：多轮对话（历史不重要）"]
    S2["策略 2：Truncate Middle（截断中间消息）<br/>保留 System Prompt + 最早 2 条 + 最近 N 条<br/>适用：需保留上下文起始信息"]
    S3["策略 3：Summarize（摘要压缩）<br/>对历史消息自动生成摘要替代原文<br/>适用：需保留全部上下文语义"]
    S4["策略 4：Reject（拒绝）<br/>返回错误：context length exceeded<br/>适用：严格场景"]

    Calc --> Check
    Check -->|"否"| Normal
    Check -->|"是"| Truncate
    Truncate -->|"策略 1"| S1
    Truncate -->|"策略 2"| S2
    Truncate -->|"策略 3"| S3
    Truncate -->|"策略 4"| S4

OMLX Truncation 配置

worker:
  context:
    defaultNumCtx: 4096              # 默认上下文长度
    maxNumCtx: 32768                 # 允许的最大上下文（受显存限制）
    truncation:
      strategy: "truncate-head"      # truncate-head / truncate-middle / summarize / reject
      keepSystemPrompt: true         # 始终保留 System Prompt
      minKeepMessages: 4             # 至少保留最近几条消息
      summarizationModel: "qwen2.5:7b"  # 摘要压缩使用的模型
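The `truncate-head` strategy can be sketched as follows. The whitespace tokenizer is a stand-in for the model's real tokenizer, and the function and parameter names are illustrative rather than OMLX internals.

```python
def count_tokens(text: str) -> int:
    """Stand-in tokenizer: counts whitespace-separated words, not real model tokens."""
    return len(text.split())


def truncate_head(messages: list, num_ctx: int, max_tokens: int, min_keep: int = 1) -> list:
    """Drop the oldest non-system messages until prompt + max_tokens fits in num_ctx."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    # Tokens left for conversation history after reserving output and the system prompt.
    budget = num_ctx - max_tokens - sum(count_tokens(m["content"]) for m in system)
    while len(rest) > min_keep and sum(count_tokens(m["content"]) for m in rest) > budget:
        rest.pop(0)  # discard the earliest remaining message
    return system + rest
```

If the remaining messages still exceed the budget once only `min_keep` are left, the caller would need to fall back to another strategy such as `reject`.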

KV Cache 显存占用估算

KV Cache 大小 ≈
  2 × num_layers × num_ctx × num_key_value_heads × head_dim × dtype_size

实例（llama3.1:8b, num_ctx=8192, FP16）：
  num_layers = 32
  num_key_value_heads = 8
  head_dim = 128
  dtype_size = 2 (FP16)

  KV Cache ≈ 2 × 32 × 8192 × 8 × 128 × 2
           ≈ 1,073,741,824 bytes
           ≈ 1.0 GB

实例（qwen2.5:72b, num_ctx=32768, FP16）：
  num_layers = 80
  num_key_value_heads = 8
  head_dim = 128

  KV Cache ≈ 2 × 80 × 32768 × 8 × 128 × 2
           ≈ 10,737,418,240 bytes
           ≈ 10.0 GB
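The formula translates directly into code, which makes it easy to check other context lengths. This sketch reproduces the two worked examples above; the leading factor of 2 accounts for storing both keys and values.

```python
def kv_cache_bytes(num_layers: int, num_ctx: int, num_kv_heads: int,
                   head_dim: int, dtype_size: int = 2) -> int:
    """KV cache size in bytes for one sequence of num_ctx tokens (FP16 by default)."""
    return 2 * num_layers * num_ctx * num_kv_heads * head_dim * dtype_size


if __name__ == "__main__":
    gib = 1024 ** 3
    print(kv_cache_bytes(32, 8192, 8, 128) / gib)    # llama3.1:8b at 8K context
    print(kv_cache_bytes(80, 32768, 8, 128) / gib)   # qwen2.5:72b at 32K context
```

The result is per sequence, so the total budget scales with the number of concurrent sequences unless blocks are shared through prefix caching.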

KV Cache 显存预算规划

Model	Weights VRAM (Q4_K_M)	KV cache VRAM (8K ctx, FP16)	KV cache VRAM (32K ctx, FP16)	Total VRAM (8K ctx)
qwen2.5:7b	~4.5 GB	~0.44 GB	~1.75 GB	~4.9 GB
qwen2.5:14b	~8.5 GB	~1.5 GB	~6.0 GB	~10.0 GB
qwen2.5:72b	~45 GB	~2.5 GB	~10.0 GB	~47.5 GB
llama3.1:8b	~5 GB	~1.0 GB	~4.0 GB	~6.0 GB
llama3.1:70b	~40 GB	~2.5 GB	~10.0 GB	~42.5 GB

25 OMLX 与原生 Ollama 的功能对比是什么？

答案：

OMLX 在完整兼容 Ollama API 的基础上，在架构、性能、运维和扩展性方面进行了全面增强。

核心功能对比

维度	Ollama	OMLX
架构	单体进程（Server + Runner 一体）	微服务分离（Server / Gateway / Controller / Worker）
部署模式	单节点（单机 / 单容器）	单节点 / 多节点 / K8s 集群
API 兼容	标准 Ollama API	完全兼容 Ollama API + OMLX 扩展端点
推理后端	llama.cpp（内置）	llama.cpp / vLLM / SGLang / TensorRT-LLM
模型格式	GGUF	GGUF / HuggingFace safetensors
多模型并发	受限于单进程显存管理	Worker 池化，独立管理各自显存中的模型
多 GPU	有限支持（CUDA 设备指定）	DP / TP / PP，数据并行最高吞吐
分布式推理	不支持	支持跨节点 Pipeline Parallel
负载均衡	无（单节点）	Gateway 路由 + 多种负载均衡算法
请求队列	无	Gateway 级别请求排队 + 过载保护
认证鉴权	无内置	API Key / OIDC / RBAC / Model ACL
监控指标	无内置	Prometheus + Grafana Dashboard
审计日志	无内置	结构化审计事件流
K8s 部署	手动编写 YAML	Helm Chart 一键部署 + HPA + StatefulSet
模型分发	从 Ollama Registry Pull / Push	支持任意 OCI Registry + 私有 Registry 认证
高可用	无（单点故障）	Server/Gateway 多副本，Controller 故障自动恢复，Worker 冗余
水平扩展	手动多实例（需自建负载均衡）	HPA + 动态 Worker 注册/注销
显存管理	Keep-Alive 机制	Keep-Alive + LRU 卸载 + 显存水位保护 + Worker 级别独立管理

性能对比（单机 1×A100-80G）

场景	Ollama（llama.cpp）	OMLX（vLLM backend）	提升
qwen2.5:7b, batch=1	~95 tokens/s	~100 tokens/s	+5%
qwen2.5:7b, batch=32	~280 tokens/s	~380 tokens/s	+36%
qwen2.5:72b, batch=1	~25 tokens/s	~28 tokens/s	+12%
qwen2.5:7b, TTFT	~200ms	~80ms	-60%

vLLM 后端的 Continuous Batching 和 PagedAttention 在大批量并发场景下吞吐优势显著。

场景选型建议

场景	推荐方案	原因
个人开发 / 本地测试	Ollama	安装简单、零配置、资源占用小
小团队共享（< 10 人）	Ollama + 手动负载均衡	运维成本低，能满足基本需求
企业在线推理平台	OMLX	多后端、多节点、监控、认证、高可用
GPU 资源池化	OMLX	Worker 池化管理、多模型并发、精细显存管理
K8s 原生环境	OMLX	Helm Chart 部署、HPA、Prometheus 集成
严格合规环境	OMLX	审计日志、RBAC、API Key 管理

26 OMLX 与 vLLM 的对比是什么？

答案：

OMLX 与 vLLM 并非直接竞品，而是不同定位：vLLM 是推理引擎，OMLX 是推理平台。OMLX 本身可将 vLLM 作为底层推理后端使用。

定位对比

维度	vLLM	OMLX
产品定位	高性能 LLM 推理引擎	多后端推理管理与服务化平台
核心能力	PagedAttention、Continuous Batching、高吞吐推理	API 兼容层、多节点调度、模型管理、认证鉴权
API 标准	OpenAI Compatible API	Ollama Compatible API（+ OMLX 扩展）
部署模式	单实例推理服务	多组件分布式集群
模型管理	手动加载/卸载	Pull / Push / List / Copy / Delete + OCI 分发
多后端	自身作为推理后端	可集成 vLLM / llama.cpp / SGLang / TensorRT-LLM
水平扩展	手动启动多实例 + 自建网关	Gateway 路由 + Worker 池 + HPA
多模型并发	需启动多个 vLLM 实例	单 Worker 可加载多模型，多 Worker 协同
监控鉴权	需外挂（如 LiteLLM Proxy）	内置 Prometheus + RBAC + 审计
K8s 部署	手动部署或第三方 Chart	官方 Helm Chart 一键部署

互补关系

OMLX 平台层
├── OMLX Server（Ollama API 兼容）
├── OMLX Gateway（路由 / 负载均衡）
├── OMLX Controller（调度 / 编排）
└── OMLX Worker（推理执行层）
      ├── vLLM Backend         ← 可将 vLLM 作为推理引擎
      ├── llama.cpp Backend
      ├── SGLang Backend
      └── TensorRT-LLM Backend

单独使用 vLLM：
  vLLM OpenAI Server → 需要额外搭建网关、负载均衡、认证、监控

使用 OMLX + vLLM：
  OMLX 管理运维层 + vLLM 提供推理能力 = 完整平台体验

选型建议

场景	推荐	原因
已有 OpenAI API 生态，单模型高吞吐	vLLM 直连	无需 Ollama 兼容层
现有 Ollama 客户端生态，需扩展至多节点	OMLX + vLLM Backend	保留 Ollama API + vLLM 性能
多模型管理 + 多租户平台	OMLX	内置多租户、认证、模型管理
仅需推理引擎，自建管理平台	vLLM	专注推理性能，管理面自研
K8s 全托管推理平台	OMLX	Helm Chart + 监控 + 自愈

27 OMLX 与 GPUStack 的对比是什么？

答案：

OMLX 与 GPUStack 都是开源 LLM 推理管理与服务化平台，核心差异在于 API 标准（Ollama vs OpenAI）、架构设计和管理粒度。

核心功能对比

维度	GPUStack	OMLX
API 标准	OpenAI Compatible API	Ollama Compatible API
架构	Server + Worker 两层	Server / Gateway / Controller / Worker 四层
推理后端	llama.cpp（内置），vLLM 可选	llama.cpp / vLLM / SGLang / TensorRT-LLM
模型管理	Web UI + CLI（`gpustack`）	Ollama CLI 兼容 + OCI Registry
GPU 共享	支持 GPU 共享（显存 + 算力）	GPU Worker 池化管理（非共享，独立分配）
多节点	Worker 注册 + Server 聚合	Controller 调度 + Gateway 路由
K8s 部署	Helm Chart 提供	Helm Chart 提供
监控	内置 Dashboard	Prometheus + Grafana
Modelfile	不支持	完整兼容 Ollama Modelfile
认证	API Key	API Key + OIDC + RBAC + Model ACL

架构对比

graph TD
    subgraph GS["GPUStack"]
        GSS["Server<br/>(OpenAI API)"]
        GSW["Worker<br/>(llama.cpp)"]
        GSS --> GSW
    end

    subgraph OMLX["OMLX"]
        OMLXS["Server<br/>(Ollama API)"]
        OMLXG["Gateway<br/>路由 / 负载均衡"]
        OMLXC["Controller<br/>调度 / 编排"]
        OMLXW["Worker<br/>(多后端)"]
        OMLXS --> OMLXG --> OMLXC --> OMLXW
    end

选型建议

场景	推荐	原因
已使用 Ollama，需扩展到多节点	OMLX	无缝迁移，保留 Ollama API / CLI / Modelfile
已使用 OpenAI SDK，需私有化部署	GPUStack	OpenAI Compatible API，生态集成简单
需要 GPU 共享（单卡跑多模型）	GPUStack	原生支持 GPU 显存/算力共享
需要多后端灵活切换	OMLX	vLLM/SGLang/TensorRT-LLM 多后端集成
需要完整审计 + RBAC	OMLX	内置企业级认证鉴权与审计
需要图形化 Web UI 管理	GPUStack	内置 Web UI Dashboard
多模型并发 + 高吞吐	OMLX	Worker 池 + Gateway 路由 + Dynamic Batching

28 OMLX 的性能优化机制有哪些？KV Cache / Prefix Caching / Batch Size 如何配置？

答案：

OMLX 的性能优化覆盖推理引擎层、调度层和资源管理层的多个维度，通过 KV Cache 管理、Prefix Caching、批处理调优和显存规划实现吞吐与延迟的最优平衡。

性能优化全景

层级                    优化手段                    关键参数
────────────────────────────────────────────────────────────
推理引擎层    PagedAttention / KV Cache 管理     max_num_batched_tokens
              Prefix Caching / Chunked Prefill   enable_prefix_caching
              Continuous Batching                max_num_seqs
                                                 
调度层        路由亲和性 / 会话保持               session_affinity_ttl
              请求优先级队列                      omlx_priority
              模型预加载                           precache_models
                                                 
资源管理层    显存池化 / 多 Worker 调度             vram_limit
              GPU 拓扑感知                          gpu_selector.topology
              动态 Batch Size                      max_batch_size

KV Cache 优化

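worker:
  kv_cache:
    block_size: 16                  # Tokens per KV cache block; smaller blocks lose less memory to fragmentation, larger blocks reduce bookkeeping
    gpu_memory_utilization: 0.90    # Fraction of GPU memory the engine may use (in vLLM this covers weights, activations and KV cache)
    swap_space: 4                   # CPU memory (GB) available for swapping out KV cache
    max_num_seqs: 256               # Maximum concurrent sequences (managed by PagedAttention)
    enable_prefix_caching: true     # Enable prefix caching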

Prefix Caching 工作原理

Request A: System Prompt (200 tokens) + User Q1 (50 tokens) + Assistant A1 (100 tokens)
Request B: System Prompt (200 tokens) + User Q2 (50 tokens) + Assistant A2 (100 tokens)
Request C: System Prompt (200 tokens) + User Q3 (50 tokens)

graph TD
    Prefix["相同前缀 System Prompt（200 tokens）"]
    Cache["Prefix Cache 命中 → 复用已计算的 KV Cache"]
    A["请求 A：Prefix(200t) 命中 → 仅计算 User Q1(50t) + 生成 A1"]
    B["请求 B：Prefix(200t) 命中 → 仅计算 User Q2(50t) + 生成 A2"]
    C["请求 C：Prefix(200t) 命中 → 仅计算 User Q3(50t) + 生成"]

    Prefix --> Cache
    Cache --> A
    Cache --> B
    Cache --> C

节省 Prefill 时间：

无 Prefix Cache：每个请求需计算 200 + 50 = 250 tokens Prefill
有 Prefix Cache：每个请求仅计算 50 tokens Prefill
Prefill 时间节省 ≈ 80%
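Engines that manage KV cache in fixed-size blocks can only reuse whole blocks, so the real saving is slightly below the 80% figure. The toy cache below illustrates this under stated assumptions: a block size of 16 tokens, prompts represented as lists of integer token IDs, and a hash chain that ties each block to everything before it. It is an illustration of the idea, not any engine's actual implementation.

```python
class PrefixCache:
    """Toy block-level prefix cache: a block is reusable only if all earlier blocks matched."""

    def __init__(self, block_size: int = 16):
        self.block_size = block_size
        self.known_blocks = set()

    def prefill_cost(self, tokens: list) -> int:
        """Return how many tokens still need prefill, then remember this prompt's full blocks."""
        cached_tokens = 0
        chain = 0              # running hash linking each block to its entire prefix
        still_hitting = True
        last_start = len(tokens) - self.block_size
        for start in range(0, last_start + 1, self.block_size):
            block = tuple(tokens[start:start + self.block_size])
            chain = hash((chain, block))
            if still_hitting and chain in self.known_blocks:
                cached_tokens += self.block_size
            else:
                still_hitting = False
                self.known_blocks.add(chain)
        return len(tokens) - cached_tokens


if __name__ == "__main__":
    system = list(range(200))                    # shared 200-token system prompt
    cache = PrefixCache()
    print(cache.prefill_cost(system + [1000 + i for i in range(50)]))  # first request, cold cache
    print(cache.prefill_cost(system + [2000 + i for i in range(50)]))  # second request, shared prefix
```

With a 200-token prefix and 16-token blocks, only 12 full blocks (192 tokens) are shared, so the second request still prefills 58 tokens rather than 50.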

Batch Size 调优

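Parameter	Meaning	qwen2.5:7b (suggested)	qwen2.5:72b (suggested)	Tuning guidance
`max_num_seqs`	Maximum concurrent sequences	128	64	About 0.8× the concurrency the GPU memory can sustain
`max_num_batched_tokens`	Maximum tokens batched per iteration	8192	16384	Align with the model's `max_position_embeddings`
`max_model_len`	Maximum tokens per request	8192	32768	Requests longer than this are rejected
`gpu_memory_utilization`	Upper bound on GPU memory use	0.90	0.85	Leave more headroom for the 72B model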

批处理策略

Latency vs Throughput 权衡：

低延迟模式（交互式对话）：
  max_num_seqs: 32
  max_num_batched_tokens: 4096
  → 每个 Batch 处理少量短请求，TTFT 低

高吞吐模式（批量 API）：
  max_num_seqs: 256
  max_num_batched_tokens: 16384
  → 每个 Batch 合并大量请求，GPU 利用率高，但单个请求 TTFT 可能升高

自适应模式（OMLX 扩展）：
  adaptive_batching: true
  target_ttft_ms: 300
  → OMLX 根据 TTFT 目标自动调节 batch 大小

性能优化清单

优化项	预期收益	实施难度	适用场景
启用 Prefix Caching	Prefill 延迟降低 50-80%	低（配置开关）	多轮对话 / 相同 System Prompt
调整 GPU Memory Utilization	吞吐提升 10-20%	低	显存未充分利用
启用 Chunked Prefill	TTFT 降低 30-50%	低（配置开关）	长输入场景
选择 vLLM / SGLang 后端	批量吞吐提升 30-60%	中（模型格式转换）	高并发在线推理
数据并行多 Worker	吞吐线性扩展	低（增加 Worker 副本）	读多写少
共享存储模型缓存	模型拉取延迟降低 80%	中（搭建 MinIO/CephFS）	多 Worker 频繁加载模型
RDMA 网络	跨节点推理延迟降低 50%	高（硬件 + 网络配置）	跨节点分布式推理

29 OMLX 的常见故障有哪些？如何排查？

答案：

OMLX 常见故障覆盖 Worker 失联、模型加载失败、推理超时、显存不足、Gateway 路由异常等场景，以下为典型故障的排查路径。

故障 1：Worker 离线（Unhealthy）

现象：Controller 日志中 Worker 状态变更为 unhealthy，Gateway 路由表中该 Worker 标记为不可用。

排查步骤：

检查 Worker 进程是否存活

kubectl get pods -n omlx-system -l component=worker
kubectl describe pod worker-2 -n omlx-system

检查 Worker 日志中的最后输出

kubectl logs worker-2 -n omlx-system --tail=100
# 关注：CUDA OOM / GPU Driver Error / Network Timeout

检查 GPU 驱动状态

kubectl exec worker-2 -n omlx-system -- nvidia-smi
# 关注：GPU 是否可见、驱动版本是否匹配

检查 Controller 与 Worker 的网络连通性

kubectl exec controller-0 -n omlx-system -- \
  curl -s worker-2.omlx-headless:11437/health

检查 Worker 注册状态

curl http://omlx-controller:11436/api/workers
# 查看 Worker 心跳时间是否过期

根因常见：GPU 驱动崩溃重启、NCCL 通信超时、Worker OOMKilled、节点网络分区。

故障 2：模型加载失败

现象：POST /api/generate 或 POST /api/chat 返回错误 model not found 或 failed to load model。

排查步骤：

确认模型是否已 Pull
```
curl http://omlx:11434/api/tags
```

检查 Worker 本地是否有模型文件

kubectl exec worker-0 -n omlx-system -- ls -lh /var/lib/omlx/models/

检查 Registry 连通性（Pull 时）

curl -u user:pass https://harbor.example.com/v2/_catalog

检查显存是否足够加载模型

curl http://omlx:11434/api/omlx/vram/status

检查模型格式与后端是否匹配
- GGUF 文件 + vLLM 后端 = 不兼容
- HuggingFace safetensors + llama.cpp 后端 = 不兼容

故障 3：推理超时

现象：请求返回 HTTP 504 Gateway Timeout 或客户端侧超时。

排查步骤：

检查 Gateway 日志排队状态

kubectl logs -n omlx-system -l component=gateway | grep "queue"

检查 Worker 并发请求数

# Prometheus 指标
omlx_worker_requests_in_flight{worker="worker-0"}

检查推理耗时分布

# Grafana: OMLX Dashboard → P95 Latency by Model
# 对比正常时段与异常时段的 TTFT / TPOT

检查 GPU 利用率

kubectl exec worker-0 -n omlx-system -- nvidia-smi dmon -s pucv -c 10

常见根因：

请求积压：并发超过 Worker 上限，Gateway 队列堆积
长上下文：Prompt Token 过多导致 Prefill 时间过长
GPU 降频：GPU 温度过高触发降频保护
模型抢占：Keep-Alive 到期导致模型卸载，请求等待重新加载

故障 4：显存不足（VRAM OOM）

现象：Worker 日志中出现 CUDA out of memory，请求返回 500 错误。

排查步骤：

检查显存使用

curl http://omlx:11434/api/omlx/vram/status

检查是否有多余模型占用显存

# 列出已加载模型及其显存占用
curl http://omlx:11434/api/omlx/models/loaded

检查 KV Cache 配置

# Check whether gpu_memory_utilization is set too high
# Estimated KV cache size = 2 × layers × kv_heads × head_dim × num_ctx × dtype_size

解决方案：

手动卸载闲置模型：POST /api/omlx/unload
降低 num_ctx 减少 KV Cache 预留
降低 gpu_memory_utilization 至 0.85
将部分模型迁移到其他 Worker

故障 5：Gateway 路由异常

现象：请求被路由到错误的 Worker，或请求返回 no available worker。

排查步骤：

检查 Model → Worker 映射表

curl http://omlx-controller:11436/api/routes

验证 Gateway 从 Controller 同步的路由表是否最新

kubectl logs -n omlx-system -l component=gateway | grep "route_sync"

检查 Worker 标签是否匹配

# The request specifies omlx_worker_label, but no Worker carries that label
kubectl get pods -n omlx-system -l omlx-label=gpu-a100

通用排查工具清单

工具	用途
`kubectl logs`	查看各组件日志
`kubectl describe pod`	查看 Pod 事件（OOM / CrashLoop）
`nvidia-smi`	检查 GPU 状态、驱动版本、显存使用
Prometheus + Grafana	查看指标趋势、告警
`curl /health`	检查各组件的健康检查端点
`curl /api/omlx/vram/status`	检查显存分布
`curl /api/omlx/workers`	查看 Worker 列表与状态

30 OMLX 生产环境部署最佳实践有哪些？

答案：

OMLX 生产部署最佳实践涵盖集群规划、高可用设计、资源规划、安全加固、监控告警和运维流程。

集群规划

graph TD
    subgraph LB["负载均衡层"]
        INGRESS["Ingress Controller (Nginx / Traefik) x 2<br/>+ cert-manager 自动 TLS 证书"]
    end

    subgraph CONTROL["控制平面节点"]
        S["Server Deployment (Replicas: 3)"]
        GW["Gateway Deployment (Replicas: 2-3)"]
        CTRL["Controller Deployment (Replicas: 1 + standby)"]
    end

    subgraph WORKERS["推理 Worker 节点"]
        APOOL["A100 Pool: qwen2.5:72b"]
        A10POOL["A10 Pool: qwen2.5:7b, llava:13b"]
        EPool["Embedding Pool: nomic-embed-text"]
    end

    subgraph MID["中间件与存储"]
        HARBOR["Harbor（模型 Registry）"]
        MINIO["MinIO / CephFS（共享模型缓存）"]
        PROM["Prometheus + Grafana（监控）"]
        LOKI["Loki / Elasticsearch（日志）"]
    end

    LB --> CONTROL --> WORKERS --> MID

高可用设计

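Component	HA strategy	Failure recovery
Server	Deployment with 3 replicas + HPA	The Service removes a failed replica automatically
Gateway	Deployment with 2-3 replicas + HPA	New requests are routed to healthy replicas
Controller	Leader election + 1 standby	Automatic failover within 30s of a leader failure
Worker	StatefulSet with N+1 redundancy	One spare Worker of capacity; traffic from a failed Worker is redistributed to the others
Model cache	Shared storage (MinIO/CephFS)	After a node failure, new Workers load models from shared storage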

资源规划（参考值）

模型	量化	单 Worker GPU	Worker 副本数	支持并发	日均 Token 量级
qwen2.5:7b	Q4_K_M	1×A10 (24G)	4	~200	~500M
qwen2.5:72b	Q4_K_M	2×A100-80G (TP=2)	2	~50	~100M
llama3.1:8b	Q4_K_M	1×A10 (24G)	4	~200	~500M
nomic-embed	Q4_K_M	1×T4 (16G)	2	~100	~1B
llava:13b	Q4_K_M	1×A10 (24G)	2	~50	~100M

安全加固清单

配置项	操作	优先级
TLS 终端加密	Ingress 或 Server 层启用 TLS	P0
API Key 认证	启用 `auth.enabled: true`，禁用未认证访问	P0
API Key 最小权限	按团队/服务分配独立 Key，避免共享 admin Key	P0
私有 Registry	模型仅推送到私有 Harbor，不依赖公共 Registry	P1
网络隔离	Worker 仅暴露于集群内部，Server 仅对外暴露 11434 端口	P1
RBAC 模型访问控制	为敏感模型配置 `modelACLs`	P1
Secret 管理	API Key 存储于 Kubernetes Secret，不可明文写入 values.yaml	P1
Audit Log 保留	启用审计日志并归档至对象存储	P2

监控告警建议阈值

告警	触发条件	严重级别	通知渠道
Worker 离线	`workers_healthy < workers_total` 持续 2m	Critical	PagerDuty / OnCall
错误率 > 5%	`error_rate > 0.05` 持续 5m	Critical	PagerDuty / OnCall
P95 延迟 > 10s	`p95_latency > 10` 持续 5m	Warning	飞书 / Slack
显存使用 > 95%	`vram_used / vram_total > 0.95` 持续 5m	Warning	飞书 / Slack
队列深度 > 100	`queue_length > 100` 持续 5m	Warning	飞书 / Slack
GPU 温度 > 85°C	`gpu_temp > 85` 持续 1m	Warning	飞书 / Slack

运维流程

# 1. 日常巡检
curl http://omlx:11434/api/omlx/health       # 整体健康状态
curl http://omlx:11434/api/omlx/workers      # Worker list
curl http://omlx:11434/api/omlx/vram/status  # VRAM distribution

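# 2. Bringing a model online
ollama pull qwen2.5:14b                       # pull the new model
ollama run qwen2.5:14b "Reply with OK"        # smoke test with a one-off prompt
# Add the model to the preload list in the Helm values, then apply
helm upgrade omlx omlx/omlx-stack -f values.yaml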

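# 3. Scaling out Workers
helm upgrade omlx omlx/omlx-stack \
  -f values.yaml \
  --set worker.replicas=6
# Or use kubectl scale (Workers run as a StatefulSet); also update worker.replicas in values.yaml so the chart stays the source of truth
kubectl scale statefulset omlx-worker --replicas=6 -n omlx-system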

# 4. Taking a model offline
curl -X DELETE http://omlx:11434/api/delete \
  -d '{"model": "deprecated-model"}'
# Remove it from the preload list
helm upgrade ...

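# 5. Canary release
# Deploy a new-version Worker StatefulSet (with a new version label)
# The Gateway shifts traffic to the new-version Workers step by step, by label
# After verifying there are no anomalies, switch over fully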
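
The canary step leaves open how traffic is split between old and new Workers. One common technique is a deterministic hash split, sketched below. It is a generic illustration, not OMLX's documented Gateway behavior, and the names `pick_pool` and `session_id` are hypothetical.

```python
import hashlib


def pick_pool(session_id: str, canary_percent: int) -> str:
    """Deterministically map a session to the 'canary' or 'stable' Worker pool."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in [0, 100)
    return "canary" if bucket < canary_percent else "stable"


# Raising the percentage step by step only ever moves sessions stable -> canary,
# because each session keeps the same bucket across steps.
for percent in (5, 25, 50, 100):
    print(percent, pick_pool("session-42", percent))
```

Hashing the session identifier rather than picking at random keeps a given session on one pool throughout a rollout step, which fits the session-affinity behavior described for the Gateway.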

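Backup and disaster recovery

Backup target	Approach	Frequency
Model files	Harbor registry (primary storage) + off-site replication	On every push
Custom Modelfiles	Versioned in a Git repository	On every change
Helm values	Git repository + GitOps (ArgoCD / FluxCD)	On every change
Audit logs	Archived to S3 / MinIO	Daily
Prometheus metrics	Long-term storage in Thanos / VictoriaMetrics	Continuous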
Recommended Kubernetes resource limits

# Server - set requests equal to limits (Guaranteed QoS) so the Pod is among the last to be evicted
server:
  resources:
    requests:
      cpu: "2"
      memory: "4Gi"
    limits:
      cpu: "2"
      memory: "4Gi"

# Worker - do not set a CPU limit for GPU workloads (avoids CPU throttling)
worker:
  resources:
    requests:
      cpu: "8"
      memory: "32Gi"
    limits:
      nvidia.com/gpu: 1     # limit the GPU only; no CPU limit
      memory: "32Gi"        # memory limit keeps an OOM from spilling over to the node

# Gateway - moderate overcommit (limits above requests, Burstable QoS)
gateway:
  resources:
    requests:
      cpu: "1"
      memory: "2Gi"
    limits:
      cpu: "4"
      memory: "4Gi"
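
The three profiles above land in different Kubernetes QoS classes, which determines eviction order under node pressure. The sketch below derives the class from a container's requests and limits. It is simplified to a single-container Pod and compares quantities as plain strings (so `"2"` and `"2000m"` would count as different); the helper name is illustrative.

```python
from typing import Dict, Optional

Resources = Dict[str, str]
QOS_RESOURCES = ("cpu", "memory")  # only these two decide the QoS class


def qos_class(requests: Optional[Resources], limits: Optional[Resources]) -> str:
    """Classify a single-container Pod as Guaranteed, Burstable, or BestEffort."""
    requests = dict(requests or {})
    limits = dict(limits or {})
    # Kubernetes defaults a missing request to the limit when a limit is set.
    for res in QOS_RESOURCES:
        if res in limits and res not in requests:
            requests[res] = limits[res]
    if not any(res in requests or res in limits for res in QOS_RESOURCES):
        return "BestEffort"
    guaranteed = all(
        res in limits and requests.get(res) == limits[res] for res in QOS_RESOURCES
    )
    return "Guaranteed" if guaranteed else "Burstable"


server = qos_class({"cpu": "2", "memory": "4Gi"}, {"cpu": "2", "memory": "4Gi"})
worker = qos_class({"cpu": "8", "memory": "32Gi"},
                   {"nvidia.com/gpu": "1", "memory": "32Gi"})
gateway = qos_class({"cpu": "1", "memory": "2Gi"}, {"cpu": "4", "memory": "4Gi"})
print(server, worker, gateway)
```

Note the trade-off this exposes: omitting the CPU limit on Workers avoids throttling but makes them Burstable, so they can be evicted before the Guaranteed Server Pods when a node runs short of resources.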

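Appendix: OMLX core API endpoint quick reference

Endpoint	Method	Compatibility	Description
`/api/generate`	POST	Ollama compatible	Text generation
`/api/chat`	POST	Ollama compatible	Chat completion
`/api/embeddings`	POST	Ollama compatible	Text embedding
`/api/pull`	POST	Ollama compatible	Pull a model
`/api/push`	POST	Ollama compatible	Push a model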
`/api/list`	GET	Ollama compatible	List models (upstream Ollama serves this listing from `GET /api/tags`)
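`/api/show`	POST	Ollama compatible	Show model details
`/api/copy`	POST	Ollama compatible	Copy a model
`/api/delete`	DELETE	Ollama compatible	Delete a model
`/api/tags`	GET	Ollama compatible	List local models with their tags
`/api/ps`	GET	Ollama compatible (extended)	List running models + Worker info
`/api/version`	GET	Ollama compatible	Version information
`/health`	GET	OMLX extension	Health check for all components
`/api/omlx/workers`	GET	OMLX extension	Worker list and status
`/api/omlx/vram/status`	GET	OMLX extension	VRAM usage distribution
`/api/omlx/models/loaded`	GET	OMLX extension	List loaded models
`/api/omlx/unload`	POST	OMLX extension	Manually unload a model
`/api/omlx/cache/prune`	POST	OMLX extension	Prune expired cache entries
`/api/omlx/embeddings/batch`	POST	OMLX extension	Batch embedding
`/metrics`	GET	OMLX extension	Prometheus metrics endpoint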
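
To show how the Ollama-compatible endpoints in the table are called, here is a minimal standard-library sketch that sends a non-streaming request to `/api/generate`. The request and response fields follow the upstream Ollama API. The host `omlx:11434` is taken from the commands earlier in this section, and authentication is left out because the header scheme is not specified here.

```python
import json
import urllib.request


def build_generate_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST /api/generate request."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(base_url: str, model: str, prompt: str, timeout: float = 60.0) -> str:
    """Send the request and return the generated text."""
    request = build_generate_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["response"]  # with stream=false the full text arrives in one JSON object


# Example call (requires a reachable server):
# print(generate("http://omlx:11434", "qwen2.5:7b", "Say hello in one word."))
```

Setting `"stream": False` keeps the example short. Interactive clients would normally leave streaming on and read the response line by line.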