Docker Deployment

Updated: 2025-03-21 11:10

Image repository:

https://hub.docker.com/r/openmmlab/lmdeploy/tags

Docker Deployment

The framework can download the model automatically, but the download is prone to failing or being interrupted, which then causes the server to fail at startup.

Example: deploying the multimodal image-understanding model OpenGVLab/InternVL2-40B
https://hf-mirror.com/OpenGVLab/InternVL2-40B

First download the model OpenGVLab/InternVL2-40B into a local directory (/path/to/models):

# Pre-download the model (optional but recommended)
cd /path/to/models
export HF_ENDPOINT=https://hf-mirror.com
# --local-dir keeps the weights under /path/to/models instead of the default HF cache
huggingface-cli download OpenGVLab/InternVL2-40B --local-dir OpenGVLab/InternVL2-40B

# Pre-pull the image (optional but recommended)
docker pull openmmlab/lmdeploy:latest

# Start the container
docker run \
  --gpus all \
  --net host \
  --shm-size 16g \
  --name lmdeploy \
  -v /path/to/models:/opt/lmdeploy \
  -d openmmlab/lmdeploy:latest \
  lmdeploy serve api_server OpenGVLab/InternVL2-40B --server-port 80
  • lmdeploy serve api_server exposes an OpenAI-compatible API
  • OpenGVLab/InternVL2-40B is the model name (the model_path argument)
  • --server-port sets the listening port; because --net host is used, no separate -p port mapping is needed
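Once the container is up, the service can be exercised through its OpenAI-compatible endpoints. A minimal sketch, assuming the server is reachable at localhost:80 as configured above (the image URL is only a placeholder):

# List the served models; should return OpenGVLab/InternVL2-40B
curl http://localhost:80/v1/models

# Ask the model to describe an image via the OpenAI-style chat completions endpoint
curl http://localhost:80/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "OpenGVLab/InternVL2-40B",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe this image."},
        {"type": "image_url", "image_url": {"url": "https://example.com/demo.jpg"}}
      ]
    }]
  }'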

Parameters supported by lmdeploy serve api_server

positional arguments:

Parameter | Description
model_path | The path of a model. It could be one of the following options: i) a local directory path of a TurboMind model converted by the lmdeploy convert command, or downloaded from ii) and iii); ii) the model_id of an lmdeploy-quantized model hosted in a repo on huggingface.co, such as "internlm/internlm-chat-20b-4bit", "lmdeploy/llama2-chat-70b-4bit", etc.; iii) the model_id of a model hosted in a repo on huggingface.co, such as "internlm/internlm-chat-7b", "qwen/qwen-7b-chat", "baichuan-inc/baichuan2-7b-chat", and so on. Type: str
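To make the three model_path forms concrete, a few illustrative launch commands (the local workspace path is a placeholder; the repo ids are the ones quoted above):

# i) a local TurboMind model directory produced by `lmdeploy convert`
lmdeploy serve api_server /path/to/turbomind-workspace
# ii) an lmdeploy-quantized model repo id on huggingface.co
lmdeploy serve api_server lmdeploy/llama2-chat-70b-4bit
# iii) an ordinary model repo id on huggingface.co
lmdeploy serve api_server internlm/internlm-chat-7b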

options:

Parameter | Description
-h, --help | Show this help message and exit
--server-name SERVER_NAME | Host IP for serving. Default: 0.0.0.0. Type: str
--server-port SERVER_PORT | Server port. Default: 23333. Type: int
--allow-origins ALLOW_ORIGINS [ALLOW_ORIGINS ...] | A list of allowed origins for CORS. Default: ['*']. Type: str
--allow-credentials | Whether to allow credentials for CORS. Default: False
--allow-methods ALLOW_METHODS [ALLOW_METHODS ...] | A list of allowed HTTP methods for CORS. Default: ['*']. Type: str
--allow-headers ALLOW_HEADERS [ALLOW_HEADERS ...] | A list of allowed HTTP headers for CORS. Default: ['*']. Type: str
--backend {pytorch,turbomind} | Set the inference backend. Default: turbomind. Type: str
--log-level {CRITICAL,FATAL,ERROR,WARN,WARNING,INFO,DEBUG,NOTSET} | Set the log level. Default: ERROR. Type: str
--api-keys [API_KEYS ...] | Optional list of space-separated API keys. Default: None. Type: str
--ssl | Enable SSL. Requires OS environment variables 'SSL_KEYFILE' and 'SSL_CERTFILE'. Default: False
--model-name MODEL_NAME | The name of the served model. It can be accessed by the RESTful API /v1/models. If it is not specified, model_path will be adopted. Default: None. Type: str
--chat-template CHAT_TEMPLATE | A JSON file or string that specifies the chat template configuration. Please refer to https://lmdeploy.readthedocs.io/en/latest/advance/chat_template.html for the specification. Default: None. Type: str
--revision REVISION | The specific model version to use. It can be a branch name, a tag name, or a commit id. If unspecified, the default version is used. Type: str
--download-dir DOWNLOAD_DIR | Directory to download and load the weights; defaults to the default huggingface cache directory. Type: str
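A hedged example of combining several of these options in one launch (the served model alias, API key, and port are illustrative values, not taken from the article):

lmdeploy serve api_server OpenGVLab/InternVL2-40B \
  --server-port 80 \
  --model-name internvl2-40b \
  --api-keys sk-example-key \
  --log-level INFO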

PyTorch engine arguments:

Parameter | Description
--adapters [ADAPTERS ...] | Used to set path(s) of LoRA adapter(s). One can input key-value pairs in xxx=yyy format for multiple LoRA adapters. If there is only one adapter, one can just input the path of the adapter. Default: None. Type: str
--device {cuda,ascend} | The device type to run on. Default: cuda. Type: str
--tp TP | GPU number used in tensor parallelism. Should be 2^n. Default: 1. Type: int
--session-len SESSION_LEN | The max session length of a sequence. Default: None. Type: int
--max-batch-size MAX_BATCH_SIZE | Maximum batch size. Default: 128. Type: int
--cache-max-entry-count CACHE_MAX_ENTRY_COUNT | The percentage of free GPU memory occupied by the k/v cache, excluding weights. Default: 0.8. Type: float
--cache-block-seq-len CACHE_BLOCK_SEQ_LEN | The length of the token sequence in a k/v block. For the TurboMind engine, if the GPU compute capability is >= 8.0, it should be a multiple of 32, otherwise a multiple of 64. For the PyTorch engine, if a LoRA adapter is specified, this parameter is ignored. Default: 64. Type: int
--enable-prefix-caching | Enable caching and prefix matching. Default: False
--max-prefill-token-num MAX_PREFILL_TOKEN_NUM | The max number of tokens per iteration during prefill. Default: 8192. Type: int
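A sketch of a launch using the PyTorch engine arguments above; the tensor-parallel degree, session length, and cache ratio are arbitrary illustrative values, and whether a given model runs on the PyTorch backend depends on the lmdeploy version:

lmdeploy serve api_server OpenGVLab/InternVL2-40B \
  --backend pytorch \
  --tp 2 \
  --session-len 16384 \
  --cache-max-entry-count 0.5 \
  --enable-prefix-caching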

TurboMind engine arguments:

Parameter | Description
--tp TP | GPU number used in tensor parallelism. Should be 2^n. Default: 1. Type: int
--session-len SESSION_LEN | The max session length of a sequence. Default: None. Type: int
--max-batch-size MAX_BATCH_SIZE | Maximum batch size. Default: 128. Type: int
--cache-max-entry-count CACHE_MAX_ENTRY_COUNT | The percentage of free GPU memory occupied by the k/v cache, excluding weights. Default: 0.8. Type: float
--cache-block-seq-len CACHE_BLOCK_SEQ_LEN | The length of the token sequence in a k/v block. For the TurboMind engine, if the GPU compute capability is >= 8.0, it should be a multiple of 32, otherwise a multiple of 64. For the PyTorch engine, if a LoRA adapter is specified, this parameter is ignored. Default: 64. Type: int
--enable-prefix-caching | Enable caching and prefix matching. Default: False
--max-prefill-token-num MAX_PREFILL_TOKEN_NUM | The max number of tokens per iteration during prefill. Default: 8192. Type: int
--model-format {hf,llama,awq,gptq} | The format of the input model. hf means hf_llama, llama means meta_llama, awq represents a model quantized by AWQ, and gptq refers to a model quantized by GPTQ. Default: None. Type: str
--quant-policy {0,4,8} | Whether to quantize the kv cache. 0: no quant; 4: 4-bit kv; 8: 8-bit kv. Default: 0. Type: int
--rope-scaling-factor ROPE_SCALING_FACTOR | Rope scaling factor. Default: 0.0. Type: float
--num-tokens-per-iter NUM_TOKENS_PER_ITER | The number of tokens processed in a forward pass. Default: 0. Type: int
--max-prefill-iters MAX_PREFILL_ITERS | The max number of forward passes in the prefill stage. Default: 1. Type: int
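Correspondingly, a hedged sketch of a TurboMind launch serving an AWQ-quantized model with an 8-bit kv cache (the "-AWQ" repo id is illustrative; substitute a repo that actually ships AWQ weights):

lmdeploy serve api_server OpenGVLab/InternVL2-40B-AWQ \
  --backend turbomind \
  --model-format awq \
  --quant-policy 8 \
  --tp 2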

Vision model arguments:

Parameter | Description
--vision-max-batch-size VISION_MAX_BATCH_SIZE | The vision model batch size. Default: 1. Type: int
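When requests carry several images, the vision encoder batch size can be raised; a minimal sketch (the value 8 is arbitrary):

lmdeploy serve api_server OpenGVLab/InternVL2-40B --server-port 80 --vision-max-batch-size 8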