Docker Deployment
Updated: March 21, 2025, 11:10
Image repository:
https://hub.docker.com/r/openmmlab/lmdeploy/tags
The framework can download the model automatically, but the download frequently fails or gets interrupted, which in turn makes the service fail to start.
Example: deploying the multimodal image-understanding model OpenGVLab/InternVL2-40B
https://hf-mirror.com/OpenGVLab/InternVL2-40B
First download OpenGVLab/InternVL2-40B to a directory of your choice (/path/to/models):
# Pre-download the model (optional, but recommended to avoid in-container download failures)
cd /path/to/models
export HF_ENDPOINT=https://hf-mirror.com
# --local-dir places the weights under /path/to/models/OpenGVLab/InternVL2-40B instead of the HuggingFace cache
huggingface-cli download OpenGVLab/InternVL2-40B --local-dir OpenGVLab/InternVL2-40B
# Pre-pull the image (optional)
docker pull openmmlab/lmdeploy:latest
# Start the container
docker run \
--gpus all \
--shm-size 16g \
--name lmdeploy \
-v /path/to/models:/opt/lmdeploy \
-p 80:80 \
-d openmmlab/lmdeploy:latest \
lmdeploy serve api_server OpenGVLab/InternVL2-40B --server-port 80
- lmdeploy serve api_server starts a server that exposes an OpenAI-compatible API (see the request sketch after this list)
- OpenGVLab/InternVL2-40B is the model name; thanks to the volume mount above, the pre-downloaded weights are available inside the container under /opt/lmdeploy/OpenGVLab/InternVL2-40B
- --server-port sets the listening port
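Once the container is up, the service can be checked with plain HTTP calls. The sketch below assumes the server is reachable on port 80 of the local host and uses the standard OpenAI-style endpoints; the prompt and image URL are placeholders, not values from this article.

# List the served models (the /v1/models endpoint mentioned in the options below)
curl http://localhost:80/v1/models

# Send a multimodal chat request in the OpenAI chat-completions format
curl http://localhost:80/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "OpenGVLab/InternVL2-40B",
        "messages": [{
          "role": "user",
          "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": "https://example.com/demo.jpg"}}
          ]
        }]
      }'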
Arguments supported by lmdeploy serve api_server
positional arguments:
Parameter | Description |
---|---|
model_path | The path of a model. It could be one of the following options: i) a local directory path of a TurboMind model converted by the lmdeploy convert command or downloaded from ii) and iii); ii) the model_id of an lmdeploy-quantized model hosted in a repo on huggingface.co, such as "internlm/internlm-chat-20b-4bit", "lmdeploy/llama2-chat-70b-4bit", etc.; iii) the model_id of a model hosted in a repo on huggingface.co, such as "internlm/internlm-chat-7b", "qwen/qwen-7b-chat", "baichuan-inc/baichuan2-7b-chat", and so on. Type: str |
options:
Parameter | Description |
---|---|
-h, --help | Show this help message and exit |
--server-name SERVER_NAME | Host IP for serving. Default: 0.0.0.0. Type: str |
--server-port SERVER_PORT | Server port. Default: 23333. Type: int |
--allow-origins ALLOW_ORIGINS [ALLOW_ORIGINS ...] | A list of allowed origins for CORS. Default: ['*']. Type: str |
--allow-credentials | Whether to allow credentials for CORS. Default: False |
--allow-methods ALLOW_METHODS [ALLOW_METHODS ...] | A list of allowed HTTP methods for CORS. Default: ['*']. Type: str |
--allow-headers ALLOW_HEADERS [ALLOW_HEADERS ...] | A list of allowed HTTP headers for CORS. Default: ['*']. Type: str |
--backend {pytorch,turbomind} | Set the inference backend. Default: turbomind. Type: str |
--log-level {CRITICAL,FATAL,ERROR,WARN,WARNING,INFO,DEBUG,NOTSET} | Set the log level. Default: ERROR. Type: str |
--api-keys [API_KEYS ...] | Optional list of space-separated API keys. Default: None. Type: str |
--ssl | Enable SSL. Requires the OS environment variables 'SSL_KEYFILE' and 'SSL_CERTFILE'. Default: False |
--model-name MODEL_NAME | The name of the served model. It can be accessed via the RESTful API /v1/models. If it is not specified, model_path will be adopted. Default: None. Type: str |
--chat-template CHAT_TEMPLATE | A JSON file or string that specifies the chat template configuration. Please refer to https://lmdeploy.readthedocs.io/en/latest/advance/chat_template.html for the specification. Default: None. Type: str |
--revision REVISION | The specific model version to use. It can be a branch name, a tag name, or a commit id. If unspecified, the default version will be used. Type: str |
--download-dir DOWNLOAD_DIR | Directory to download and load the weights. Defaults to the default cache directory of huggingface. Type: str |
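As a rough sketch of how several of these server options combine, the command below serves the model under a shorter public name, protects it with an API key, and raises the log verbosity; the alias and key are made-up placeholders rather than values from this article.

# Hypothetical example: alias the model, require an API key, log at INFO level
lmdeploy serve api_server OpenGVLab/InternVL2-40B \
  --server-name 0.0.0.0 \
  --server-port 23333 \
  --model-name internvl2-40b \
  --api-keys sk-example-key \
  --log-level INFO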
PyTorch engine arguments:
Parameter | Description |
---|---|
--adapters [ADAPTERS ...] | Used to set path(s) of LoRA adapter(s). One can input key-value pairs in xxx=yyy format for multiple LoRA adapters. If there is only one adapter, one can input just the path of the adapter. Default: None. Type: str |
--device {cuda,ascend} | The device type for running. Default: cuda. Type: str |
--tp TP | GPU number used in tensor parallelism. Should be 2^n. Default: 1. Type: int |
--session-len SESSION_LEN | The max session length of a sequence. Default: None. Type: int |
--max-batch-size MAX_BATCH_SIZE | Maximum batch size. Default: 128. Type: int |
--cache-max-entry-count CACHE_MAX_ENTRY_COUNT | The percentage of free GPU memory occupied by the k/v cache, excluding weights. Default: 0.8. Type: float |
--cache-block-seq-len CACHE_BLOCK_SEQ_LEN | The length of the token sequence in a k/v block. For the TurboMind engine, if the GPU compute capability is >= 8.0, it should be a multiple of 32; otherwise it should be a multiple of 64. For the PyTorch engine, if a LoRA adapter is specified, this parameter is ignored. Default: 64. Type: int |
--enable-prefix-caching | Enable cache and match prefix. Default: False |
--max-prefill-token-num MAX_PREFILL_TOKEN_NUM | The max number of tokens per iteration during prefill. Default: 8192. Type: int |
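For the PyTorch engine specifically, a hedged example of attaching a LoRA adapter with the key=value form described above might look like the following; the base-model path, adapter name, and adapter path are all hypothetical.

# Hypothetical example: PyTorch backend with one LoRA adapter
lmdeploy serve api_server /path/to/base-model \
  --backend pytorch \
  --adapters mylora=/path/to/lora-adapter \
  --max-batch-size 64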
TurboMind engine arguments:
Parameter | Description |
---|---|
--tp TP | GPU number used in tensor parallelism. Should be 2^n. Default: 1. Type: int |
--session-len SESSION_LEN | The max session length of a sequence. Default: None. Type: int |
--max-batch-size MAX_BATCH_SIZE | Maximum batch size. Default: 128. Type: int |
--cache-max-entry-count CACHE_MAX_ENTRY_COUNT | The percentage of free GPU memory occupied by the k/v cache, excluding weights. Default: 0.8. Type: float |
--cache-block-seq-len CACHE_BLOCK_SEQ_LEN | The length of the token sequence in a k/v block. For the TurboMind engine, if the GPU compute capability is >= 8.0, it should be a multiple of 32; otherwise it should be a multiple of 64. For the PyTorch engine, if a LoRA adapter is specified, this parameter is ignored. Default: 64. Type: int |
--enable-prefix-caching | Enable cache and match prefix. Default: False |
--max-prefill-token-num MAX_PREFILL_TOKEN_NUM | The max number of tokens per iteration during prefill. Default: 8192. Type: int |
--model-format {hf,llama,awq,gptq} | The format of the input model. hf means hf_llama, llama means meta_llama, awq represents a model quantized by AWQ, and gptq refers to a model quantized by GPTQ. Default: None. Type: str |
--quant-policy {0,4,8} | Whether to quantize the k/v cache. 0: no quantization; 4: 4-bit kv; 8: 8-bit kv. Default: 0. Type: int |
--rope-scaling-factor ROPE_SCALING_FACTOR | RoPE scaling factor. Default: 0.0. Type: float |
--num-tokens-per-iter NUM_TOKENS_PER_ITER | The number of tokens processed in a forward pass. Default: 0. Type: int |
--max-prefill-iters MAX_PREFILL_ITERS | The max number of forward passes in the prefill stage. Default: 1. Type: int |
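As an illustration of the TurboMind engine options, the sketch below spreads the model across two GPUs, enables 8-bit k/v cache quantization, and reserves less GPU memory for the cache; the numbers are starting points to adjust for the actual hardware, not recommendations from this article.

# Illustrative TurboMind tuning: 2-way tensor parallelism, 8-bit k/v cache
lmdeploy serve api_server OpenGVLab/InternVL2-40B \
  --backend turbomind \
  --tp 2 \
  --quant-policy 8 \
  --cache-max-entry-count 0.5 \
  --session-len 8192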
Vision model arguments:
Parameter | Description |
---|---|
--vision-max-batch-size VISION_MAX_BATCH_SIZE | The vision model batch size. Default: 1. Type: int |
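If image preprocessing becomes the bottleneck for multimodal requests, the vision batch size can be raised; the value below is only an illustrative guess.

# Illustrative: allow the vision model to process up to 4 images per batch
lmdeploy serve api_server OpenGVLab/InternVL2-40B --vision-max-batch-size 4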