
Running Vision LLMs for Quick Testing

Sep 16, 2024

InternVL2

This guide provides step-by-step instructions for deploying several InternVL2 models with Docker and LMDeploy. Each section gives the Docker command and setup needed to run a particular model size, in either AWQ or Hugging Face (HF) format, with the appropriate backend.

Requirements

To use the following deployment commands, ensure that you meet the following prerequisites:

  • Docker: Install Docker on your system.
  • GPU: Ensure that you have access to a system with NVIDIA GPUs and the required CUDA drivers.
  • Hugging Face Token: Obtain a Hugging Face Hub token to authenticate your access to the models.

Before running any of the commands below, replace <your_hugging_face_hub_token> with your actual Hugging Face token and PORT with the host port you want the server exposed on.
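
As a quick sanity check before deploying (a minimal sketch; any reasonably recent Docker and NVIDIA driver setup should do), you can confirm that Docker can see your GPUs by running nvidia-smi through the same image used below:

docker --version
nvidia-smi
docker run --rm --gpus all openmmlab/lmdeploy:latest nvidia-smi

If the last command prints your GPU table, the NVIDIA container runtime is wired up correctly.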

InternVL2-2B-AWQ Model Deployment

The following command sets up and deploys the InternVL2-2B model in AWQ format using the TurboMind backend:

docker run -it --rm -d --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env HUGGING_FACE_HUB_TOKEN="<your_hugging_face_hub_token>" \
-p PORT:23333 \
--ipc=host \
openmmlab/lmdeploy:latest \
bash -c "pip install timm flash-attn && \
lmdeploy serve api_server OpenGVLab/InternVL2-2B-AWQ \
--backend turbomind \
--model-format awq \
--quant-policy 4 \
--cache-max-entry-count 0.1"
  • Model: InternVL2-2B
  • Backend: TurboMind
  • Model Format: AWQ
  • Quantization Policy: 4 (4-bit KV cache quantization)
  • Cache Max Entry Count: 0.1 (roughly 10% of GPU memory reserved for the KV cache)
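
Once the container is up (pulling the image, installing packages, and loading the model can take several minutes), a simple readiness check is to query the OpenAI-compatible endpoint exposed by lmdeploy's api_server; the sketch below assumes the server was mapped to PORT on localhost:

curl http://localhost:PORT/v1/models

A JSON response listing the served model indicates the server is ready to accept requests.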

InternVL2-4B-HF Model Deployment

For deploying the InternVL2-4B model in Hugging Face format with the PyTorch backend, use the following command:

docker run -it --rm -d --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env HUGGING_FACE_HUB_TOKEN="<your_hugging_face_hub_token>" \
-p PORT:23333 \
--ipc=host \
openmmlab/lmdeploy:latest \
bash -c "pip install timm flash-attn && \
lmdeploy serve api_server OpenGVLab/InternVL2-4B \
--backend pytorch \
--model-format hf \
--quant-policy 4 \
--cache-max-entry-count 0.1"
  • Model: InternVL2-4B
  • Backend: PyTorch
  • Model Format: Hugging Face (HF)
  • Quantization Policy: 4 (4-bit KV cache quantization)
  • Cache Max Entry Count: 0.1
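
To send a test vision request, you can POST to the OpenAI-compatible /v1/chat/completions endpoint. The sketch below is illustrative: the model name should match whatever /v1/models returns on your server, and the image URL is only a placeholder:

curl http://localhost:PORT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "OpenGVLab/InternVL2-4B",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe this image."},
        {"type": "image_url", "image_url": {"url": "https://example.com/sample.jpg"}}
      ]
    }],
    "max_tokens": 128
  }'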

InternVL2-8B-AWQ Model Deployment

The following command sets up the InternVL2-8B model in AWQ format using the TurboMind backend:

docker run -it --rm -d --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env HUGGING_FACE_HUB_TOKEN="<your_hugging_face_hub_token>" \
-p PORT:23333 \
--ipc=host \
openmmlab/lmdeploy:latest \
bash -c "pip install timm flash-attn && \
lmdeploy serve api_server OpenGVLab/InternVL2-8B-AWQ \
--backend turbomind \
--model-format awq \
--quant-policy 4 \
--cache-max-entry-count 0.1"
  • Model: InternVL2-8B
  • Backend: TurboMind
  • Model Format: AWQ
  • Quantization Policy: 4 (4-bit KV cache quantization)
  • Cache Max Entry Count: 0.1
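
Because the containers are started detached (-d), the pip install and model loading run in the background. To follow progress until the server reports it is listening on port 23333, tail the container logs (the container ID is printed by docker run):

docker ps                      # find the container ID
docker logs -f <container_id>  # watch startup progress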

InternVL2-26B-AWQ Model Deployment

To deploy the InternVL2-26B model in AWQ format using the TurboMind backend, execute the following Docker command:

docker run -it --rm -d --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env HUGGING_FACE_HUB_TOKEN="<your_hugging_face_hub_token>" \
-p PORT:23333 \
--ipc=host \
openmmlab/lmdeploy:latest \
bash -c "pip install timm flash-attn && \
lmdeploy serve api_server OpenGVLab/InternVL2-26B-AWQ \
--backend turbomind \
--model-format awq \
--quant-policy 4 \
--cache-max-entry-count 0.1"
  • Model: InternVL2-26B
  • Backend: TurboMind
  • Model Format: AWQ
  • Quantization Policy: 4 (4-bit KV cache quantization)
  • Cache Max Entry Count: 0.1
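
Even with AWQ weights, the 26B model may not fit on a single smaller GPU. If more than one GPU is available, lmdeploy's api_server accepts a --tp flag for tensor parallelism; the line below is only a sketch of how the serve command inside the container could change, assuming two GPUs:

lmdeploy serve api_server OpenGVLab/InternVL2-26B-AWQ \
--backend turbomind \
--model-format awq \
--quant-policy 4 \
--cache-max-entry-count 0.1 \
--tp 2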

Key Parameters and Concepts

  1. Backend: The inference engine used to serve the model (TurboMind or PyTorch).
  2. Model Format: The weight format, either AWQ (pre-quantized weights) or Hugging Face (HF).
  3. Quantization Policy: Controls KV cache quantization; 4 enables 4-bit (int4) KV cache quantization, 8 enables int8, and 0 disables it.
  4. Cache Max Entry Count: The fraction of GPU memory allocated to the KV cache. The value 0.1 keeps the cache footprint to roughly 10%, which is enough for quick testing (see the example after this list).
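
As an illustration of adjusting these knobs (the values below are only examples, not recommendations), the serve line could switch to int8 KV cache quantization and give the cache half of the GPU memory for heavier concurrency:

lmdeploy serve api_server OpenGVLab/InternVL2-8B-AWQ \
--backend turbomind \
--model-format awq \
--quant-policy 8 \
--cache-max-entry-count 0.5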

Each Docker command serves its model on container port 23333 with the specified backend and quantization policy; the -p flag maps that port to PORT on the host, so adjust PORT to whatever suits your system.

This guide serves as a reference for deploying different InternVL2 models using LMDeploy and Docker.