Deploying Llama 3.2 - A Breeze with vLLM and Kubernetes

Hosting large language models (LLMs) like Llama 3.2 presents unique challenges, requiring specialized infrastructure and deployment strategies. This is where vLLM comes in: a cutting-edge inference and serving engine for LLMs that pairs naturally with Kubernetes, the industry-leading container orchestration platform.

This blog post is a practical guide to deploying Llama 3.2 on Google Cloud's GKE platform using vLLM, whose optimizations (such as speculative decoding) can further boost inference performance. While we focus on Google Cloud, the underlying principles and concepts can easily be adapted for deployment on Amazon Web Services (AWS) and other cloud platforms.

Sharing this knowledge is crucial, as the complexity of hosting LLMs often presents a hurdle for developers and researchers. By showcasing a step-by-step process, we aim to empower you to unleash the potential of LLMs in your own projects.

Want to deploy your Llama 3.2 model with ease? Look no further than vLLM, a launch partner for the Llama releases, and Kubernetes, a powerful container orchestration platform. vLLM offers state-of-the-art inference optimizations for LLMs, including speculative decoding for improved performance.

**New to LLMs and Llama?** Check out our blog post on Llama 3.2 - A Game Changer for Smaller, Smarter, and More Responsible LLMs.

Here's how you can deploy Llama 3.2 on Google Cloud's GKE platform using vLLM:

Install Prerequisites

If you haven't already, install the Google Cloud CLI's GKE auth plugin and kubectl using the following commands:

sudo apt-get install google-cloud-cli-gke-gcloud-auth-plugin
sudo apt-get install kubectl
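
These commands assume the Google Cloud apt repository is already configured on your machine. If you are starting from a fresh environment, you will also need the base Google Cloud CLI and an authenticated session; roughly (adjust the project ID to your own):

sudo apt-get install google-cloud-cli
gcloud auth login
gcloud config set project <YOUR GCP PROJECT>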

**We follow Google Cloud's recommendations:** this setup mirrors Google Cloud's official guide for serving open models such as Gemma on GKE with vLLM, adapted here for Llama 3.2.

Set up Environment Variables

Before creating your cluster, you need to set the following environment variables:

export PROJECT_ID=<YOUR GCP PROJECT>
export CLUSTER_NAME=<YOUR CLUSTER NAME>
export REGION=<GCP REGION>
export INSTANCE_TYPE=g2-standard-8
export NUM_NODES=1
export ZONE=${REGION}-a
export HF_TOKEN=<YOUR HUGGINGFACE TOKEN>

You can obtain a Hugging Face access token from your account settings at https://huggingface.co/settings/tokens. Note that the meta-llama models are gated, so make sure your account has been granted access to Llama 3.2.

Create Your Kubernetes Cluster

Now you can create your cluster by running the gcloud command below.

gcloud container clusters create ${CLUSTER_NAME} \
  --project=${PROJECT_ID} \
  --region=${REGION} \
  --workload-pool=${PROJECT_ID}.svc.id.goog \
  --release-channel=rapid \
  --num-nodes=${NUM_NODES}

Create Your GPU Nodes

Here we are using NVIDIA L4 GPUs on Google Cloud's G2 instances. With 24 GB of GPU memory, an L4 comfortably fits the 3B-parameter Llama 3.2 Instruct model.

gcloud container node-pools create gpupool \
  --accelerator type=nvidia-l4,count=1,gpu-driver-version=latest \
  --project=${PROJECT_ID} \
  --location=${REGION} \
  --node-locations=${ZONE} \
  --cluster=${CLUSTER_NAME} \
  --machine-type=${INSTANCE_TYPE} \
  --num-nodes=${NUM_NODES}

Set up Your Kubernetes Credentials

You can now set up your Kubernetes credentials using the following command:

gcloud container clusters get-credentials \
  ${CLUSTER_NAME} \
  --location=${REGION} \
  --project=${PROJECT_ID}
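
Create Your Hugging Face Token Secret

The deployment manifest below reads your Hugging Face token from a Kubernetes secret named hf-secret (key hf_api_token); Llama 3.2 is a gated model, so the token is required for the weight download. Create the secret from the HF_TOKEN variable you exported earlier; something like the following should do:

kubectl create secret generic hf-secret \
  --from-literal=hf_api_token=${HF_TOKEN}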

Create Your Llama 3.2-Specific Kubernetes Deployment and Service

Copy and paste the following code block into a file called vllm-llama3.2-3b-it.yaml. We’ll dive into the relevant details in the following step.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama3-2-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llama3-2-server
  template:
    metadata:
      labels:
        app: llama3-2-server
        ai.gke.io/model: llama3.2-3b-it
        ai.gke.io/inference-server: vllm
        examples.ai.gke.io/source: user-guide
    spec:
      containers:
        - name: inference-server
          image: vllm/vllm-openai:latest
          resources:
            requests:
              cpu: "2"
              memory: "10Gi"
              ephemeral-storage: "10Gi"
              nvidia.com/gpu: "1"
            limits:
              cpu: "2"
              memory: "10Gi"
              ephemeral-storage: "10Gi"
              nvidia.com/gpu: "1"
          command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
          args:
            - --model=$(MODEL_ID)
            - --tensor-parallel-size=1
            - --max_model_len=8126
            - --api-key=token-abc123
          env:
            - name: MODEL_ID
              value: meta-llama/Llama-3.2-3B-Instruct
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-secret
                  key: hf_api_token
          volumeMounts:
            - mountPath: /dev/shm
              name: dshm
      volumes:
        - name: dshm
          emptyDir:
            medium: Memory
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4
        cloud.google.com/gke-gpu-driver-version: latest
---
apiVersion: v1
kind: Service
metadata:
  name: llm-service
spec:
  selector:
    app: llama3-2-server
  type: ClusterIP
  ports:
    - protocol: TCP
      port: 8000
      targetPort: 8000

Once the configuration file is saved, apply the deployment and service. Kubernetes will then pull the vLLM image, download the Llama 3.2 model, and start the server.

kubectl apply -f vllm-llama3.2-3b-it.yaml
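
Pulling the vLLM container image and downloading the model weights can take a few minutes. To check that the pod has been scheduled onto the GPU node and is up and running, you can query its status (the label matches the one defined in the manifest above):

kubectl get pods -l app=llama3-2-server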

Configuration Details

In the Kubernetes configuration, we specify that each pod starts the vLLM server with python3 -m vllm.entrypoints.openai.api_server, which exposes vLLM's OpenAI-compatible API.

Additionally, we set the model, the tensor parallel size (the number of GPUs the model is sharded across; 1 here), the maximum context length, and a placeholder API key for the OpenAI-compatible API.

The model is defined through the environment variable MODEL_ID, which should match the Hugging Face model ID (the server downloads the model on start-up).

command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
args:
- --model=$(MODEL_ID)
- --tensor-parallel-size=1
- --max_model_len=8126
- --api-key=token-abc123
env:
	- name: MODEL_ID
	  value: meta-llama/Llama-3.2-3B-Instruct

Follow along with the server setup

After a few seconds, you can follow the model download and vLLM setup by streaming the Kubernetes logs to your local machine.

kubectl logs -f -l app=llama3-2-server

You should see logs similar to the snippet below:

INFO 10-01 08:50:57 api_server.py:164] Multiprocessing frontend to use ipc
INFO 10-01 08:50:57 api_server.py:177] Started engine process with PID 21
INFO 10-01 08:51:02 llm_engine.py:226] Initializing an LLM engine
INFO 10-01 08:51:04 model_runner.py:1014] Starting to load model
INFO 10-01 08:51:04 weight_utils.py:242] Using model weights format
Loading safetensors checkpoint shards:   0% Completed | 0/2
Loading safetensors checkpoint shards:  50% Completed | 1/2
Loading safetensors checkpoint shards: 100% Completed | 2/2
...
INFO 10-01 08:51:59 launcher.py:19] Available routes are:
INFO 10-01 08:51:59 launcher.py:27] Route: /openapi.json, Methods: GET, HEAD
...
INFO 10-01 08:51:59 launcher.py:27] Route: /v1/chat/completions, Methods: POST
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

Once the server reports that Uvicorn is running on http://0.0.0.0:8000, you can move on to the next step and connect to your new LLM endpoint.

Connect to your vLLM endpoint

Kubernetes allows you to forward ports from a service to your local machine. This is a convenient way to test your LLM setup, but it is not production-ready, so please don’t use it for production hosting or inference.

kubectl port-forward service/llm-service 8000:8000
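
With the port-forward in place, a quick sanity check is to list the models the server is serving through the OpenAI-compatible /v1/models route, using the placeholder API key from the manifest:

curl http://localhost:8000/v1/models \
  -H "Authorization: Bearer token-abc123"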

Test your Endpoint

Once you have forwarded the LLM server port to your local machine, open a new terminal and test the endpoint using the example prompt below. You’ll notice that the cURL request mirrors a request to OpenAI’s API; we only swapped out the URL (pointing to localhost, which forwards the request to our Kubernetes cluster), used our placeholder token, and updated the model ID. Voila!

PROMPT=$(cat << EOF
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a helpful AI assistant excellent in writing poems.<|eot_id|>
<|start_header_id|>user<|end_header_id|>
Please write a poem about machine learning<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
EOF
)

curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer token-abc123" \
  -d @- <<EOF
{
    "model": "meta-llama/Llama-3.2-3B-Instruct",
    "prompt": "${PROMPT}",
    "top_p": 1.0,
    "max_tokens": 256
}
EOF

You should get an API response similar to the one below:

{
  "id": "cmpl-b4d206f43c5b4e93bf525203ef78ced0",
  "object": "text_completion",
  "created": 1727798681,
  "model": "meta-llama/Llama-3.2-3B-Instruct",
  "choices": [
    {
      "index": 0,
      "text": "\n\nIn silicon halls of thought and code,\nA machine learns, a mind overimposed,\nFrom data's vast and varied sea,\nA dynamo of wisdom, yet to be.\n\nNeural networks weave and intersect,\nA tapestry of probabilities to invest,\nEach thread of knowledge, a subtle art,\nA dance of inputs, a calculated heart.\n\nThe data flows, a relentless stream,\nThrough training loops, a path to redeem,\nThe model's growth, a cumulative might,\nAs each new example illuminates the light.\n\nThe algorithms whisper, \"Learn, refine, and thrive\",\nAs layers of complexity entwine and align,\nTo recognize and classify, to compute and see,\nThe universe's vast, mysterious decree.\n\nThrough convolutional flows, a hidden strength,\nThe image is transformed, the essence at length,\nObject detection, segmentation too,\nA glimpse of truth, a virtual view.\n\nBut what of bias and error's stain?\nThe data's reflection, the algorithm's pain?\nThe need to soften, to balance and adjust,\nTo avoid chains of assumptions lost.\n\nIn this iterative, creature-like birth,\nMachine learning marbles, some forth on this earth,\nA wizardry both eerie and divine,\nWhere guesses wait, precariously aligned.\n\nYet, as the loops un",
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null,
      "prompt_logprobs": null
    }
  ],
  "usage": {
    "prompt_tokens": 31,
    "total_tokens": 287,
    "completion_tokens": 256
  }
}

Here is the poem in its full beauty:

In silicon halls of thought and code,
A machine learns, a mind overimposed,
From data's vast and varied sea,
A dynamo of wisdom, yet to be.

Neural networks weave and intersect,
A tapestry of probabilities to invest,
Each thread of knowledge, a subtle art,
A dance of inputs, a calculated heart.

The data flows, a relentless stream,
Through training loops, a path to redeem,
The model's growth, a cumulative might,
As each new example illuminates the light.

The algorithms whisper, "Learn, refine, and thrive",
As layers of complexity entwine and align,
To recognize and classify, to compute and see,
The universe's vast, mysterious decree.

Through convolutional flows, a hidden strength,
The image is transformed, the essence at length,
Object detection, segmentation too,
A glimpse of truth, a virtual view.

But what of bias and error's stain?
The data's reflection, the algorithm's pain?
The need to soften, to balance and adjust,
To avoid chains of assumptions lost.

In this iterative, creature-like birth,
Machine learning marbles, some forth on this earth,
A wizardry both eerie and divine,
Where guesses wait, precariously aligned.
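
Note that we called the raw /v1/completions endpoint and hand-crafted the Llama 3.2 prompt tokens ourselves. The startup logs also list a /v1/chat/completions route, which applies the model's chat template server-side; a request along these lines should work as well:

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer token-abc123" \
  -d '{
    "model": "meta-llama/Llama-3.2-3B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful AI assistant excellent in writing poems."},
      {"role": "user", "content": "Please write a poem about machine learning"}
    ],
    "max_tokens": 256
  }'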

Final Step: Delete Your Cluster

To keep testing costs from piling up, remove your cluster with all its nodes and instances once you are done testing, using the command below.

gcloud container clusters delete ${CLUSTER_NAME} \
  --region=${REGION} \
  --project=${PROJECT_ID}

Conclusion

This guide has demonstrated the simplicity and efficiency of deploying Llama 3.2 with vLLM on Google Cloud's GKE platform. By leveraging vLLM's optimized framework and Kubernetes' robust infrastructure, you can effortlessly set up and access your own LLM endpoint. Remember, while we focused on Google Cloud, this approach is adaptable to other cloud providers like AWS. We encourage you to explore and adapt the steps for your cloud environment.

This blog post serves as a springboard for your LLM journey. As you delve deeper, we recommend exploring vLLM's advanced features like quantization, model sharding, and multi-GPU support to enhance your LLM deployment further and unlock its full potential.

By sharing this knowledge, we hope to foster a community of LLM enthusiasts and empower you to build innovative applications that leverage the power of this transformative technology.
