Docker Compose

Prerequisites

Before deploying, ensure you have the following installed on your system:

  • Docker Engine with the Docker Compose plugin
  • NVIDIA drivers and the NVIDIA Container Toolkit (required for GPU support)

Prepare Containers

Along with the Docker Compose file, you should have been provided with two container images: one for the SGT Proxy and one for the Stained Glass Inference Server (powered by vLLM). These images should be stored in a container registry that your Docker instance can access.

The container images are provided as tar.gz files, so you will need to load them and push them to your container registry. The first step is to use docker load to load each image into your local Docker daemon. You can then tag the image and push it to your container registry.

docker load -i sgt-proxy.tar.gz
docker tag stainedglass-proxy:0.19.2-c2efa8a <your-registry>/stainedglass-proxy:0.19.2-c2efa8a
docker push <your-registry>/stainedglass-proxy:0.19.2-c2efa8a

docker load -i llm-api.tar.gz
docker tag llm-api-with-vllm:0.1.0 <your-registry>/llm-api-with-vllm:0.1.0
docker push <your-registry>/llm-api-with-vllm:0.1.0

Details may vary depending on your container registry and Docker configuration.

The SGT Proxy container has a name like stainedglass-proxy:0.19.2-c2efa8a. The name of the bundled Stained Glass Transform model may or may not be included (and may vary). Additionally, the tag (including the version number and commit hash) may also vary.

The SGT LLM API container has a name like llm-api-with-vllm:0.1.0, where the tag (including the version number and commit hash) may vary.

Hugging Face Hub API Token

You will need a Hugging Face Hub API token to download the model weights for the Llama 3.1 8B model. For directions on how to obtain a Hugging Face Hub API token, see the Hugging Face Hub documentation for Authentication and User Access Tokens.

After obtaining your Hugging Face Hub API token, you must request access to the Llama 3.1 8B model from the Hugging Face Hub model card. Go to the model page on the Hugging Face Hub and click the "Request access" button. Once you have been granted access to the model, you can use your Hugging Face Hub API token to download the model weights.

Setting Up the Deployment

1. Prepare the Docker Compose File

Download the provided docker-compose.yml file, and make any adjustments as necessary.
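For reference, the sketch below shows what a minimal docker-compose.yml for this deployment might look like. It is illustrative only: the provided file is authoritative, and details such as service names, image references, internal ports, environment variables, and defaults may differ from what is shown here.

services:
  stainedglass:
    image: <your-registry>/stainedglass-proxy:0.19.2-c2efa8a
    ports:
      - "8600:8600"             # Stained Glass Transform Proxy, reachable from the host
    depends_on:
      - vllm

  vllm:
    # Only reachable from inside the container network, at http://vllm:8000.
    image: <your-registry>/llm-api-with-vllm:0.1.0
    environment:
      HF_TOKEN: ${HF_TOKEN}     # read from the host environment or a .env file
    command: --model meta-llama/Llama-3.1-8B-Instruct
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0']
              capabilities: [gpu]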

2. Set Up Environment Variables

The Docker Compose file is configured to use some secret environment variables from the host system, such as HF_TOKEN (see Hugging Face Hub API Token). You can set these in a .env file in the same directory as your docker-compose.yml file, export them in your shell session, pass them directly in the command line, or (not recommended) hardcode them in the docker-compose.yml file.
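For example, a minimal .env file placed next to docker-compose.yml contains just the token (the value shown is a placeholder; use your own token):

HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxx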

Other environment variables for the Stained Glass Transform Proxy can be modified in the docker-compose.yml file. Defaults are provided, but you can adjust them as needed.

Running the Deployment

3. Start the Services

Run the following command to start the services in detached mode:

docker compose up -d

This will:

  • Start the stainedglass service, exposing port 8600
  • Start the vllm service with GPU support, running the Llama-3.1-8B-Instruct model

4. Verify the Services

To check running containers:

docker ps

To follow logs:

docker compose logs -f

5. Accessing the Services

  • Stained Glass Transform Proxy should be available at http://localhost:8600
  • vLLM API should be available at http://vllm:8000 (only from within the container network)

You can test your connection using SGT Proxy's built-in Swagger UI at the /docs endpoint: http://localhost:8600/docs.
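If you prefer the command line, a quick check that the proxy is responding is to request the /docs page and confirm it returns HTTP 200:

curl -sS -o /dev/null -w "%{http_code}\n" http://localhost:8600/docs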

Interacting with the Stained Glass Proxy API

Once you can connect to the Stained Glass Proxy service, you can interact with its REST API to perform inference (see the API Reference for more details). The REST API is OpenAI-compatible, so you can use tools such as OpenAI's client or LangChain to interact with the service. See Tutorials for examples of how to use the service.
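As a minimal sketch, you can send a request with curl. This assumes the standard OpenAI-compatible /v1/chat/completions route and the model name served by vLLM; both may differ in your deployment:

curl http://localhost:8600/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'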

Managing the Deployment

6. Stopping the Services

To stop the services and remove their containers:

docker compose down

To restart the services:

docker compose restart

7. Scaling and GPU Configuration

  • To allocate different GPUs, update device_ids in docker-compose.yml:
device_ids: ['0', '1']  # Use multiple GPUs
  • To check available GPUs:
nvidia-smi

The Stained Glass Inference Server (powered by vLLM) will automatically use the specified GPUs. Additionally, vLLM supports tensor parallelism, which you can enable by passing the --tensor-parallel-size argument in the Docker Compose file:

    ...
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0', '1']
              capabilities: [gpu]
    command: --model meta-llama/Llama-3.1-8B-Instruct --tensor-parallel-size 2
    ...

Troubleshooting

  • Service not starting? Check logs:
docker compose logs stainedglass
docker compose logs vllm
  • HF_TOKEN authentication errors? Ensure the environment variable is set correctly and that your token has the required permissions and has been granted access to the Llama 3.1 model.
  • GPU issues? Ensure NVIDIA drivers and the NVIDIA Container Toolkit are installed correctly; a quick check is shown below.
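As a quick sanity check of the NVIDIA Container Toolkit, run nvidia-smi inside a throwaway container (the toolkit injects the driver utilities when --gpus is used):

docker run --rm --gpus all ubuntu nvidia-smi

If this prints the same GPU table as running nvidia-smi directly on the host, the container runtime is configured correctly for GPU access.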