Docker Compose

Prerequisites

Before deploying, ensure you have the following installed on your system:

  • Docker
  • Docker Compose
  • NVIDIA drivers and the NVIDIA Container Toolkit (for GPU support)

Podman

Podman and Podman Compose are not supported and are known to be incompatible with this deployment. In particular, Podman Compose does not handle NVIDIA GPU configuration in a way that works with the setup described below. This guide assumes Docker and Docker Compose; other runtimes are unsupported.

Prepare Containers

Along with the Docker Compose file, you should have been provided with two images: one for the SGT Proxy and one for the Stained Glass Inference Server (powered by vLLM). These images need to be stored in a container registry that your Docker instance can access.

Because the container images are provided as tar.gz files, you will need to load them and push them to your container registry. The first step is to use docker load to load each image into your local Docker daemon; you can then tag the image and push it to your registry.

docker load -i sgt-proxy.tar.gz
docker tag stainedglass-proxy:0.19.2-c2efa8a <your-registry>/stainedglass-proxy:0.19.2-c2efa8a
docker push <your-registry>/stainedglass-proxy:0.19.2-c2efa8a

docker load -i llm-api.tar.gz
docker tag llm-api-with-vllm:0.1.0 <your-registry>/llm-api-with-vllm:0.1.0
docker push <your-registry>/llm-api-with-vllm:0.1.0

Details may vary depending on your container registry and Docker configuration.

The SGT Proxy container has a name like stainedglass-proxy:0.19.2-c2efa8a. The name may or may not include the bundled Stained Glass Transform model name, and the tag (including the version number and commit hash) may vary.

The SGT LLM API container has a name like llm-api-with-vllm:0.1.0, where the tag (including the version number and commit hash) may vary.

Hugging Face Hub API Token

You will need a Hugging Face Hub API token to download the model weights for the Llama 3.1 8B model. For directions on how to obtain a Hugging Face Hub API token, see the Hugging Face Hub documentation for Authentication and User Access Tokens.

After obtaining your Hugging Face Hub API token, you must request access to the Llama 3.1 8B model from the Hugging Face Hub model card. Go to the model page on the Hugging Face Hub and click the "Request access" button. Once you have been granted access to the model, you can use your Hugging Face Hub API token to download the model weights.
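Optionally, if you have the huggingface_hub CLI installed, you can verify that the token is valid before deploying. This is an extra sanity check, not part of the provided deployment files:

pip install -U huggingface_hub   # provides the huggingface-cli command
export HF_TOKEN=<your-token>     # the same token used by the deployment
huggingface-cli whoami           # should print the account the token belongs to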

Setting Up the Deployment

1. Prepare the Docker Compose File

Download the provided docker-compose.yml file, and make any adjustments as necessary.
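For orientation, the sketch below shows roughly what the two services in the provided docker-compose.yml look like. It is an illustrative outline assembled from the image names, ports, and GPU settings described elsewhere in this guide, not a drop-in replacement for the file you were given:

services:
  stainedglass:
    # SGT Proxy, reachable from the host on port 8600
    image: <your-registry>/stainedglass-proxy:0.19.2-c2efa8a
    ports:
      - "8600:8600"
  vllm:
    # Stained Glass Inference Server (powered by vLLM); reachable at vllm:8000 inside the compose network
    image: <your-registry>/llm-api-with-vllm:0.1.0
    environment:
      HF_TOKEN: ${HF_TOKEN}  # Hugging Face token from the host environment or a .env file
    command: --model meta-llama/Llama-3.1-8B-Instruct
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0']
              capabilities: [gpu]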

Common Adjustments to the docker-compose.yml File:

Enable OpenTelemetry
# ... existing docker-compose ...
environment:
  # ... existing SGT environment variables ...
  # OpenTelemetry settings
  SGP_ENABLE_OPENTELEMETRY: true

  OTEL_PYTHON_FASTAPI_EXCLUDED_URLS: /docs,/v1/health,/v1/models
  OTEL_SERVICE_NAME: stainedglass-proxy
  OTEL_TRACES_EXPORTER: otlp  # other options include: console
  OTEL_METRICS_EXPORTER: none  # the current version of SGP does not emit metrics
  OTEL_EXPORTER_OTLP_ENDPOINT: http://jaeger:4318
  OTEL_EXPORTER_OTLP_PROTOCOL: http/protobuf

2. Set Up Environment Variables

The Docker Compose file is configured to use some secret environment variables from the host system, such as HF_TOKEN (see Hugging Face Hub API Token). You can set these in a .env file in the same directory as your docker-compose.yml file, export them in your shell session, pass them directly in the command line, or (not recommended) hardcode them in the docker-compose.yml file.
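For example, a minimal .env file in the same directory as docker-compose.yml could contain just the token (the value shown is a placeholder):

# .env -- keep this file out of version control
HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxx

Alternatively, export the variable in the shell session from which you run docker compose (for example, export HF_TOKEN=<your-token>).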

Other environment variables for the Stained Glass Transform Proxy can be modified directly in the docker-compose.yml file. Defaults are provided, but you can adjust them as needed.

Running the Deployment

3. Start the Services

Run the following command to start the services in detached mode:

docker compose up -d

This will:

  • Start the stainedglass service, exposing port 8600
  • Start the vllm service with GPU support, running the Llama-3.1-8B-Instruct model

4. Verify the Services

To check running containers:

docker ps

To follow logs:

docker compose logs -f

5. Accessing the Services

  • Stained Glass Transform Proxy should be available at http://localhost:8600
  • vLLM API should be available at http://vllm:8000 (only reachable from within the container network)

You can test your connection using SGT Proxy's built-in Swagger UI at the /docs endpoint: http://localhost:8600/docs.
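You can also check the proxy from the command line; the routes below come from the OpenTelemetry excluded-URL list above and the OpenAI-compatible API, so adjust them if your build differs:

curl http://localhost:8600/v1/health   # basic liveness check
curl http://localhost:8600/v1/models   # list the models served through the proxy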

Interacting with the Stained Glass Proxy API

Once you can connect to the Stained Glass Proxy service, you can interact with its REST API to perform inference (see the API Reference for more details). The REST API is OpenAI-compatible, so you can use tools such as OpenAI's client or LangChain to interact with the service. See Tutorials for examples of how to use the service.
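As a quick illustration, a chat completion request against the proxy might look like the following; the model name matches the vLLM service configured in the compose file, so adjust it if yours differs:

curl http://localhost:8600/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'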

Managing the Deployment

6. Stopping the Services

To stop and remove all containers defined in the Compose file:

docker compose down

To restart the services:

docker compose restart

7. Scaling and GPU Configuration

  • To allocate different GPUs, update device_ids in docker-compose.yml:
device_ids: ['0', '1']  # Use multiple GPUs
  • To check available GPUs:
nvidia-smi

The Stained Glass Inference Server (powered by vLLM) will automatically use the specified GPUs. Additionally, vLLM supports tensor parallelism, which you can enable by passing the --tensor-parallel-size argument in the command defined in the Docker Compose file:

    ...
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0', '1']
              capabilities: [gpu]
    command: --model meta-llama/Llama-3.1-8B-Instruct --tensor-parallel-size 2
    ...

Troubleshooting

  • Service not starting? Check logs:
docker compose logs stainedglass
docker compose logs vllm
  • HF_TOKEN authentication errors? Ensure the environment variable is set correctly, the token has the required permissions, and your account has been granted access to the Llama 3.1 8B model.
  • GPU issues? Ensure NVIDIA drivers and the NVIDIA Container Toolkit are installed correctly; a quick check is shown below.
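For GPU issues, a quick way to confirm that Docker itself can see the GPUs (independent of this deployment) is to run nvidia-smi inside a throwaway CUDA container; the image tag below is only an example:

docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi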