# Running Stained Glass Proxy via vLLM on Modal

This guide explains how to run the Stained Glass Proxy locally via Docker Compose while pointing to a [vLLM](https://docs.vllm.ai/en/latest/) server with [Stained Glass Output Protection](https://docs.protopia.ai/output-protection/latest/) hosted on [Modal](https://modal.com/docs), a serverless compute platform that lets you run Python code in the cloud without managing infrastructure. A Python test script is included to verify that the system is functioning end-to-end.

## Prerequisites

You need Python 3.10 or higher and a Modal account. Install the Modal CLI:

```bash
pip install modal
```


Then authenticate with:

```bash
modal token new
```
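
Before deploying, it can help to confirm that the local environment matches the prerequisites. The following is a minimal sketch, not part of Modal or Stained Glass: the helper name `check_prerequisites` is hypothetical, and it only checks the Python version and whether the `modal` package is importable. It does not verify that authentication succeeded.

```python
import importlib.util
import sys


def check_prerequisites(min_version: tuple[int, int] = (3, 10)) -> dict[str, bool]:
    """Report whether the local environment meets the guide's prerequisites."""
    return {
        # The guide asks for Python 3.10 or higher.
        "python_ok": sys.version_info[:2] >= min_version,
        # find_spec returns None when a top-level package is not installed.
        "modal_installed": importlib.util.find_spec("modal") is not None,
    }


for name, ok in check_prerequisites().items():
    print(f"{name}: {'yes' if ok else 'no'}")
```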

## Deploying vLLM with Output Protection on Modal

We will deploy a vLLM server on Modal in a GPU-backed container, serving the [RedHatAI/Meta-Llama-3.1-8B-Instruct-FP8](https://huggingface.co/RedHatAI/Meta-Llama-3.1-8B-Instruct-FP8) model.

In [None]:

%%writefile vllm_modal_inference.py
from typing import Final
import modal

MODEL_NAME: Final[str] = "RedHatAI/Meta-Llama-3.1-8B-Instruct-FP8"
SERVED_MODEL_NAME: Final[str] = "meta-llama/Llama-3.1-8B-Instruct"
MODEL_REVISION: Final[str] = "12fd6884d2585dd4d020373e7f39f74507b31866" # pragma: allowlist secret
VLLM_PORT: Final[int] = 8000
N_GPU: Final[int] = 1
MINUTES: Final[int] = 60


# Pull the Docker image from AWS ECR using the registry credentials stored in a Modal secret.
container_pull_secret = modal.Secret.from_name("container-secret")
vllm_image = (
    modal.Image.from_aws_ecr(
        "**********.dkr.ecr.us-east-1.amazonaws.com/protopia/stainedglass-inference-server:1.2.1-2e7c344-obfuscated",
        secret=container_pull_secret,
    ).env({
        "HF_HUB_ENABLE_HF_TRANSFER": "1", # faster model transfers
        "SG_REGISTRY_CONNECTION_SECRET": "some-madeup-secret"
    }).run_commands("ln -s /usr/bin/python3 /usr/bin/python").entrypoint([])
)

# Configure Cache Volumes
hf_cache_vol = modal.Volume.from_name("huggingface-cache", create_if_missing=True)
vllm_cache_vol = modal.Volume.from_name("vllm-cache", create_if_missing=True)
app = modal.App("output-protected-vllm-inference")


@app.function(
    image=vllm_image,
    gpu=f"H100:{N_GPU}",
    scaledown_window=15 * MINUTES, # how long should we stay up with no requests?
    timeout=10 * MINUTES, # how long should we wait for container start?
    volumes={
        "/root/.cache/huggingface": hf_cache_vol,
        "/root/.cache/vllm": vllm_cache_vol,
    },
)
@modal.concurrent(max_inputs=32) # how many requests can one replica handle? tune carefully!
@modal.web_server(port=VLLM_PORT, startup_timeout=10 * MINUTES, requires_proxy_auth=True)
def serve():
    import subprocess

    cmd = [
        "python3", "-m", "stainedglass_output_protection.vllm.entrypoint",
        "--no-enable-chunked-prefill",
        "--enable-prompt-embeds",
        "--model", MODEL_NAME,
        "--revision", MODEL_REVISION,
        "--served-model-name", SERVED_MODEL_NAME,
        "--host", "0.0.0.0",
        "--port", str(VLLM_PORT),
    ]

    # assume multiple GPUs are for splitting up large matrix multiplications
    cmd += ["--tensor-parallel-size", str(N_GPU)]

    print("Launching vLLM with command:")
    print(" ".join(cmd))
    subprocess.Popen(" ".join(cmd), shell=True)

Overwriting vllm_modal_inference.py
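
The `serve` function joins the argument list with spaces before passing it to a shell. That works for the values used here because none of them contain spaces or shell metacharacters. As a hedged sketch, the same flags can be assembled in a small function and quoted with `shlex.join`, which keeps the command intact even if a value later contains a space. The helper name `build_vllm_command` is hypothetical; the flags are copied from the cell above.

```python
import shlex


def build_vllm_command(
    model: str, revision: str, served_model_name: str, port: int, n_gpu: int
) -> list[str]:
    """Assemble the launch command with the same flags used in serve()."""
    return [
        "python3", "-m", "stainedglass_output_protection.vllm.entrypoint",
        "--no-enable-chunked-prefill",
        "--enable-prompt-embeds",
        "--model", model,
        "--revision", revision,
        "--served-model-name", served_model_name,
        "--host", "0.0.0.0",
        "--port", str(port),
        # One tensor-parallel shard per GPU, as in the deployment above.
        "--tensor-parallel-size", str(n_gpu),
    ]


cmd = build_vllm_command(
    model="RedHatAI/Meta-Llama-3.1-8B-Instruct-FP8",
    revision="12fd6884d2585dd4d020373e7f39f74507b31866",
    served_model_name="meta-llama/Llama-3.1-8B-Instruct",
    port=8000,
    n_gpu=1,
)
# shlex.join quotes each argument so the string is safe to hand to a shell.
print(shlex.join(cmd))
```

Passing the list directly to `subprocess.Popen(cmd)` without `shell=True` would avoid the quoting question altogether; whether that suits this image's entrypoint is untested here.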


In [None]:
%%capture output --no-display
!modal deploy vllm_modal_inference.py

In [None]:
# To see the output from the above command, uncomment the line below.
# print(output.stdout) # pyright: ignore[reportUndefinedVariable]

## Test Modal Endpoints

With proxy auth enabled, you will need to create [Proxy Auth Tokens](https://modal.com/docs/guide/webhook-proxy-auth) and set them as environment variables for the cells that follow.

In [None]:
import os

os.environ["MODAL_KEY"] = "***********************"
os.environ["MODAL_SECRET"] = "***********************"
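
The test cell further down falls back to the string `"unknown"` when either variable is unset, so a missing token only surfaces later as an authentication error. A minimal sketch of a fail-fast alternative is shown here; the function name `modal_proxy_headers` is hypothetical, and the header names are the ones used elsewhere in this guide.

```python
import os
from typing import Mapping, Optional


def modal_proxy_headers(environ: Optional[Mapping[str, str]] = None) -> dict[str, str]:
    """Build the proxy auth headers, raising early if a token is missing."""
    env = os.environ if environ is None else environ
    missing = [name for name in ("MODAL_KEY", "MODAL_SECRET") if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {"Modal-Key": env["MODAL_KEY"], "Modal-Secret": env["MODAL_SECRET"]}


# Demonstration with placeholder values instead of real tokens.
example = modal_proxy_headers({"MODAL_KEY": "<key>", "MODAL_SECRET": "<secret>"})
print(sorted(example))
```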

### Test Endpoints

Note: You will need to change `LLM_URL` below to your own Modal deployment URL.
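
The example URL in this guide appears to follow the pattern `<workspace>--<app name>-<function name>.modal.run`. The sketch below rebuilds it from those parts. This pattern is inferred from the example and is an assumption: Modal may shorten long labels or add an environment suffix, so the URL shown when you deploy is the one to trust. The helper name `modal_endpoint_url` is hypothetical.

```python
def modal_endpoint_url(
    workspace: str, app_name: str, function_name: str, path: str = "/v1"
) -> str:
    """Compose a Modal web endpoint URL (pattern assumed from this guide's example)."""
    label = f"{workspace}--{app_name}-{function_name}"
    return f"https://{label}.modal.run{path}"


# Reproduces the example URL used in this guide.
print(modal_endpoint_url("protopia", "output-protected-vllm-inference", "serve"))
```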

In [None]:
from typing import Final

import openai

LLM_URL: Final[str] = (
    "https://protopia--output-protected-vllm-inference-serve.modal.run/v1"
)
API_KEY: Final[str] = "dummy_key"
SERVED_MODEL_NAME: Final[str] = "meta-llama/Llama-3.1-8B-Instruct"
MODAL_KEY: Final[str] = os.environ.get("MODAL_KEY", "unknown")
MODAL_SECRET: Final[str] = os.environ.get("MODAL_SECRET", "unknown")
HEADERS: Final[dict[str, str]] = {
    "Modal-Key": MODAL_KEY,
    "Modal-Secret": MODAL_SECRET,
}

modal_client_with_auth = openai.OpenAI(
    api_key=API_KEY, base_url=LLM_URL, default_headers=HEADERS
)
modal_client_no_auth = openai.OpenAI(api_key=API_KEY, base_url=LLM_URL)

In [None]:
# This should throw an authentication error: missing credentials for proxy authorization
modal_client_no_auth.models.list()

AuthenticationError: modal-http: missing credentials for proxy authorization

In [None]:
# Client with auth headers is successful
print(modal_client_with_auth.models.list().model_dump_json(indent=4))

{
    "data": [
        {
            "id": "meta-llama/Llama-3.1-8B-Instruct",
            "created": 1757547295,
            "object": "model",
            "owned_by": "vllm",
            "root": "RedHatAI/Meta-Llama-3.1-8B-Instruct-FP8",
            "parent": null,
            "max_model_len": 131072,
            "permission": [
                {
                    "id": "modelperm-27708c79e5fe4251ad7a566fe15cf29b",
                    "object": "model_permission",
                    "created": 1757547295,
                    "allow_create_engine": false,
                    "allow_sampling": true,
                    "allow_logprobs": true,
                    "allow_search_indices": false,
                    "allow_view": true,
                    "allow_fine_tuning": false,
                    "organization": "*",
                    "group": null,
                    "is_blocking": false
                }
            ]
        }
    ],
    "object": "list"
}


In [None]:
%%writefile docker-compose.modal.yaml
---
services:
  stainedglass:
    image: **********.dkr.ecr.us-east-1.amazonaws.com/protopia/stainedglass-proxy:1.12.1-5feefe2-obfuscated 
    environment:
      SGP_INFERENCE_SERVICE_HOST: https://protopia--output-protected-vllm-inference-serve.modal.run
      SGP_SGT_PATH: "/app/sgt_model.sgt"
      SGP_DEVICE: "cuda"
      SGP_MAX_NEW_TOKENS: 1000
      SGP_NUM_SGT_WORKERS: 1
      SGP_OUTPUT_DECRYPTION: "True"
      SGP_USE_AIOHTTP_FOR_UPSTREAM: "True"
      SGP_SGT_TORCH_DTYPE: "torch.bfloat16"
      SGP_SGT_NOISE_LAYER_ATTENTION: "flash_attention_2"
      SGP_RECONSTRUCTION_MAX_SEQUENCE_LENGTH: 2048
      SGP_RECONSTRUCTION_MAX_NUM_EMBEDDINGS: 2048
      SGP_RECONSTRUCTION_MAX_BATCH_SIZE: 512
      SGP_ALLOWED_HEADERS: "Modal-Key,Modal-Secret,x-server-public-key"
    ports:
      - "8666:8600"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['6']  # Add or remove GPU IDs to match the GPUs available on your host
              capabilities: [gpu]
    healthcheck:
      test: curl --fail http://localhost:8600/livez || exit 1
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 10s

Overwriting docker-compose.modal.yaml
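
The healthcheck in the Compose file probes `/livez` on port 8600 inside the container. Since the file maps host port 8666 to container port 8600, the same endpoint should be reachable from the host at `http://localhost:8666/livez` once the stack is running. The following sketch polls that URL until it answers; the helper names `http_ok` and `wait_until_live` are hypothetical, and the retry counts are arbitrary.

```python
import time
import urllib.error
import urllib.request
from typing import Callable

# Host port 8666 maps to the container's port 8600 in the Compose file.
LIVEZ_URL = "http://localhost:8666/livez"


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET request to `url` succeeds with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        return False


def wait_until_live(
    probe: Callable[[], bool],
    attempts: int = 10,
    delay_s: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `probe` up to `attempts` times, sleeping between failed tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay_s)
    return False


# Usage, after `docker compose ... up -d` has been run:
# wait_until_live(lambda: http_ok(LIVEZ_URL))
```

Taking the probe as a parameter keeps the retry loop independent of the HTTP call, so the same loop could also wait on the Modal endpoint with a different probe.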


In [None]:
!docker compose -f docker-compose.modal.yaml up -d

[+] Running 1/2
 ✔ Network inference-providers_default           Created                   0.1s
 ⠹ Container inference-providers-stainedglass-1  Starting                  0.3s

In [None]:
!docker compose -f docker-compose.modal.yaml logs

stainedglass-1  | INFO 09-10 23:41:09 [__init__.py:241] Automatically detected platform cuda.
stainedglass-1  | 2025-09-10 23:41:10 | uvicorn.error                            | INFO     | None                             | Started server process [1]
stainedglass-1  | 2025-09-10 23:41:10 | uvicorn.error                            | INFO     | None                             | Waiting for application startup.
stainedglass-1  | 2025-09-10 23:41:10 | stainedglass_proxy.dependencies          | INFO     | None                             | Initializing pre-run lifespan events.
stainedglass-1  | 2025-09-10 23:41:10 | stainedglass_proxy.dependencies          | INFO     | None                             | Proxy settings: inference_service_host='https://protopia--output-protected-vllm-inference-serve.modal.run' sgt_path='/app/sgt_model.sgt' min_new_tokens=None seed=None temperature=0.3 top_p=0.2 top_k=5000 repetition_penalty=1.0 upstream_keep_alive_

In [None]:
PROXY_URL: Final[str] = "http://localhost:8666/v1"

sgt_client = openai.OpenAI(
    api_key=API_KEY, base_url=PROXY_URL, default_headers=HEADERS
)

response = sgt_client.chat.completions.create(
    model=SERVED_MODEL_NAME,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"},
        {
            "role": "assistant",
            "content": "The Los Angeles Dodgers won the World Series in 2020.",
        },
        {"role": "user", "content": "Where was it played?"},
    ],
)

print(response.choices[0].message.content)

The 2020 World Series was played at Globe Life Field in Arlington, Texas
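
The Compose file lists `Modal-Key` and `Modal-Secret` in `SGP_ALLOWED_HEADERS`, which suggests the proxy forwards those headers to the Modal endpoint. That reading is an assumption based on the variable name. To make the request shape explicit, here is a sketch that builds the equivalent raw HTTP request with the standard library without sending it. The helper name `build_chat_request` is hypothetical, and the token values are placeholders.

```python
import json
import urllib.request


def build_chat_request(
    base_url: str,
    model: str,
    messages: list[dict[str, str]],
    modal_key: str,
    modal_secret: str,
    api_key: str = "dummy_key",
) -> urllib.request.Request:
    """Build (but do not send) a chat completion request aimed at the proxy."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
            # Modal proxy auth headers, passed along with the request.
            "Modal-Key": modal_key,
            "Modal-Secret": modal_secret,
        },
    )


request = build_chat_request(
    base_url="http://localhost:8666/v1",
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Where was the 2020 World Series played?"}],
    modal_key="<MODAL_KEY>",
    modal_secret="<MODAL_SECRET>",
)
print(request.get_method(), request.full_url)
# To send it: urllib.request.urlopen(request), then json.load() the response.
```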


In [None]:
!docker compose -f docker-compose.modal.yaml down

[+] Running 0/1
 ⠦ Container inference-providers-stainedglass-1  Stopping                  0.7s

## Stopping the Modal Server

To stop your Modal app:

1. List apps:
```bash
modal app list
```
2. Copy the App ID and stop the app:
```bash
modal app stop <APP_ID>
```
3. Re-run the list command to verify that it has stopped:
```bash
modal app list
```
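
If you prefer to run these steps from the notebook, the same CLI commands can be wrapped in Python. This is only a sketch: the helper names are hypothetical, the runner is injectable so the logic can be exercised without the Modal CLI, and `ap-example` is a made-up App ID.

```python
import subprocess
from typing import Callable


def modal_app_stop_command(app_id: str) -> list[str]:
    """Return the CLI invocation for stopping one app."""
    if not app_id or app_id.startswith("<"):
        raise ValueError("pass the real App ID shown by `modal app list`")
    return ["modal", "app", "stop", app_id]


def stop_modal_app(app_id: str, run: Callable[..., object] = subprocess.run) -> None:
    """Stop the app, then list apps again so the result can be checked."""
    run(modal_app_stop_command(app_id), check=True)
    run(["modal", "app", "list"], check=True)


# Dry run with a recording runner instead of the real CLI.
recorded: list[list[str]] = []
stop_modal_app("ap-example", run=lambda cmd, check: recorded.append(cmd))
print(recorded)
```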