Stained Glass Output Protection¶
Stained Glass Output Protection is a library and associated vLLM plugin for token-wise encryption of messages generated by a large language model.
Deployment¶
Docker¶
The Stained Glass Output Protection Docker image is built from the official vLLM image and includes the installed plugin and the alternative entrypoint that enables it (with the correct CLI arguments to automatically launch vLLM with prompt embeds support). This image exposes the same ports as the vLLM image.
Once the provided Docker image is acquired, it can be run with the following command:
docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=<secret>" \
    --env "SG_REGISTRY_CONNECTION_SECRET=<secret>" \
    -p 8000:8000 \
    --ipc=host \
    protopia-ai/stainedglass-inference-server:0.5.1-e6205f3 <or the tag you have> \
    --model meta-llama/Meta-Llama-3.1-8B-Instruct
The resulting vLLM server will be available at http://localhost:8000/ and will expose an OpenAI-compatible API that accepts prompt embeds.
Any CLI arguments that are valid for vLLM can be passed to the container in the docker run command, except for --enable-chunked-prefill, which is incompatible with --enable-prompt-embeds (implicitly enabled in the Output Protection plugin image).
You can also mount any volumes containing model weights. The command above mounts the local user's Hugging Face cache directory to the container's Hugging Face cache directory, which is useful for models downloaded from Hugging Face.
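For example, a hypothetical invocation that mounts a locally downloaded copy of the model weights and forwards extra vLLM CLI arguments might look like the sketch below; the /models path, the image tag placeholder, and the --tensor-parallel-size and --max-model-len flags are illustrative, not requirements of the plugin:
# Illustrative only: local weights path, image tag, and extra vLLM flags are placeholders
docker run --runtime nvidia --gpus all \
    -v /path/to/local/models:/models \
    --env "SG_REGISTRY_CONNECTION_SECRET=<secret>" \
    -p 8000:8000 \
    --ipc=host \
    protopia-ai/stainedglass-inference-server:<tag> \
    --model /models/Meta-Llama-3.1-8B-Instruct \
    --tensor-parallel-size 2 \
    --max-model-len 8192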
Docker Compose¶
Alternatively, you can launch the Stained Glass Output Protection Docker image using Docker Compose. The following example docker-compose.yml file can be used:
---
services:
  model-server:
    image: stainedglass-inference-server:${TAG:-latest} # Specify the TAG via an environment variable, or manually set it
    ports:
      - "8000:8000"
    volumes:
      - "~/.cache/huggingface:/root/.cache/huggingface"
    command: --model meta-llama/Meta-Llama-3.1-8B-Instruct
    environment:
      - HUGGING_FACE_HUB_TOKEN=<secret>
      - SG_REGISTRY_CONNECTION_SECRET=<secret>
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0']
              capabilities: [gpu]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 1m30s
      timeout: 30s
      retries: 5
      start_period: 30s
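With this file saved as docker-compose.yml, the service can be brought up as sketched below; the tag value is a placeholder for whichever image tag you have:
# <your tag> is a placeholder for the image tag you have acquired
TAG=<your tag> docker compose up -d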
Python Wheel (vLLM Server)¶
Stained Glass Output Protection is also available as a Python wheel, which can be installed in any Python (>=3.10) environment via pip or uv.
# pip
VLLM_USE_PRECOMPILED=1 pip install "stainedglass_output_protection-1.1.1-py312-none-any.whl[vllm]"
# uv
VLLM_USE_PRECOMPILED=1 uv pip install "stainedglass_output_protection-1.1.1-py312-none-any.whl[vllm]"
In either case, the vllm extra will also install the associated version of the vLLM library. Setting the VLLM_USE_PRECOMPILED=1 environment variable ensures that a pre-compiled vLLM wheel is used, which reduces installation time, but it is technically optional. The wheel filename may vary based on the Python version and platform, so you may need to adjust it accordingly.
This installs the vLLM plugin, allowing vLLM to automatically load part of the Stained Glass Output Protection library when it starts. However, you must also use the alternative entrypoint, which can be launched with the following command:
export HUGGING_FACE_HUB_TOKEN=<secret>
export SG_REGISTRY_CONNECTION_SECRET=<secret>
python -m stainedglass_output_protection.vllm.entrypoint \
    --no-enable-chunked-prefill \
    --enable-prompt-embeds \
    --model meta-llama/Meta-Llama-3.1-8B-Instruct
The resulting vLLM server will be available at http://localhost:8000/ and will expose an OpenAI-compatible API that accepts prompt embeds.
Any CLI arguments that are valid for vllm serve can be passed to this command, except for --enable-chunked-prefill, which is not compatible with --enable-prompt-embeds.
When launched this way, vLLM will automatically use all the available GPUs on the system. You can use the CUDA_VISIBLE_DEVICES environment variable to limit the GPUs that vLLM uses.
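For example, to restrict the server to the first two GPUs (the device indices are illustrative):
# Device indices are an example; adjust to your hardware
export CUDA_VISIBLE_DEVICES=0,1
python -m stainedglass_output_protection.vllm.entrypoint \
    --no-enable-chunked-prefill \
    --enable-prompt-embeds \
    --model meta-llama/Meta-Llama-3.1-8B-Instruct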
Python Wheel (Client)¶
Stained Glass Output Protection is also available as a client-only Python wheel, which can be installed in any Python (>=3.10) environment via pip or uv. Unlike the server, clients do not need to install vLLM. The client library contains utilities for generating client keys and decrypting responses from the server.
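As a sketch, assuming the same wheel file as in the server instructions, the client can be installed without the vllm extra so that vLLM itself is not pulled in:
# pip (same wheel file as the server install; omit the [vllm] extra on the client)
pip install "stainedglass_output_protection-1.1.1-py312-none-any.whl"
# uv
uv pip install "stainedglass_output_protection-1.1.1-py312-none-any.whl"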
Usage¶
Using vLLM with Stained Glass Output Protection occurs in three phases: generating client keys, submitting requests, and decrypting responses.
Generating Client Keys¶
The Output Protection plugin requires clients to send a base64-encoded X25519 public key in the request headers. Clients must generate this key themselves, and the stainedglass_output_protection library provides utilities for this.
import base64

from stainedglass_output_protection import encryption

client_private_key, client_public_key = encryption.generate_ephemeral_keypair()
headers = {"x-client-public-key": base64.b64encode(client_public_key.public_bytes_raw()).decode("utf-8")}
Submitting Requests¶
The Output Protection plugin is fully compatible with the OpenAI Python SDK. Instead of using client.completions.create or client.chat.completions.create, use client.completions.with_raw_response.create or client.chat.completions.with_raw_response.create, respectively, so that the response headers remain accessible. The server returns its public key (used for decryption) in the response headers. Here's how to do it for /v1/completions:
import openai
from openai.types.chat import chat_completion_stream_options_param as stream_options_type

openai_client = openai.AsyncOpenAI(api_key="<API_KEY>", base_url="<vllm server url>/v1", default_headers=headers)
response = await openai_client.completions.with_raw_response.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    prompt="Please tell me about the history of Rome.",
    stream_options=stream_options_type.ChatCompletionStreamOptionsParam(include_usage=True),
)
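The /v1/chat/completions endpoint is used the same way; a minimal sketch, reusing the client and headers above (the messages payload is illustrative):
# Illustrative chat request; reuses openai_client and headers from the completions example
chat_response = await openai_client.chat.completions.with_raw_response.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Please tell me about the history of Rome."}],
)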
Note
If not using the OpenAI Python SDK, make sure that the x-client-public-key header is included in the request.
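For clients outside the SDK, a minimal sketch of a raw request using the requests library (the URL and JSON payload follow the standard OpenAI completions schema and are illustrative):
import requests

# Illustrative raw request; any HTTP client works as long as the header is present
raw_response = requests.post(
    "<vllm server url>/v1/completions",
    headers=headers,  # must include the x-client-public-key header generated above
    json={
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "prompt": "Please tell me about the history of Rome.",
    },
)
# The server's public key is returned in the x-server-public-key response header,
# which is used in the next section to derive the shared AES key.
server_public_key_b64 = raw_response.headers["x-server-public-key"]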
Decrypting Responses¶
Decrypting responses requires first deriving a shared AES key using the client's private key and the server's public key, which is provided in the response headers. The stainedglass_output_protection library provides utilities for this as well.
After a shared key is derived, the text of the responses can be decrypted using the decrypt_str function. Here's how to do it for /v1/completions:
import base64

from cryptography.hazmat.primitives.asymmetric import x25519

server_public_key = x25519.X25519PublicKey.from_public_bytes(base64.b64decode(response.headers["x-server-public-key"]))
shared_key = encryption.derive_shared_aes_key(client_private_key, server_public_key)

completion = response.parse()
for choice in completion.choices:
    print(encryption.decrypt_str(choice.text, shared_aes_key=shared_key))
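The same pattern applies to chat completions; a minimal sketch, assuming a chat_response obtained via with_raw_response as sketched earlier, and assuming the plugin encrypts each choice's message.content the same way it encrypts choice.text:
# Assumption: chat responses carry encrypted text in choice.message.content
server_public_key = x25519.X25519PublicKey.from_public_bytes(base64.b64decode(chat_response.headers["x-server-public-key"]))
shared_key = encryption.derive_shared_aes_key(client_private_key, server_public_key)

chat_completion = chat_response.parse()
for choice in chat_completion.choices:
    print(encryption.decrypt_str(choice.message.content, shared_aes_key=shared_key))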
Note
If not using the OpenAI Python SDK, make sure to parse the corresponding response header (x-server-public-key) and use it to derive the shared AES key, as shown above.
Configuration¶
The Output Protection plugin can be configured via environment variables.
| Environment Variable | Description |
|---|---|
| SG_REGISTRY_CONNECTION_SECRET | The secret used for internal socket connections between processes in vLLM with Output Protection enabled. The value of this variable should be a string. |
| HUGGING_FACE_HUB_TOKEN | The Hugging Face Hub token used to authenticate with the Hugging Face Hub. vLLM requires this to download gated models from the Hugging Face Hub. |