Stained Glass Output Protection¶
Stained Glass Output Protection is a library and associated vLLM plugin for token-wise encryption of messages generated by a large language model.
Deployment¶
Docker¶
The Stained Glass Output Protection Docker image is built from the official vLLM image and includes the installed plugin and the alternative entrypoint that enables it (with the correct CLI arguments to automatically launch vLLM with prompt embeds support). This image exposes the same ports as the vLLM image.
Once the provided Docker image is acquired, it can be run with the following command:
docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=<secret>" \
    --env "SG_REGISTRY_CONNECTION_SECRET=<secret>" \
    -p 8000:8000 \
    --ipc=host \
    protopia-ai/stainedglass-inference-server:0.5.1-e6205f3 <or the tag you have> \
    --model meta-llama/Meta-Llama-3.1-8B-Instruct
The resulting vLLM server will be available at http://localhost:8000/ and will expose an OpenAI-compatible API that accepts prompt embeds.
Any CLI arguments that are valid for vLLM can be passed to the container in the docker run command, except for --enable-chunked-prefill, which is incompatible with --enable-prompt-embeds (implicitly enabled in the Output Protection plugin image).
You can also mount any volumes containing model weights. The command above mounts the local user's Hugging Face cache directory to the container's Hugging Face cache directory, which is useful for models downloaded from Hugging Face.
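For example, a hypothetical invocation that mounts a locally downloaded copy of the model weights and forwards extra vLLM CLI arguments might look like the sketch below; the /models path, the image tag placeholder, and the --tensor-parallel-size and --max-model-len flags are illustrative, not requirements of the plugin:
# Illustrative only: local weights path, image tag, and extra vLLM flags are placeholders
docker run --runtime nvidia --gpus all \
    -v /path/to/local/models:/models \
    --env "SG_REGISTRY_CONNECTION_SECRET=<secret>" \
    -p 8000:8000 \
    --ipc=host \
    protopia-ai/stainedglass-inference-server:<tag> \
    --model /models/Meta-Llama-3.1-8B-Instruct \
    --tensor-parallel-size 2 \
    --max-model-len 8192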
Docker Compose¶
Alternatively, you can launch the Stained Glass Output Protection Docker image using Docker Compose. The following example docker-compose.yml file can be used:
---
services:
  model-server:
    image: stainedglass-inference-server:${TAG:-latest} # Specify the TAG via an environment variable, or manually set it
    ports:
      - "8000:8000"
    volumes:
      - "~/.cache/huggingface:/root/.cache/huggingface"
    command: --model meta-llama/Meta-Llama-3.1-8B-Instruct
    environment:
      - HUGGING_FACE_HUB_TOKEN=<secret>
      - SG_REGISTRY_CONNECTION_SECRET=<secret>
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0']
              capabilities: [gpu]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 1m30s
      timeout: 30s
      retries: 5
      start_period: 30s
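With this file saved as docker-compose.yml, the service can be brought up as sketched below; the tag value is a placeholder for whichever image tag you have:
# <your tag> is a placeholder for the image tag you have acquired
TAG=<your tag> docker compose up -d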
Python Wheel (vLLM Server)¶
Stained Glass Output Protection is also available as a Python wheel, which can be installed in any Python (>=3.10) environment via pip or uv.
# pip
VLLM_USE_PRECOMPILED=1 pip install "stainedglass_output_protection-1.1.1-py312-none-any.whl[vllm]"
# uv
VLLM_USE_PRECOMPILED=1 uv pip install "stainedglass_output_protection-1.1.1-py312-none-any.whl[vllm]"
In either case, the vllm extra will also install the associated version of the vLLM library. Setting the VLLM_USE_PRECOMPILED=1 environment variable ensures that a pre-compiled vLLM wheel is used, which reduces installation time, but it is technically optional. The wheel filename may vary based on the Python version and platform, so you may need to adjust it accordingly.
This installs the vLLM plugin, allowing vLLM to automatically load part of the Stained Glass Output Protection library when it starts. However, you must also use the alternative entrypoint, which can be launched with the following command:
export HUGGING_FACE_HUB_TOKEN=<secret>
export SG_REGISTRY_CONNECTION_SECRET=<secret>
python -m stainedglass_output_protection.vllm.entrypoint \
    --no-enable-chunked-prefill \
    --enable-prompt-embeds \
    --model meta-llama/Meta-Llama-3.1-8B-Instruct
The resulting vLLM server will be available at http://localhost:8000/ and will expose an OpenAI-compatible API that accepts prompt embeds.
Any CLI arguments that are valid for vllm serve can be passed to this command, except for --enable-chunked-prefill, which is not compatible with --enable-prompt-embeds.
When launched this way, vLLM will automatically use all the available GPUs on the system. You can use the CUDA_VISIBLE_DEVICES environment variable to limit the GPUs that vLLM uses.
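For example, to restrict the server to the first two GPUs (the device indices are illustrative):
# Device indices are an example; adjust to your hardware
export CUDA_VISIBLE_DEVICES=0,1
python -m stainedglass_output_protection.vllm.entrypoint \
    --no-enable-chunked-prefill \
    --enable-prompt-embeds \
    --model meta-llama/Meta-Llama-3.1-8B-Instruct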
Python Wheel (Client)¶
Stained Glass Output Protection is also available as a client-only Python wheel, which can be installed in any Python (>=3.10) environment via pip or uv. Unlike the server, clients do not need to install vLLM. The client library contains utilities for generating client keys and decrypting responses from the server.
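As a sketch, assuming the same wheel file as in the server instructions, the client can be installed without the vllm extra so that vLLM itself is not pulled in:
# pip (same wheel file as the server install; omit the [vllm] extra on the client)
pip install "stainedglass_output_protection-1.1.1-py312-none-any.whl"
# uv
uv pip install "stainedglass_output_protection-1.1.1-py312-none-any.whl"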
Usage¶
Using vLLM with Stained Glass Output Protection occurs in three phases: generating client keys, submitting requests, and decrypting responses.
Generating Client Keys¶
The Output Protection plugin requires clients to send a base64-encoded X25519 public key in the request headers. Clients must generate this key themselves, and the stainedglass_output_protection library provides utilities for this.
import base64

from stainedglass_output_protection import encryption

client_private_key, client_public_key = encryption.generate_ephemeral_keypair()
headers = {"x-client-public-key": base64.b64encode(client_public_key.public_bytes_raw()).decode("utf-8")}
Submitting Requests¶
The Output Protection plugin is fully compatible with the OpenAI Python SDK. Instead of using client.completions.create or client.chat.completions.create, use client.completions.with_raw_response.create or client.chat.completions.with_raw_response.create, respectively, so that the response headers remain accessible. The server returns its public key (used for decryption) in the response headers. Here's how to do it for /v1/completions:
import openai
from openai.types.chat import chat_completion_stream_options_param as stream_options_type

openai_client = openai.AsyncOpenAI(api_key="<API_KEY>", base_url="<vllm server url>/v1", default_headers=headers)
response = await openai_client.completions.with_raw_response.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    prompt="Please tell me about the history of Rome.",
    stream_options=stream_options_type.ChatCompletionStreamOptionsParam(include_usage=True),
)
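The /v1/chat/completions endpoint is used the same way; a minimal sketch, reusing the client and headers above (the messages payload is illustrative):
# Illustrative chat request; reuses openai_client and headers from the completions example
chat_response = await openai_client.chat.completions.with_raw_response.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Please tell me about the history of Rome."}],
)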
Note
If not using the OpenAI Python SDK, make sure that the x-client-public-key header is included in the request.
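For clients outside the SDK, a minimal sketch of a raw request using the requests library (the URL and JSON payload follow the standard OpenAI completions schema and are illustrative):
import requests

# Illustrative raw request; any HTTP client works as long as the header is present
raw_response = requests.post(
    "<vllm server url>/v1/completions",
    headers=headers,  # must include the x-client-public-key header generated above
    json={
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "prompt": "Please tell me about the history of Rome.",
    },
)
# The server's public key is returned in the x-server-public-key response header,
# which is used in the next section to derive the shared AES key.
server_public_key_b64 = raw_response.headers["x-server-public-key"]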
Decrypting Responses¶
Decrypting responses requires first deriving a shared AES key using the client's private key and the server's public key, which is provided in the response headers. The stainedglass_output_protection library provides utilities for this as well.
After a shared key is derived, the text of the responses can be decrypted using the decrypt_str function. Here's how to do it for /v1/completions:
import base64

from cryptography.hazmat.primitives.asymmetric import x25519

server_public_key = x25519.X25519PublicKey.from_public_bytes(base64.b64decode(response.headers["x-server-public-key"]))
shared_key = encryption.derive_shared_aes_key(client_private_key, server_public_key)

completion = response.parse()
for choice in completion.choices:
    print(encryption.decrypt_str(choice.text, shared_aes_key=shared_key))
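The same pattern applies to chat completions; a minimal sketch, assuming a chat_response obtained via with_raw_response as sketched earlier, and assuming the plugin encrypts each choice's message.content the same way it encrypts choice.text:
# Assumption: chat responses carry encrypted text in choice.message.content
server_public_key = x25519.X25519PublicKey.from_public_bytes(base64.b64decode(chat_response.headers["x-server-public-key"]))
shared_key = encryption.derive_shared_aes_key(client_private_key, server_public_key)

chat_completion = chat_response.parse()
for choice in chat_completion.choices:
    print(encryption.decrypt_str(choice.message.content, shared_aes_key=shared_key))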
Note
If not using the OpenAI Python SDK, make sure to parse the corresponding response header (x-server-public-key) and use it to derive the shared AES key, as shown above.
Configuration¶
The Output Protection plugin can be configured via environment variables.
| Environment Variable | Description |
|---|---|
| SG_REGISTRY_CONNECTION_SECRET | The secret used for internal socket connections between processes in vLLM with Output Protection enabled. The value of this variable should be a string. |
| HUGGING_FACE_HUB_TOKEN | The Hugging Face Hub token used to authenticate with the Hugging Face Hub. vLLM requires this to download gated models from the Hugging Face Hub. |