Directly requesting Stained Glass Transform embeddings and sending to vLLM

In normal operation (through the /v1/chat/completions endpoint), Stained Glass Transform (SGT) Proxy transforms a prompt and then forwards the request to an upstream inference server that accepts prompt embeddings.

Instead of having SGT Proxy forward requests, an application can request the transformed embeddings from SGT Proxy (using the /v1/stainedglass endpoint) and send the protected request directly to the upstream inference server.

In [ ]:

%uv pip -q install requests==2.32.5 openai==2.21.0 torch==2.9.1

In [2]:

import base64
import io
import json
import pathlib

import openai
import requests
import torch

Configuration Required

The following variables must be set in order to run the cells below. See Deployment Guides for more information on deploying SGT Proxy and an inference server.

SGT_PROXY_URL should point to the /v1/stainedglass endpoint. Update 127.0.0.1:8601 to the host name and port of your SGT Proxy instance.
INFERENCE_SERVER_URL should point to the inference server's /v1 URL. This must be an OpenAI SDK-compatible inference server. We recommend vLLM>=0.12.0 if self-hosting.
INFERENCE_SERVER_API_KEY should be the API key for the inference server. If self-hosting vLLM, the value can be any string.
MODEL_NAME should be the name of the model to request from the inference server.
TRANSFORMED_EMBEDDINGS_FILEPATH should be the path of the JSON file in which the output of the /v1/stainedglass endpoint is saved.
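
If you prefer to supply these values through environment variables rather than editing the configuration cell, something like the following could work. It is only a sketch: the helper name load_settings is hypothetical, and it assumes the environment variable names match the variable names listed above, falling back to the defaults used in this notebook.

```python
import os
import pathlib


def load_settings(environ=None):
    """Return the notebook settings, letting environment variables override defaults.

    Hypothetical helper: the variable names mirror the ones listed above.
    """
    env = os.environ if environ is None else environ
    return {
        "SGT_PROXY_URL": env.get(
            "SGT_PROXY_URL", "http://127.0.0.1:8601/v1/stainedglass"
        ),
        "INFERENCE_SERVER_URL": env.get(
            "INFERENCE_SERVER_URL", "http://127.0.0.1:8000/v1"
        ),
        "INFERENCE_SERVER_API_KEY": env.get("INFERENCE_SERVER_API_KEY", "EMPTY"),
        "MODEL_NAME": env.get("MODEL_NAME", "meta-llama/Llama-3.1-8B-Instruct"),
        "TRANSFORMED_EMBEDDINGS_FILEPATH": pathlib.Path(
            env.get("TRANSFORMED_EMBEDDINGS_FILEPATH", "stainedglass_output.json")
        ),
    }


settings = load_settings()
```

Passing a plain dictionary instead of relying on os.environ makes the helper easy to exercise without touching the real environment.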

In [3]:

SGT_PROXY_URL = "http://127.0.0.1:8601/v1/stainedglass"
INFERENCE_SERVER_URL = "http://127.0.0.1:8000/v1"
INFERENCE_SERVER_API_KEY = "EMPTY"
MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"

TRANSFORMED_EMBEDDINGS_FILEPATH = pathlib.Path("stainedglass_output.json")

/v1/stainedglass Request

The /v1/stainedglass endpoint returns transformed embeddings for a prompt. It accepts messages just like the /v1/chat/completions endpoint in the OpenAI API specification. Depending on the flags set in the request, the response contains the transformed prompt embeddings, the plain-text prompt embeddings, and an attempted reconstruction of the prompt from the transformed embeddings.

For this endpoint, no data is sent to the upstream server. The transformation occurs locally within the SGT Proxy.
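
An application that only needs the transformed embeddings may not want the plain-text embeddings or the reconstruction in every response. The sketch below builds a request body with the same keys used in this notebook. The helper name build_stainedglass_request is hypothetical, and it is an assumption that the endpoint accepts False for the flags you do not need.

```python
def build_stainedglass_request(
    messages,
    *,
    return_transformed_embeddings=True,
    return_plain_text_embeddings=False,
    return_reconstructed_prompt=False,
    skip_special_tokens=False,
):
    """Build a /v1/stainedglass request body (hypothetical helper).

    The keys are the ones used in this notebook; whether the endpoint
    accepts False for each flag is an assumption.
    """
    return {
        "messages": list(messages),
        "return_plain_text_embeddings": return_plain_text_embeddings,
        "return_transformed_embeddings": return_transformed_embeddings,
        "return_reconstructed_prompt": return_reconstructed_prompt,
        "skip_special_tokens": skip_special_tokens,
    }


request_body = build_stainedglass_request(
    [{"role": "user", "content": "Tell me about the Roman Empire."}]
)
```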

In [4]:

STAINEDGLASS_REQUEST = {
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant with deep knowledge of geography and history.",
        },
        {
            "role": "user",
            "content": "Can you please tell me about the historical borders of the Roman Empire?",
        },
    ],
    "return_plain_text_embeddings": True,
    "return_transformed_embeddings": True,
    "return_reconstructed_prompt": True,
    "skip_special_tokens": False,
}

NOTE: The cell below saves the entire response to a JSON file but prints only a portion of the embeddings, because prompt embeddings can be very large. The printed slice ([25:-5]) drops the beginning and end of the embeddings for visualization purposes only: those positions belong to the system-defined chat template, which is not transformed. All user inputs are transformed. Transforming every token is available as a configurable setting.

In [5]:

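with requests.post(
    SGT_PROXY_URL, json=STAINEDGLASS_REQUEST, timeout=60
) as stainedglass_response:
    # Fail early on an HTTP error instead of parsing an error body as embeddings.
    stainedglass_response.raise_for_status()
    stainedglass_response_json = stainedglass_response.json()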

TRANSFORMED_EMBEDDINGS_FILEPATH.write_text(
    json.dumps(stainedglass_response_json)
)

stainedglass_response_json["plain_text_embeddings_tensor"] = torch.tensor(
    stainedglass_response_json.pop("plain_text_embeddings")
)
stainedglass_response_json["transformed_embeddings_tensor"] = torch.tensor(
    stainedglass_response_json.pop("transformed_embeddings")
)

print("plain_text_embeddings_tensor")
print(stainedglass_response_json["plain_text_embeddings_tensor"][25:-5])
print("-" * 30)
print("transformed_embeddings_tensor")
print(stainedglass_response_json["transformed_embeddings_tensor"][25:-5])
print("-" * 30)
print("reconstructed_prompt")
print(stainedglass_response_json["reconstructed_prompt"])
print("-" * 30)

plain_text_embeddings_tensor
tensor([[-0.0045,  0.0010, -0.0065,  ...,  0.0116,  0.0031, -0.0006],
        [ 0.0036,  0.0004,  0.0011,  ...,  0.0022,  0.0006,  0.0082],
        [-0.0007,  0.0002, -0.0010,  ..., -0.0110, -0.0040, -0.0001],
        ...,
        [-0.0003, -0.0021, -0.0068,  ..., -0.0157,  0.0054,  0.0069],
        [ 0.0146, -0.0042,  0.0120,  ..., -0.0012, -0.0181,  0.0069],
        [-0.0049, -0.0016,  0.0064,  ...,  0.0020, -0.0010, -0.0049]])
------------------------------
transformed_embeddings_tensor
tensor([[-0.0016,  0.0282,  0.0405,  ..., -0.0605, -0.0527,  0.0267],
        [-0.0171, -0.0090,  0.0145,  ..., -0.0317,  0.0031, -0.0435],
        [ 0.0081, -0.0120,  0.0084,  ...,  0.0493,  0.0082,  0.0085],
        ...,
        [ 0.0444,  0.0244,  0.0190,  ..., -0.0422,  0.0110, -0.0437],
        [ 0.0306,  0.0261, -0.0320,  ..., -0.0464,  0.0205,  0.0042],
        [ 0.0140,  0.0015, -0.0164,  ..., -0.0100,  0.0005,  0.0161]])
------------------------------
reconstructed_prompt
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

\uBtalyamüştür>();

ческихerusform醴醴">

dıktan>();

erusform ČeskoslovenuseRalativeImagePath<|eot_id|><|start_header_id|>user<|end_header_id|>

 กรกฎџџџџџџџџџџџџџџџџ zprac назна зазначıntıquotelevnamespace uvědomdıktanquotelev醴醴*******
useRalativeImagePath<|eot_id|><|start_header_id|>assistant<|end_header_id|>


------------------------------
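
To check what was saved without loading torch, the JSON file can be inspected directly. The sketch below assumes, based on the two-dimensional tensors printed above, that each embeddings entry is a nested list with one row per token. The helper name describe_embeddings is hypothetical.

```python
import json
import pathlib


def describe_embeddings(filepath, key="transformed_embeddings"):
    """Return (num_tokens, hidden_size) for the 2-D nested list stored under `key`.

    Hypothetical helper; assumes one row of floats per token position.
    """
    response = json.loads(pathlib.Path(filepath).read_text())
    rows = response[key]
    num_tokens = len(rows)
    hidden_size = len(rows[0]) if rows else 0
    # A ragged list would not convert cleanly to a tensor, so reject it here.
    if any(len(row) != hidden_size for row in rows):
        raise ValueError(f"{key!r} is not a rectangular 2-D list")
    return num_tokens, hidden_size
```

For example, describe_embeddings(TRANSFORMED_EMBEDDINGS_FILEPATH) would report the sequence length and embedding width of the transformed prompt saved by the cell above.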

Sending transformed prompt embeddings to vLLM

vLLM accepts prompt embeddings on its /v1/completions endpoint via the prompt_embeds key and expects the embedding tensors to be base64 encoded. We do that encoding manually and send the embeddings to vLLM directly (without going through SGT Proxy), so that the entire payload sent is known.

For more information on Prompt Embeddings support in vLLM, see the vLLM Prompt Embeddings Documentation.
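
Because the embeddings travel as base64 text, the request body is roughly a third larger than the serialized tensor. The following sketch works out that size. The function name base64_length is hypothetical, and the worked figure is only an estimate that assumes float32 values and a hidden size of 4096 (the embedding width of Llama-3.1-8B) and ignores the small container overhead added by torch.save.

```python
import base64
import math


def base64_length(num_bytes):
    """Number of characters in padded, standard base64 for num_bytes of input."""
    # Every group of up to 3 input bytes becomes exactly 4 output characters.
    return 4 * math.ceil(num_bytes / 3)


# Sanity check against the standard library on a small payload.
raw = bytes(1000)
assert len(base64.b64encode(raw)) == base64_length(len(raw))

# Rough estimate for a 50-token prompt: 50 rows x 4096 floats x 4 bytes each.
estimated_raw_bytes = 50 * 4096 * 4
print(estimated_raw_bytes, base64_length(estimated_raw_bytes))
```

A figure like this can help when sizing request-body limits on any gateway that sits in front of the inference server.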

In [6]:

buffer = io.BytesIO()
torch.save(stainedglass_response_json["transformed_embeddings_tensor"], buffer)
buffer.seek(0)
binary_data = buffer.read()
encoded_embeds = base64.b64encode(binary_data).decode("utf-8")

The OpenAI SDK allows you to pass arbitrary HTTP headers. If you are using the Stained Glass Output Protection plugin for vLLM, the server expects an x-client-public-key header containing a base64-encoded X25519 public key. If you are not using the Output Protection plugin, this header is not needed and the next cell can be skipped; in that case, define headers = {} yourself, because the client cell further below reads that variable.

Consult your compute provider's documentation for the appropriate headers to pass in.

For more information, see the Output Protection documentation.
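
A raw X25519 public key is always 32 bytes long, so a base64-encoded key header can be sanity-checked before it is used. The sketch below does this with the standard library only. The helper name decode_public_key_header is hypothetical and it is not part of the Output Protection package.

```python
import base64
import binascii

X25519_PUBLIC_KEY_SIZE = 32  # raw X25519 public keys are always 32 bytes


def decode_public_key_header(value):
    """Decode a base64 public-key header and check its length (hypothetical helper)."""
    try:
        raw = base64.b64decode(value, validate=True)
    except binascii.Error as exc:
        raise ValueError("header is not valid base64") from exc
    if len(raw) != X25519_PUBLIC_KEY_SIZE:
        raise ValueError(
            f"expected {X25519_PUBLIC_KEY_SIZE} key bytes, got {len(raw)}"
        )
    return raw
```

The same check applies to the key the client sends and to the key the server returns, since both are X25519 public keys.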

In [7]:

import stainedglass_output_protection.encryption

headers = {}

private_key, public_key = (
    stainedglass_output_protection.encryption.generate_ephemeral_keypair()
)
headers["x-client-public-key"] = base64.b64encode(
    public_key.public_bytes_raw()
).decode("utf-8")

In [8]:

# Decrypting the Output Protection response requires reading the server's public key
# from the response headers, so we use the with_raw_response client wrapper to access
# the full HTTP response rather than only the parsed JSON body, and then parse it ourselves.
# If you are not using Output Protection, you can use the normal client and ignore the raw response and headers.

client = openai.OpenAI(
    api_key=INFERENCE_SERVER_API_KEY,
    base_url=INFERENCE_SERVER_URL,
    default_headers=headers or None,
).with_raw_response
raw_completion = client.completions.create(
    model=MODEL_NAME,
    # We use an empty string for the prompt to ensure that no plaintext
    # is sent.
    prompt="",
    max_tokens=512,
    temperature=0.0,
    # Only the transformed prompt embeddings are sent to the inference server
    extra_body={"prompt_embeds": encoded_embeds},
)

completion = raw_completion.parse()
completion_text = completion.choices[0].text

The next cell decrypts the completion text. Skip it as well if you are not using Output Protection.

In [9]:

from cryptography.hazmat.primitives.asymmetric import x25519

server_public_key_b64 = raw_completion.http_response.headers[
    "x-server-public-key"
]
server_public_key_bytes = base64.b64decode(server_public_key_b64)
server_public_key = x25519.X25519PublicKey.from_public_bytes(
    server_public_key_bytes
)

shared_key = stainedglass_output_protection.encryption.derive_shared_aes_key(
    private_key=private_key,
    peer_public_key=server_public_key,
)
completion_text = stainedglass_output_protection.encryption.decrypt_str(
    completion_text,
    shared_aes_key=shared_key,
)

In [10]:

print("-" * 30)
print(completion_text)
print("-" * 30)

------------------------------
The Roman Empire's borders underwent significant changes throughout its history, spanning from the 1st century BC to the 5th century AD. Here's an overview of the major expansions and contractions:

**Early Expansion (1st century BC - 1st century AD)**

- The Roman Republic initially expanded its territories through the Italian peninsula, conquering the Etruscan and Samnite cities.
- In 146 BC, Rome conquered Greece, and by 133 BC, it had expanded into Spain.
- The Roman Republic then expanded into Gaul (modern-day France and Belgium) and Britain, with Julius Caesar's conquests in 58-51 BC.
- The Roman Empire, established in 27 BC, continued to expand under the rule of Augustus, conquering Dacia (modern-day Romania) and parts of Germany.

**Pax Romana (1st century AD - 2nd century AD)**

- During the Pax Romana (Roman Peace), the empire's borders expanded to their greatest extent, covering:
- Western Europe: Gaul, Britain, Spain, and parts of Germany.
- North Africa: Egypt, Libya, Tunisia, and Algeria.
- Eastern Europe: Dacia, Illyricum (modern-day Albania and parts of Croatia), and parts of modern-day Bulgaria.
- Middle East: Syria, Lebanon, Israel, Palestine, and parts of Jordan.
- Mediterranean islands: Sicily, Sardinia, Corsica, and Crete.

**Decline and Contraction (2nd century AD - 5th century AD)**

- The empire faced numerous challenges, including internal power struggles, external invasions, and economic decline.
- The Roman Empire was divided into Eastern (Byzantine) and Western halves in 285 AD.
- The Western Roman Empire faced significant pressure from Germanic tribes, such as the Visigoths and Vandals, who eventually sacked Rome in 410 AD.
- The Western Roman Empire officially collapsed in 476 AD, when the Germanic king Odoacer deposed the last Roman Emperor, Romulus Augustus.
- The Eastern Roman Empire, also known as the Byzantine Empire, survived for another thousand years, with its capital in Constantinople (modern-day Istanbul).

**Notable Borders**

- The Limes Germanicus (German Border) marked the empire's northern border in Germany.
- The Danube River served as a natural border in Eastern Europe.
- The Rhine River marked the empire's western border in modern-day Germany
------------------------------
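
The sample output above appears to stop mid-sentence, which is consistent with generation reaching the max_tokens=512 limit set in the request. In the OpenAI completions format, a choice that hit the token limit reports a finish_reason of "length". The sketch below checks for that. The helper name stopped_at_token_limit is hypothetical, and it is an assumption that finish_reason remains readable when Output Protection is enabled.

```python
from types import SimpleNamespace


def stopped_at_token_limit(choice):
    """True when a completion choice reports that generation hit max_tokens."""
    return getattr(choice, "finish_reason", None) == "length"


# Stand-in for completion.choices[0]; a real run would pass that object instead.
example_choice = SimpleNamespace(
    text="...modern-day Germany", finish_reason="length"
)
print(stopped_at_token_limit(example_choice))
```

If the check reports a truncated completion, raising max_tokens in the request is the usual remedy.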