Manually requesting Stained Glass Transform embeddings and sending them to vLLM¶
Stained Glass Transform Proxy's normal operation (when using the /v1/completions
and /v1/chat/completions
endpoints) transforms a prompt and then forwards the request to an upstream inference server that accepts prompt embeddings.
Instead of using SGT Proxy to forward requests, an application can request the transformed embeddings from SGT Proxy, and directly send the protected request to the upstream inference server.
%pip install requests==2.32.4 openai==1.97.0 torch==2.7.1
import base64
import io
import json
import pathlib
import openai
import requests
import torch
Configuration¶
- `SGT_PROXY_URL` should point to the `/v1/stainedglass` endpoint. Update `localhost:8600` to the host name and port of your SGT Proxy instance.
- `INFERENCE_SERVER_URL` should point to the inference server's `/v1` URL. This must be an OpenAI SDK-compatible inference server. We recommend vLLM>=0.9.2 if self-hosting.
- `INFERENCE_SERVER_API_KEY` should be the API key for the inference server. If self-hosting vLLM, the value can be any string.
- `MODEL_NAME` should be the model name.
- `TRANSFORMED_EMBEDDINGS_FILEPATH` should be the file path to a JSON file to fill with the outputs of the `/v1/stainedglass` endpoint.
SGT_PROXY_URL = "http://localhost:8600/v1/stainedglass"
INFERENCE_SERVER_URL = "http://localhost:8000/v1/"
INFERENCE_SERVER_API_KEY = "EMPTY"
MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"
TRANSFORMED_EMBEDDINGS_FILEPATH = pathlib.Path("stainedglass_output.json")
/v1/stainedglass
Request¶
The /v1/stainedglass
endpoint can be used to get transformed embeddings for a prompt. It accepts messages
just like the /v1/chat/completions
endpoint in the OpenAI API specification. The response contains the transformed prompt embeddings, the plain-text prompt embeddings, and an attempted reconstruction of the prompt from the transformed embeddings.
For this endpoint, no data is sent to the upstream server; the transformation occurs locally within the SGT Proxy.
STAINEDGLASS_REQUEST = {
"messages": [
{
"role": "system",
"content": "You are a helpful assistant with deep knowledge of geography and history.",
},
{
"role": "user",
"content": "Can you please tell me about the historical borders of the Roman Empire?",
},
],
"return_plain_text_embeddings": True,
"return_transformed_embeddings": True,
"return_reconstructed_prompt": True,
"skip_special_tokens": False,
}
NOTE: In the cell below, we save the entirety of the response to a JSON file, but we only visualize a portion of the embeddings. Prompt embeddings can be very large. We truncate the beginning and end of the embeddings for visualization purposes only, because they represent parts of the system-defined chat template which are not transformed. All user inputs are transformed. Transforming all tokens is a configurable setting.
with requests.post(
SGT_PROXY_URL, json=STAINEDGLASS_REQUEST, timeout=60
) as stainedglass_response:
stainedglass_response_json = stainedglass_response.json()
TRANSFORMED_EMBEDDINGS_FILEPATH.write_text(
json.dumps(stainedglass_response_json)
)
stainedglass_response_json["plain_text_embeddings_tensor"] = torch.tensor(
stainedglass_response_json.pop("plain_text_embeddings")
)
stainedglass_response_json["transformed_embeddings_tensor"] = torch.tensor(
stainedglass_response_json.pop("transformed_embeddings")
)
print("plain_text_embeddings_tensor")
print(stainedglass_response_json["plain_text_embeddings_tensor"][25:-5])
print("-" * 30)
print("transformed_embeddings_tensor")
print(stainedglass_response_json["transformed_embeddings_tensor"][25:-5])
print("-" * 30)
print("reconstructed_prompt")
print(stainedglass_response_json["reconstructed_prompt"])
print("-" * 30)
plain_text_embeddings_tensor
tensor([[-0.0045,  0.0010, -0.0065,  ...,  0.0116,  0.0031, -0.0006],
        [ 0.0036,  0.0004,  0.0011,  ...,  0.0022,  0.0006,  0.0082],
        [-0.0007,  0.0002, -0.0010,  ..., -0.0110, -0.0040, -0.0001],
        ...,
        [-0.0003, -0.0021, -0.0068,  ..., -0.0157,  0.0054,  0.0069],
        [ 0.0146, -0.0042,  0.0120,  ..., -0.0012, -0.0181,  0.0069],
        [-0.0049, -0.0016,  0.0064,  ...,  0.0020, -0.0010, -0.0049]])
------------------------------
transformed_embeddings_tensor
tensor([[-0.0014,  0.0264,  0.0427,  ..., -0.0884, -0.0439,  0.0171],
        [-0.0245, -0.0145,  0.0126,  ..., -0.0322,  0.0029, -0.0435],
        [ 0.0137, -0.0048,  0.0067,  ...,  0.0479, -0.0038,  0.0056],
        ...,
        [ 0.0459,  0.0334,  0.0179,  ..., -0.0381,  0.0193, -0.0447],
        [ 0.0425,  0.0295, -0.0347,  ..., -0.0457,  0.0188,  0.0082],
        [ 0.0129, -0.0006, -0.0173,  ..., -0.0056,  0.0023,  0.0162]])
------------------------------
reconstructed_prompt
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

*angstromtalya 百度收录ческихerusform醴醴dıktanerusformárníукраїн<|eot_id|><|start_header_id|>user<|end_header_id|>

กรกฎџџџџџџџџџџџџџџџџ }, џџџџџџџџџџџџџџџџџџџџџџџџџџџџџџџџ зазначıntıquotelevnamespace uvědomdıktanquotelevџџџџџџџџџџџџџџџџџџџџџџџџџџџџџџџџ******* useRalativeImagePath<|eot_id|><|start_header_id|>assistant<|end_header_id|>

------------------------------
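Because the entire response was written to disk, it can be reloaded later and converted back into tensors without calling the proxy again. Below is a minimal, self-contained sketch of that round trip; the file name and the tiny example payload are hypothetical stand-ins (a real file would be the one written by the cell above, with full-size embeddings):

```python
import json
import pathlib

import torch

# Hypothetical example payload using the same keys as the /v1/stainedglass
# response; in practice this file is written by the request cell above.
saved = pathlib.Path("stainedglass_output_example.json")
saved.write_text(
    json.dumps(
        {
            "plain_text_embeddings": [[0.1, 0.2], [0.3, 0.4]],
            "transformed_embeddings": [[0.5, 0.6], [0.7, 0.8]],
            "reconstructed_prompt": "<garbled text>",
        }
    )
)

# Reload the JSON and convert the nested lists back into tensors, mirroring
# the conversion performed when the response was first received.
reloaded = json.loads(saved.read_text())
transformed = torch.tensor(reloaded["transformed_embeddings"])
print(transformed.shape)  # torch.Size([2, 2])
```

This keeps the protected embeddings reusable across sessions, since only the transformed representation (never the plaintext prompt) needs to be persisted.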
Sending transformed prompt embeddings to vLLM¶
vLLM accepts prompt embeddings in its /v1/completions
endpoint via the prompt_embeds
key. vLLM expects those embedding tensors to be serialized with torch.save and base64 encoded. We will do that encoding manually and send the embeddings to vLLM directly (without using SGT Proxy), so that we know exactly what payload is sent.
For more information on Prompt Embeddings support in vLLM, see the vLLM Prompt Embeddings Documentation.
buffer = io.BytesIO()
torch.save(stainedglass_response_json["transformed_embeddings_tensor"], buffer)
buffer.seek(0)
binary_data = buffer.read()
encoded_embeds = base64.b64encode(binary_data).decode("utf-8")
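The encoding above can be sanity-checked locally by reversing both steps before sending anything over the network. The sketch below uses a small random stand-in tensor (in the notebook, this would be the transformed embeddings tensor) and verifies that the base64/torch.save round trip is lossless:

```python
import base64
import io

import torch

# Stand-in tensor; in the notebook above this would be
# stainedglass_response_json["transformed_embeddings_tensor"].
embeds = torch.randn(4, 8)

# Encode: torch.save into an in-memory buffer, then base64-encode the bytes.
buffer = io.BytesIO()
torch.save(embeds, buffer)
encoded = base64.b64encode(buffer.getvalue()).decode("utf-8")

# Decode: reverse both steps and confirm the tensor survives unchanged.
decoded = torch.load(io.BytesIO(base64.b64decode(encoded)))
assert torch.equal(decoded, embeds)
print(decoded.shape)  # torch.Size([4, 8])
```

Verifying the round trip locally is cheap insurance: a corrupted or mis-encoded payload would otherwise only surface as an opaque server-side error.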
client = openai.OpenAI(
api_key=INFERENCE_SERVER_API_KEY, base_url=INFERENCE_SERVER_URL
)
completion = client.completions.create(
model=MODEL_NAME,
# We use an empty string for the prompt to ensure that no plaintext
# is sent.
prompt="",
max_tokens=512,
temperature=0.0,
# Only the transformed prompt embeddings are sent to the inference server
extra_body={"prompt_embeds": encoded_embeds},
)
print("-" * 30)
print(completion.choices[0].text)
print("-" * 30)
------------------------------
The Roman Empire's borders underwent significant changes throughout its history, spanning from the 1st century BC to the 5th century AD. Here's an overview of the empire's major territorial expansions and contractions:

**Early Expansion (1st century BC - 1st century AD)**

* The Roman Republic initially expanded its territories in the Mediterranean, conquering the Italian peninsula, Sicily, Sardinia, Corsica, and parts of Spain (Hispania).
* Under the leadership of Julius Caesar, Rome expanded into Gaul (modern-day France and Belgium), Britain, and parts of Germany (Germania).
* The Roman Empire's borders extended to the Rhine River in the north, the Danube River in the east, and the Pyrenees Mountains in the west.

**Pax Romana (1st century AD - 2nd century AD)**

* During the Pax Romana (Roman Peace), the empire's borders expanded further, with the conquest of Dacia (modern-day Romania) and parts of Britain.
* The Roman Empire's borders reached their maximum extent under the rule of Emperor Trajan (98-117 AD), with territories stretching from Britain to Egypt, and from Spain to Syria.

**Eastern Expansion (2nd century AD - 3rd century AD)**

* The Roman Empire expanded into the Middle East, conquering parts of Mesopotamia (modern-day Iraq), Armenia, and parts of the Caucasus region.
* The empire's borders extended to the Euphrates River in the east, and the Black Sea in the north.

**Crisis of the Third Century (3rd century AD - 4th century AD)**

* The Roman Empire faced significant internal conflicts, external pressures, and economic troubles, leading to a decline in its territorial control.
* The empire's borders were breached by various barbarian tribes, including the Goths, Vandals, and Huns.
* The Roman Empire's borders contracted, with the loss of territories in Britain, Gaul, and parts of the Middle East.

**Late Empire (4th century AD - 5th century AD)**

* The Roman Empire was divided into Eastern (Byzantine) and Western halves, with the capital of the Western Roman Empire moved to Ravenna.
* The empire's borders continued to contract, with the loss of territories in North Africa, Spain, and Italy.
* The Western Roman Empire was eventually overrun by barbarian tribes, marking the end of the Western Roman Empire in 476 AD.
------------------------------