Inference with Stained Glass Transform (SGT) Proxy and vLLM¶

This notebook demonstrates:

- Several ways to run inference against a vLLM instance serving a Llama base model through OpenAI Chat Completions API-compatible clients, while Stained Glass Transform Proxy protects the user's input prompts.
- How to access the input embeddings (plain and transformed) and the prompt reconstructed from the transformed embeddings.

Inference¶

Pre-requisites¶

- A live vLLM (>= v0.9.1) OpenAI-compatible server with prompt embeddings enabled.
- A live SGT Proxy instance (see the deployment instructions).
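Before running the cells below, it can help to confirm that both services accept connections. The following is a minimal sketch using only the standard library. The helper name `is_port_open` is hypothetical, port 8601 matches the proxy port used later in this notebook, and port 8000 is vLLM's default port and only an assumption about your deployment. A successful TCP connection shows that something is listening, not that the service is correctly configured.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Hypothetical addresses; adjust them to match your deployment.
SERVICES = {
    "SGT Proxy": ("127.0.0.1", 8601),
    "vLLM": ("127.0.0.1", 8000),
}

for name, (host, port) in SERVICES.items():
    status = "reachable" if is_port_open(host, port) else "not reachable"
    print(f"{name} at {host}:{port} is {status}")
```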

Chat Completions¶

We can run inference on the vLLM instance by sending requests to SGT Proxy's OpenAI Chat Completions API-compatible endpoint through any of the following common interfaces. Let's walk through each of them.

Configuration Required

Update these parameters for your specific setup:

- PROXY_URL: Your proxy server endpoint
- MODEL_NAME: The base model you want to test
- API_KEY: Your authentication key

MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"
API_KEY = "somemadeupkey123"
PROXY_PORT = 8601
PROXY_URL = f"http://127.0.0.1:{PROXY_PORT}/v1"
SYSTEM_PROMPT = "You are a helpful assistant."
USER_PROMPT_1 = "Who won the world series in 2020?"
ASSISTANT_REPLY = "The Los Angeles Dodgers won the World Series in 2020."
USER_PROMPT_2 = "Where was it played?"
MAX_TOKENS = 3000
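
If you would rather not hard-code these values, they can be read from the environment instead. This is a minimal sketch: the function name `load_config` and the environment variable names (`SGT_PROXY_HOST`, `SGT_PROXY_PORT`, `SGT_PROXY_API_KEY`, `SGT_MODEL_NAME`) are hypothetical and are not defined by SGT Proxy. The defaults mirror the values used in this notebook.

```python
import os
from collections.abc import Mapping


def load_config(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Build the notebook settings from environment variables.

    The variable names are hypothetical; rename them to suit your setup.
    """
    host = env.get("SGT_PROXY_HOST", "127.0.0.1")
    port = int(env.get("SGT_PROXY_PORT", "8601"))
    return {
        "PROXY_URL": f"http://{host}:{port}/v1",
        "API_KEY": env.get("SGT_PROXY_API_KEY", "somemadeupkey123"),
        "MODEL_NAME": env.get(
            "SGT_MODEL_NAME", "meta-llama/Llama-3.1-8B-Instruct"
        ),
    }


config = load_config()
print(config["PROXY_URL"])
```

Passing the environment as an argument keeps the function easy to exercise with a plain dict, for example `load_config({"SGT_PROXY_PORT": "9000"})`.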

OpenAI Client¶

Install the openai Python package.

%uv pip install -q openai

Perform inference.

import openai

client = openai.OpenAI(base_url=PROXY_URL, api_key=API_KEY)

response = client.chat.completions.create(
    model=MODEL_NAME,
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": USER_PROMPT_1},
        {"role": "assistant", "content": ASSISTANT_REPLY},
        {"role": "user", "content": USER_PROMPT_2},
    ],
    max_tokens=MAX_TOKENS,
)

print(response.choices[0].message.content)

The 2020 World Series was played at Globe Life Field in Arlington, Texas, but with the Los Angeles Dodgers being the home team, they had the home field advantage.

LangChain¶

Install the langchain-openai Python package.

%uv pip install -q langchain-openai

Perform inference.

import langchain_openai
from langchain_core import output_parsers, prompts

llm = langchain_openai.ChatOpenAI(
    model=MODEL_NAME, base_url=PROXY_URL, api_key=API_KEY
)
prompt = prompts.ChatPromptTemplate.from_messages(
    [
        ("system", SYSTEM_PROMPT),
        ("user", USER_PROMPT_1),
        ("assistant", ASSISTANT_REPLY),
        ("user", "{input}"),
    ]
)
output_parser = output_parsers.StrOutputParser()

chain = prompt | llm | output_parser
print(chain.invoke({"input": USER_PROMPT_2}))

The 2020 World Series was played at Globe Life Field in Arlington, Texas. This was the first World Series to be played at a neutral site.

LiteLLM¶

Install the litellm Python package.

%uv pip install -q litellm

Perform inference.

import litellm

response = litellm.completion(
    model=f"openai/{MODEL_NAME}",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": USER_PROMPT_1},
        {"role": "assistant", "content": ASSISTANT_REPLY},
        {"role": "user", "content": USER_PROMPT_2},
    ],
    max_tokens=MAX_TOKENS,
    base_url=PROXY_URL,
    api_key=API_KEY,
)

print(response.choices[0].message.content)

The 2020 World Series was played between the Los Angeles Dodgers and the Tampa Bay Rays. The series took place at Globe Life Field in Arlington, Texas. This was because of COVID-19 restrictions and the fact that the season was played in a bubble due to the pandemic.

Magentic¶

Install the magentic Python package.

%uv pip install -q magentic

Perform inference.

import magentic


@magentic.chatprompt(
    magentic.SystemMessage(SYSTEM_PROMPT),
    magentic.UserMessage(USER_PROMPT_1),
    magentic.AssistantMessage(ASSISTANT_REPLY),
    magentic.UserMessage("{prompt}"),
)
def get_response(prompt: str) -> str:
    """Use magentic to get a response to the chat history and prompt.

    Magentic will automatically fill in the appropriate OpenAI API calls, which
    is why this function definition is empty.

    Args:
        prompt: The prompt to ask the model as the final user message.

    Returns:
        The response from the model to the prompt and chat history.
    """


with magentic.OpenaiChatModel(MODEL_NAME, api_key=API_KEY, base_url=PROXY_URL):
    response = get_response(USER_PROMPT_2)

print(response)

The 2020 World Series was played at Globe Life Field in Arlington, Texas. Globe Life Field is the home stadium of the Texas Rangers. Due to COVID-19 restrictions and the pandemic's impact on the season, the 2020 World Series was held at this neutral site, with the Dodgers facing the Tampa Bay Rays.

curl¶

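Request

This example sets a high sampling temperature (1.7), which likely explains why the generated text in the response is largely incoherent. Use a lower value for readable output.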

%%bash
curl --location 'http://127.0.0.1:8601/v1/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer somemadeupkey123' \
--data '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"},
        {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
        {"role": "user", "content": "Where was it played?"}
    ],
    "max_tokens": 50,
    "temperature": 1.7,
    "seed": 123456
}' | python -m json.tool

{
    "id": "chatcmpl-37820900bd914dc88c1489ca64333922",
    "object": "chat.completion",
    "created": 1776107123,
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "Due to  COVID such-thingications, they played their [*060 World Largest radios Most sonra-show imper kr connect differing adec multitude EMPTY Public gam deractivCare dynamic chap both Tampa teams positions attending William Warning RLocations dul BT/the Sri Lov rendez Kobe",
                "refusal": null,
                "annotations": null,
                "audio": null,
                "function_call": null,
                "tool_calls": [],
                "reasoning": null
            },
            "logprobs": null,
            "finish_reason": "length",
            "stop_reason": null,
            "token_ids": null
        }
    ],
    "service_tier": null,
    "system_fingerprint": null,
    "usage": {
        "prompt_tokens": 79,
        "total_tokens": 129,
        "completion_tokens": 50,
        "prompt_tokens_details": null
    },
    "prompt_logprobs": null,
    "prompt_token_ids": null,
    "kv_transfer_params": null
}
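
The fields of a response like this one can also be read programmatically. The sketch below assumes the response body has already been parsed into a Python dict. The helper name `summarize_completion` is hypothetical, and the sample is a trimmed copy of the response above with the message content cut short. In the Chat Completions format, a `finish_reason` of `"length"` indicates that generation stopped because it reached `max_tokens`.

```python
from typing import Any


def summarize_completion(payload: dict[str, Any]) -> dict[str, Any]:
    """Pull the most commonly needed fields out of a chat completion."""
    choice = payload["choices"][0]
    usage = payload.get("usage") or {}
    finish_reason = choice.get("finish_reason")
    return {
        "content": choice["message"]["content"],
        "finish_reason": finish_reason,
        # "length" means the output was cut off at max_tokens.
        "truncated": finish_reason == "length",
        "prompt_tokens": usage.get("prompt_tokens"),
        "completion_tokens": usage.get("completion_tokens"),
    }


# Trimmed copy of the response shown above; the content is cut short here.
sample = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Due to  COVID"},
            "finish_reason": "length",
        }
    ],
    "usage": {"prompt_tokens": 79, "total_tokens": 129, "completion_tokens": 50},
}

summary = summarize_completion(sample)
print(summary)
```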

Embeddings¶

We can call the /stainedglass endpoint to fetch:

- Plain-text (un-transformed) embeddings
- Transformed embeddings
- The text prompt reconstructed from the transformed embeddings

/stainedglass is a custom endpoint in SGT Proxy. It can be accessed in the following ways:

- Python
- curl

Python¶

Send a POST request and write the response to a JSON file.

import json
import pathlib

import requests

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {API_KEY}",
}

INPUT_PROMPT_MESSAGES = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": USER_PROMPT_1},
    {"role": "assistant", "content": ASSISTANT_REPLY},
    {"role": "user", "content": USER_PROMPT_2},
]

data = {
    "messages": INPUT_PROMPT_MESSAGES,
    "return_plain_text_embeddings": True,
    "return_transformed_embeddings": True,
    "return_reconstructed_prompt": True,
    "skip_special_tokens": True,
}

with requests.post(
    f"{PROXY_URL}/stainedglass",
    headers=headers,
    json=data,
    stream=False,
    timeout=120,
) as response:
    response.raise_for_status()

    pathlib.Path("response.json").write_text(
        json.dumps(response.json()), encoding="utf-8"
    )

curl¶

Send a POST request with curl and write the response to a JSON file.

%%bash
curl -X POST 'http://127.0.0.1:8601/v1/stainedglass' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer somemadeupkey123' \
--data '{
  "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"},
        {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
        {"role": "user", "content": "Where was it played?"}
    ],
  "return_plain_text_embeddings": true,
  "return_transformed_embeddings": true,
  "return_reconstructed_prompt": true,
  "skip_special_tokens": true
}' \
-o response.json

response.json

The contents of the file should look something like the following:

{
    "plain_text_embeddings": [
      [
        0.0010528564453125,
        -0.000888824462890625,
        0.0021514892578125,
        -0.0036773681640625,
        ...
      ]
    ],
    "transformed_embeddings": [
      [
        -0.0005447890143841505,
        0.001484002685174346,
        -0.002132839523255825,
        0.008831249549984932,
        ...
      ]
    ],
    "reconstructed_prompt": "},\r();\r gepubliceFilters тогоess',\r\x0c]);\r',\r',\r //\r});\r },\r];\r',\r];\r });\r},\r\x1d\x85od';\r};\r //\r\x1c));\r //\r});\r },\r];\r');\r];\r>?[<',\r},\r //\r\x1d });\r"
}

As the example shows, the prompt reconstructed from the transformed embeddings is an unreadable block of text that bears no resemblance to the original input. This demonstrates the input prompt protection provided by Stained Glass Transform Proxy.

Conclusion¶

Stained Glass Transform Proxy can be used with a wide array of OpenAI Chat Completions API-compatible clients.
The /stainedglass endpoint offers insight into SGT Proxy's protection mechanisms by providing access to:
- Plain (un-transformed) LLM embeddings.
- Transformed LLM embeddings.
- Reconstructed text from transformed embeddings.