Inference with Stained Glass Transform (SGT) Proxy and vLLM¶
This notebook demonstrates:
- Running inference against a vLLM instance serving a Llama base model through several OpenAI Chat Completions API-compatible clients, with Stained Glass Transform Proxy protecting the user's input prompts.
- Accessing the input embeddings (both plain and transformed) and the prompt reconstructed from the transformed embeddings.
Inference¶
Pre-requisites¶
- A live instance of vLLM (>=v0.9.1) OpenAI-Compatible Server, with prompt embeddings enabled.
- A live instance of SGT Proxy (Please refer to the deployment instructions).
Chat Completions¶
We can run inference on the vLLM instance by sending requests to SGT Proxy's OpenAI Chat Completions API-compatible endpoint through any of the common clients below. Let's walk through each one.
Configuration Required
Update these parameters for your specific setup:
- PROXY_URL: Your proxy server endpoint
- MODEL_NAME: The base model you want to test
- API_KEY: Your authentication key
MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"
API_KEY = "somemadeupkey123"
PROXY_PORT = 8601
PROXY_URL = f"http://127.0.0.1:{PROXY_PORT}/v1"
SYSTEM_PROMPT = "You are a helpful assistant."
USER_PROMPT_1 = "Who won the world series in 2020?"
ASSISTANT_REPLY = "The Los Angeles Dodgers won the World Series in 2020."
USER_PROMPT_2 = "Where was it played?"
MAX_TOKENS = 3000
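If you would rather not hard-code these values, the sketch below reads them from environment variables and falls back to the defaults used in this notebook. The `SGT_PROXY_URL`, `SGT_MODEL_NAME`, and `SGT_API_KEY` variable names and the `load_config` and `is_reachable` helpers are hypothetical, chosen only for illustration; they are not part of SGT Proxy.

```python
import os
import urllib.error
import urllib.request


def load_config(env=None):
    """Read proxy settings from the environment, with this notebook's defaults.

    The SGT_* variable names are hypothetical and used only for illustration.
    """
    env = os.environ if env is None else env
    return {
        "proxy_url": env.get("SGT_PROXY_URL", "http://127.0.0.1:8601/v1").rstrip("/"),
        "model_name": env.get("SGT_MODEL_NAME", "meta-llama/Llama-3.1-8B-Instruct"),
        "api_key": env.get("SGT_API_KEY", "somemadeupkey123"),
    }


def is_reachable(url, api_key, timeout=5.0):
    """Return True if a GET request to `url` receives any HTTP response."""
    request = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {api_key}"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, even if with an error status code.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return False


config = load_config()
```

A quick liveness probe could then be `is_reachable(f"{config['proxy_url']}/models", config["api_key"])`. This assumes the proxy answers on the `/models` route that OpenAI-compatible servers usually expose, which this notebook does not confirm.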
OpenAI Client¶
- Install the openai Python package.
%uv pip install -q openai
- Perform inference.
import openai

client = openai.OpenAI(base_url=PROXY_URL, api_key=API_KEY)

response = client.chat.completions.create(
    model=MODEL_NAME,
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": USER_PROMPT_1},
        {"role": "assistant", "content": ASSISTANT_REPLY},
        {"role": "user", "content": USER_PROMPT_2},
    ],
    max_tokens=MAX_TOKENS,
)
print(response.choices[0].message.content)
The 2020 World Series was played at Globe Life Field in Arlington, Texas, but with the Los Angeles Dodgers being the home team, they had the home field advantage.
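Every client in this notebook sends the same four-message conversation. As a minimal sketch, a small helper could assemble that list once so it can be reused across clients. The `build_messages` name and its signature are hypothetical, and the prompt strings are repeated here only to keep the example self-contained.

```python
def build_messages(system_prompt, turns, final_user_prompt):
    """Build an OpenAI-style messages list.

    `turns` is a sequence of (user_text, assistant_text) pairs that precede
    the final user prompt.
    """
    messages = [{"role": "system", "content": system_prompt}]
    for user_text, assistant_text in turns:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    messages.append({"role": "user", "content": final_user_prompt})
    return messages


messages = build_messages(
    "You are a helpful assistant.",
    [
        (
            "Who won the world series in 2020?",
            "The Los Angeles Dodgers won the World Series in 2020.",
        )
    ],
    "Where was it played?",
)
```

The resulting list can be passed as the `messages` argument in the OpenAI and LiteLLM examples, or as the request body's `messages` field in the curl and `/stainedglass` examples.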
LangChain¶
- Install the langchain-openai Python package.
%uv pip install -q langchain-openai
- Perform inference.
import langchain_openai
from langchain_core import output_parsers, prompts

llm = langchain_openai.ChatOpenAI(
    model=MODEL_NAME, base_url=PROXY_URL, api_key=API_KEY
)
prompt = prompts.ChatPromptTemplate.from_messages(
    [
        ("system", SYSTEM_PROMPT),
        ("user", USER_PROMPT_1),
        ("assistant", ASSISTANT_REPLY),
        ("user", "{input}"),
    ]
)
output_parser = output_parsers.StrOutputParser()
chain = prompt | llm | output_parser
print(chain.invoke({"input": USER_PROMPT_2}))
The 2020 World Series was played at Globe Life Field in Arlington, Texas. This was the first World Series to be played at a neutral site.
LiteLLM¶
- Install the litellm Python package.
%uv pip install -q litellm
- Perform inference.
import litellm

response = litellm.completion(
    model=f"openai/{MODEL_NAME}",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": USER_PROMPT_1},
        {"role": "assistant", "content": ASSISTANT_REPLY},
        {"role": "user", "content": USER_PROMPT_2},
    ],
    max_tokens=MAX_TOKENS,
    base_url=PROXY_URL,
    api_key=API_KEY,
)
print(response.choices[0].message.content)
The 2020 World Series was played between the Los Angeles Dodgers and the Tampa Bay Rays. The series took place at Globe Life Field in Arlington, Texas. This was because of COVID-19 restrictions and the fact that the season was played in a bubble due to the pandemic.
Magentic¶
- Install the magentic Python package.
%uv pip install -q magentic
- Perform inference.
import magentic


@magentic.chatprompt(
    magentic.SystemMessage(SYSTEM_PROMPT),
    magentic.UserMessage(USER_PROMPT_1),
    magentic.AssistantMessage(ASSISTANT_REPLY),
    magentic.UserMessage("{prompt}"),
)
def get_response(prompt: str) -> str:
    """Use magentic to get a response to the chat history and prompt.

    Magentic will automatically fill in the appropriate OpenAI API calls, which
    is why this function definition is empty.

    Args:
        prompt: The prompt to ask the model as the final user message.

    Returns:
        The response from the model to the prompt and chat history.
    """


with magentic.OpenaiChatModel(MODEL_NAME, api_key=API_KEY, base_url=PROXY_URL):
    response = get_response(USER_PROMPT_2)
    print(response)
The 2020 World Series was played at Globe Life Field in Arlington, Texas. Globe Life Field is the home stadium of the Texas Rangers. Due to COVID-19 restrictions and the pandemic's impact on the season, the 2020 World Series was held at this neutral site, with the Dodgers facing the Tampa Bay Rays.
curl¶
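Request
This request caps the completion at 50 tokens and samples at a temperature of 1.7. A temperature that high tends to produce incoherent text, which likely explains the garbled completion in the response; lower it if you want readable output.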
%%bash
curl --location 'http://127.0.0.1:8601/v1/chat/completions' \
  --header 'Content-Type: application/json' \
  --header 'Authorization: Bearer somemadeupkey123' \
  --data '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Who won the world series in 2020?"},
      {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
      {"role": "user", "content": "Where was it played?"}
    ],
    "max_tokens": 50,
    "temperature": 1.7,
    "seed": 123456
  }' | python -m json.tool
{
    "id": "chatcmpl-37820900bd914dc88c1489ca64333922",
    "object": "chat.completion",
    "created": 1776107123,
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "Due to COVID such-thingications, they played their [*060 World Largest radios Most sonra-show imper kr connect differing adec multitude EMPTY Public gam deractivCare dynamic chap both Tampa teams positions attending William Warning RLocations dul BT/the Sri Lov rendez Kobe",
                "refusal": null,
                "annotations": null,
                "audio": null,
                "function_call": null,
                "tool_calls": [],
                "reasoning": null
            },
            "logprobs": null,
            "finish_reason": "length",
            "stop_reason": null,
            "token_ids": null
        }
    ],
    "service_tier": null,
    "system_fingerprint": null,
    "usage": {
        "prompt_tokens": 79,
        "total_tokens": 129,
        "completion_tokens": 50,
        "prompt_tokens_details": null
    },
    "prompt_logprobs": null,
    "prompt_token_ids": null,
    "kv_transfer_params": null
}
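When calling the endpoint without a client library, the raw JSON has to be unpacked by hand. The sketch below pulls out the fields most callers need from a response shaped like the one above. The `summarize_completion` helper is hypothetical, and the sample payload is a trimmed copy of the response with the message text replaced by a placeholder.

```python
import json


def summarize_completion(response):
    """Extract the commonly needed fields from a chat.completion payload."""
    choice = response["choices"][0]
    usage = response.get("usage") or {}
    finish_reason = choice.get("finish_reason")
    return {
        "content": choice["message"]["content"],
        "finish_reason": finish_reason,
        # "length" means generation stopped because it hit the max_tokens limit.
        "truncated": finish_reason == "length",
        "prompt_tokens": usage.get("prompt_tokens"),
        "completion_tokens": usage.get("completion_tokens"),
        "total_tokens": usage.get("total_tokens"),
    }


# Trimmed copy of the response above; the content is a placeholder.
sample = json.loads(
    """
    {
        "choices": [
            {
                "index": 0,
                "message": {"role": "assistant", "content": "(completion text)"},
                "finish_reason": "length"
            }
        ],
        "usage": {"prompt_tokens": 79, "total_tokens": 129, "completion_tokens": 50}
    }
    """
)
summary = summarize_completion(sample)
print(summary["truncated"], summary["total_tokens"])  # True 129
```

Here `finish_reason` is `"length"` because the request limited the completion to 50 tokens, and the 79 prompt tokens plus 50 completion tokens add up to the reported total of 129.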
Embeddings¶
We can hit the /stainedglass endpoint to fetch:
- Plain-text (un-transformed) embeddings
- Transformed embeddings
- Text prompt reconstructed from the transformed embeddings
As a custom endpoint in SGT Proxy, /stainedglass can be accessed in the following ways:
- curl
- Python
Python¶
Send a POST request and write the response to a json file.
import json
import pathlib

import requests

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {API_KEY}",
}
INPUT_PROMPT_MESSAGES = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": USER_PROMPT_1},
    {"role": "assistant", "content": ASSISTANT_REPLY},
    {"role": "user", "content": USER_PROMPT_2},
]
data = {
    "messages": INPUT_PROMPT_MESSAGES,
    "return_plain_text_embeddings": True,
    "return_transformed_embeddings": True,
    "return_reconstructed_prompt": True,
    "skip_special_tokens": True,
}
with requests.post(
    f"{PROXY_URL}/stainedglass",
    headers=headers,
    json=data,
    stream=False,
    timeout=120,
) as response:
    response.raise_for_status()
    pathlib.Path("response.json").write_text(
        json.dumps(response.json()), encoding="utf-8"
    )
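The request body above sets every `return_*` flag to true. As a rough sketch, a helper like the one below could build the body so a caller can turn individual outputs off. The `build_stainedglass_payload` name is hypothetical, and it is an assumption (not confirmed in this notebook) that setting a flag to false makes the proxy leave that field out of the response.

```python
def build_stainedglass_payload(
    messages,
    *,
    plain_text_embeddings=True,
    transformed_embeddings=True,
    reconstructed_prompt=True,
    skip_special_tokens=True,
):
    """Build a /stainedglass request body using the field names shown above."""
    return {
        "messages": list(messages),
        "return_plain_text_embeddings": plain_text_embeddings,
        "return_transformed_embeddings": transformed_embeddings,
        "return_reconstructed_prompt": reconstructed_prompt,
        "skip_special_tokens": skip_special_tokens,
    }


# Example: request only the reconstructed prompt (assumed to shrink the response).
payload = build_stainedglass_payload(
    [{"role": "user", "content": "Where was it played?"}],
    plain_text_embeddings=False,
    transformed_embeddings=False,
)
```

The returned dictionary can be passed as the `json=` argument of `requests.post` in place of `data`.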
curl¶
Send a curl POST request and pipe the response to a json file.
%%bash
curl -X POST 'http://127.0.0.1:8601/v1/stainedglass' \
  --header 'Content-Type: application/json' \
  --header 'Authorization: Bearer somemadeupkey123' \
  --data '{
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Who won the world series in 2020?"},
      {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
      {"role": "user", "content": "Where was it played?"}
    ],
    "return_plain_text_embeddings": true,
    "return_transformed_embeddings": true,
    "return_reconstructed_prompt": true,
    "skip_special_tokens": true
  }' \
  -o response.json
The contents of response.json should look something like the following:
{
    "plain_text_embeddings": [
        [
            0.0010528564453125,
            -0.000888824462890625,
            0.0021514892578125,
            -0.0036773681640625,
            ...
        ]
    ],
    "transformed_embeddings": [
        [
            -0.0005447890143841505,
            0.001484002685174346,
            -0.002132839523255825,
            0.008831249549984932,
            ...
        ]
    ],
    "reconstructed_prompt": "},\r();\r gepubliceFilters тогоess',\r\x0c]);\r',\r',\r //\r});\r },\r];\r',\r];\r });\r},\r\x1d\x85od';\r};\r //\r\x1c));\r //\r});\r },\r];\r');\r];\r>?[<',\r},\r //\r\x1d });\r"
}
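One way to inspect this output is to compare each plain embedding vector with its transformed counterpart and to check whether any original words survive in the reconstructed prompt. The sketch below is illustrative rather than an official evaluation method. It assumes both embedding fields are lists of vectors that line up one-to-one, as the sample suggests, and the helper names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def compare_embeddings(payload):
    """Return one cosine similarity per (plain, transformed) vector pair."""
    plain = payload["plain_text_embeddings"]
    transformed = payload["transformed_embeddings"]
    # Assumption: the two lists are aligned, one vector per position.
    if len(plain) != len(transformed):
        raise ValueError("expected one transformed vector per plain vector")
    return [cosine_similarity(p, t) for p, t in zip(plain, transformed)]


def shared_words(original_texts, reconstructed_prompt):
    """Whitespace-separated words (lower-cased) common to both sides."""
    original = {w.lower() for text in original_texts for w in text.split()}
    reconstructed = {w.lower() for w in reconstructed_prompt.split()}
    return original & reconstructed


# Toy payload with made-up numbers, only to show the shape of the result.
toy = {
    "plain_text_embeddings": [[1.0, 0.0], [0.0, 2.0]],
    "transformed_embeddings": [[1.0, 0.0], [2.0, 0.0]],
    "reconstructed_prompt": "};\r //\r",
}
similarities = compare_embeddings(toy)
overlap = shared_words(["Where was it played?"], toy["reconstructed_prompt"])
```

To run this against the real output, load the file first with `json.loads(pathlib.Path("response.json").read_text(encoding="utf-8"))` and pass the result to `compare_embeddings`. An empty `overlap` set only shows that no whole word was reproduced verbatim; it is not a privacy guarantee.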
Conclusion¶
- Stained Glass Transform Proxy can be used with a wide array of OpenAI Chat Completions API compatible clients.
- The /stainedglass endpoint offers insight into SGT Proxy's protection mechanisms by providing access to:
  - Plain (un-transformed) LLM embeddings.
  - Transformed LLM embeddings.
  - Reconstructed text from transformed embeddings.