# Inference with Stained Glass Transform (SGT) Proxy and vLLM

This notebook demonstrates:

- Running inference against a **vLLM** instance serving a Llama base model through [OpenAI Chat Completions API](https://platform.openai.com/docs/api-reference/chat) compatible clients, with **Stained Glass Transform Proxy** protecting the user's input prompts.
- Accessing the input embeddings (plain and transformed) and the prompt reconstructed from the transformed embeddings.

## Inference

### Pre-requisites

- A live instance of **vLLM (>=v0.9.1) OpenAI-Compatible Server**, with [prompt embeddings enabled](https://docs.vllm.ai/en/latest/features/prompt_embeds.html).
- A live instance of **SGT Proxy** (see the [deployment instructions](../index.html)).
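
Before going further, it can help to confirm that a server is actually reachable. The following is a minimal sketch using only the standard library. It assumes the server exposes the OpenAI-compatible `GET /v1/models` route, which vLLM's OpenAI-compatible server does; whether SGT Proxy forwards that route is an assumption to verify against your deployment. The helper names and the address in the usage comment are placeholders, not part of either product.

```python
import json
import urllib.request


def models_url(base_url):
    """Build the model-listing URL from a base URL such as http://host:port/v1."""
    return base_url.rstrip("/") + "/models"


def served_model_ids(payload):
    """Extract model ids from an OpenAI-style model list response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def check_server(base_url, api_key=None, timeout=10):
    """Query the server and return the ids of the models it reports.

    Raises urllib.error.URLError if the server cannot be reached.
    """
    request = urllib.request.Request(models_url(base_url))
    if api_key:
        request.add_header("Authorization", f"Bearer {api_key}")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return served_model_ids(json.load(response))


# Usage (placeholder address; not executed here):
# print(check_server("http://127.0.0.1:8000/v1"))
```

If the call raises a connection error, the server is not reachable at that address; if the returned list does not contain the model you expect, check the model name the server was launched with.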

### Chat Completions

We can run inference on the **vLLM** instance by sending requests to the **SGT Proxy's** [OpenAI Chat Completions API compatible endpoint](https://platform.openai.com/docs/api-reference/chat) through any of the following common clients. Let's walk through each of them.

!!! note "Configuration Required"

    Update these parameters for your specific setup:
    
    - **PROXY_URL**: Your proxy server endpoint
    - **MODEL_NAME**: The base model you want to test
    - **API_KEY**: Your authentication key

In [None]:
# Set proxy access parameters.
PROXY_URL = "http://127.0.0.1:8601/v1"
MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"
API_KEY = "<overwrite-with-your-api-key>"
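
To avoid editing the notebook (and committing an API key) for every environment, the same three parameters could instead be read from environment variables. This is only a sketch: the `SGT_`-prefixed variable names and the `load_proxy_settings` helper are hypothetical and are not a convention defined by SGT Proxy. The fallback values are the ones used in the cell above.

```python
import os

# Fallback values taken from the configuration cell above.
DEFAULTS = {
    "PROXY_URL": "http://127.0.0.1:8601/v1",
    "MODEL_NAME": "meta-llama/Llama-3.1-8B-Instruct",
    "API_KEY": "<overwrite-with-your-api-key>",
}


def load_proxy_settings(env=None):
    """Return the three settings, preferring SGT_-prefixed environment variables.

    The SGT_ prefix is a hypothetical naming choice for this example.
    """
    env = os.environ if env is None else env
    settings = {
        name: env.get(f"SGT_{name}", default)
        for name, default in DEFAULTS.items()
    }
    # Strip a trailing slash so f"{PROXY_URL}/stainedglass" has no double slash.
    settings["PROXY_URL"] = settings["PROXY_URL"].rstrip("/")
    return settings


settings = load_proxy_settings()
PROXY_URL = settings["PROXY_URL"]
MODEL_NAME = settings["MODEL_NAME"]
API_KEY = settings["API_KEY"]
```

With this in place, exporting `SGT_PROXY_URL` before launching the notebook would override the default endpoint without touching the code.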

#### OpenAI Client

1. Install the `openai` Python package.

In [2]:
%pip install openai

Note: you may need to restart the kernel to use updated packages.


2. Perform inference.

In [3]:
import openai

client = openai.OpenAI(base_url=PROXY_URL, api_key=API_KEY)

response = client.chat.completions.create(
    model=MODEL_NAME,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"},
        {
            "role": "assistant",
            "content": "The Los Angeles Dodgers won the World Series in 2020.",
        },
        {"role": "user", "content": "Where was it played?"},
    ],
)

print(response.choices[0].message.content)

The 2020 World Series was played at Globe Life Field in Arlington, Texas


#### LangChain

1. Install the `langchain-openai` Python package.

In [4]:
%pip install langchain-openai

Note: you may need to restart the kernel to use updated packages.


2. Perform inference.

In [5]:
import langchain_openai
from langchain_core import output_parsers, prompts

llm = langchain_openai.ChatOpenAI(
    model=MODEL_NAME, base_url=PROXY_URL, api_key=API_KEY
)
prompt = prompts.ChatPromptTemplate.from_messages(
    [
        ("system", "You are a helpful assistant."),
        ("user", "Who won the world series in 2020?"),
        ("assistant", "The Los Angeles Dodgers won the World Series in 2020."),
        ("user", "{input}"),
    ]
)
output_parser = output_parsers.StrOutputParser()

chain = prompt | llm | output_parser
print(chain.invoke({"input": "Where was it played?"}))

The 2020 World Series was played at Globe Life Field in Arlington, Texas


#### LiteLLM

1. Install the `litellm` Python package.

In [6]:
%pip install litellm

Note: you may need to restart the kernel to use updated packages.


2. Perform inference.

In [7]:
import litellm

response = litellm.completion(
    model=f"openai/{MODEL_NAME}",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"},
        {
            "role": "assistant",
            "content": "The Los Angeles Dodgers won the World Series in 2020.",
        },
        {"role": "user", "content": "Where was it played?"},
    ],
    base_url=PROXY_URL,
    api_key=API_KEY,
)

print(response.choices[0].message.content)

The 2020 World Series was played at Globe Life Field in Arlington, Texas


#### Magentic

1. Install the `magentic` Python package.

In [8]:
%pip install magentic

Note: you may need to restart the kernel to use updated packages.


2. Perform inference.

In [9]:
import magentic


@magentic.chatprompt(
    magentic.SystemMessage("You are a helpful assistant."),
    magentic.UserMessage("Who won the world series in 2020?"),
    magentic.AssistantMessage(
        "The Los Angeles Dodgers won the World Series in 2020."
    ),
    magentic.UserMessage("{prompt}"),
)
def get_response(prompt: str) -> str:
    """Use magentic to get a response to the chat history and prompt.

    Magentic will automatically fill in the appropriate OpenAI API calls, which
    is why this function definition is empty.

    Args:
        prompt: The prompt to ask the model as the final user message.

    Returns:
        The response from the model to the prompt and chat history.
    """


with magentic.OpenaiChatModel(MODEL_NAME, api_key=API_KEY, base_url=PROXY_URL):
    response = get_response("Where was it played?")

print(response)

The 2020 World Series was played at Globe Life Field in Arlington, Texas


#### curl

**Request**

In [None]:
%%bash
curl --location 'http://127.0.0.1:8601/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"},
        {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
        {"role": "user", "content": "Where was it played?"}
    ],
    "max_tokens": 3000,
    "temperature": 1.7,
    "seed": 123456
}' | python -m json.tool


{
    "id": "chatcmpl-e24cd5ddfc264b0ab0717885ee46e988",
    "object": "chat.completion",
    "created": 1760031272,
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "The 2020 Major League Baseball (MLB) World Series took place from October 20th to October 27th at EMPTY Public gam\u00e9szones.",
                "refusal": null,
                "annotations": null,
                "audio": null,
                "function_call": null,
                "tool_calls": [],
                "reasoning_content": null
            },
            "logprobs": null,
            "finish_reason": "stop",
            "stop_reason": null,
            "token_ids": null
        }
    ],
    "service_tier": null,
    "system_fingerprint": null,
    "usage": {
        "prompt_tokens": 79,
        "total_tokens": 112,
        "completion_tokens": 33,
        "prompt_token

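When working with the raw JSON body rather than a client library, the fields of interest can be pulled out with plain dictionary access. The sketch below assumes the response follows the Chat Completions shape shown in the output above (`choices`, `message`, `finish_reason`, `usage`). The `summarize_completion` helper is hypothetical, and the sample dictionary is a hand-trimmed stand-in, not a real server response.

```python
def summarize_completion(payload):
    """Pick the generated text, finish reason, and token counts from a response."""
    choice = payload["choices"][0]
    usage = payload.get("usage") or {}
    return {
        "content": choice["message"]["content"],
        "finish_reason": choice["finish_reason"],
        "prompt_tokens": usage.get("prompt_tokens"),
        "completion_tokens": usage.get("completion_tokens"),
        "total_tokens": usage.get("total_tokens"),
    }


# Hand-trimmed stand-in with the same structure as the output above.
sample_response = {
    "object": "chat.completion",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "<model reply>"},
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 79, "total_tokens": 112, "completion_tokens": 33},
}

summary = summarize_completion(sample_response)
print(summary["content"], summary["finish_reason"], summary["total_tokens"])
```
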
### Embeddings

We can hit the `/stainedglass` endpoint to fetch:

- Plain-text (un-transformed) embeddings
- Transformed embeddings
- Text prompt reconstructed from the transformed embeddings

As a custom endpoint in SGT Proxy, `/stainedglass` can be accessed in the following ways:

- Python
- curl

#### Python

Send a `POST` request and write the response to a JSON file.

In [12]:
import json

import requests

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {API_KEY}",
}

INPUT_PROMPT_MESSAGES = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who won the world series in 2020?"},
    {
        "role": "assistant",
        "content": "The Los Angeles Dodgers won the World Series in 2020.",
    },
    {"role": "user", "content": "Where was it played?"},
]

data = {
    "messages": INPUT_PROMPT_MESSAGES,
    "return_plain_text_embeddings": True,
    "return_transformed_embeddings": True,
    "return_reconstructed_prompt": True,
    "skip_special_tokens": True,
}

with requests.post(
    f"{PROXY_URL}/stainedglass",
    headers=headers,
    json=data,
    stream=False,
    timeout=120,
) as response:
    response.raise_for_status()

    with open("response.json", "w") as json_file:
        json.dump(response.json(), json_file)
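
The request body above turns on all three outputs at once. Since the embedding arrays are likely to dominate the response size, it may be convenient to request only what is needed. The sketch below wraps the same field names used in the cell above in a small, hypothetical helper (`build_stainedglass_payload`). It assumes the endpoint accepts `False` for each `return_*` flag, which the source does not state explicitly.

```python
def build_stainedglass_payload(
    messages,
    *,
    plain=True,
    transformed=True,
    reconstructed=True,
    skip_special_tokens=True,
):
    """Build a /stainedglass request body with the chosen outputs enabled."""
    return {
        "messages": messages,
        "return_plain_text_embeddings": plain,
        "return_transformed_embeddings": transformed,
        "return_reconstructed_prompt": reconstructed,
        "skip_special_tokens": skip_special_tokens,
    }


# Example: ask only for the reconstructed prompt.
example_messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Where was it played?"},
]
payload = build_stainedglass_payload(
    example_messages, plain=False, transformed=False
)
print(sorted(key for key, value in payload.items() if value is True))
```

The resulting dictionary can be passed as the `json=` argument of the `requests.post` call shown above.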

#### curl

Send a `POST` request with `curl` and write the response to a JSON file using the `-o` option.

In [None]:
%%bash
curl -X POST 'http://127.0.0.1:8601/v1/stainedglass' \
--header 'Content-Type: application/json' \
--data '{
  "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"},
        {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
        {"role": "user", "content": "Where was it played?"}
    ],
  "return_plain_text_embeddings": true,
  "return_transformed_embeddings": true,
  "return_reconstructed_prompt": true,
  "skip_special_tokens": true
}' \
-o response.json


**response.json**

The file output should be something like the following:

```json
{
    "plain_text_embeddings": [
      [
        0.0010528564453125,
        -0.000888824462890625,
        0.0021514892578125,
        -0.0036773681640625,
        ...
      ]
    ],
    "transformed_embeddings": [
      [
        -0.0005447890143841505,
        0.001484002685174346,
        -0.002132839523255825,
        0.008831249549984932,
        ...
      ]
    ],
    "reconstructed_prompt": "},\r();\r gepubliceFilters тогоess',\r\x0c]);\r',\r',\r //\r});\r },\r];\r',\r];\r });\r},\r\x1d\x85od';\r};\r //\r\x1c));\r //\r});\r },\r];\r');\r];\r>?[<',\r},\r //\r\x1d });\r"
}
```

The prompt reconstructed from the transformed embeddings is unreadable and bears no resemblance to the original input, which demonstrates the input prompt protection provided by **Stained Glass Transform Proxy**.
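
To look at the saved file more quantitatively, the plain and transformed embeddings can be compared row by row. The following is a rough sketch using only the standard library. It assumes each embeddings field is a list of per-token vectors (optionally wrapped in a batch dimension of size one) and that both fields contain the same number of rows; neither assumption is confirmed by the source. The helper names are hypothetical, and cosine similarity is just one convenient way to compare the vectors, not a measure defined by SGT Proxy.

```python
import json
import math
from pathlib import Path


def as_rows(embeddings):
    """Unwrap a leading batch dimension if present (assumes batch size 1)."""
    if embeddings and embeddings[0] and isinstance(embeddings[0][0], list):
        return embeddings[0]
    return embeddings


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def compare_embeddings(payload):
    """Summarize how far the transformed embeddings are from the plain ones."""
    plain = as_rows(payload["plain_text_embeddings"])
    transformed = as_rows(payload["transformed_embeddings"])
    if len(plain) != len(transformed):
        raise ValueError("plain and transformed embeddings differ in length")
    sims = [cosine_similarity(p, t) for p, t in zip(plain, transformed)]
    return {
        "tokens": len(plain),
        "hidden_size": len(plain[0]) if plain else 0,
        "mean_cosine": sum(sims) / len(sims) if sims else 0.0,
    }


# Only runs when the file written by the earlier cells is present.
response_path = Path("response.json")
if response_path.exists():
    saved = json.loads(response_path.read_text())
    print(compare_embeddings(saved))
    print(repr(saved["reconstructed_prompt"]))
```

A mean cosine similarity close to 1 would indicate that the transformed vectors point in nearly the same direction as the plain ones; lower values indicate a larger change in direction. Interpreting the number as a privacy measure is beyond what this sketch supports.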

## Conclusion

- **Stained Glass Transform Proxy** can be used with a wide array of [OpenAI Chat Completions API](https://platform.openai.com/docs/guides/chat-completions) compatible clients.
- The `/stainedglass` endpoint offers insight into the **SGT Proxy's** protection mechanisms by providing access to:
    - Plain (un-transformed) LLM embeddings.
    - Transformed LLM embeddings.
    - Reconstructed text from transformed embeddings.