Private RAG Chatbot with Stained Glass Transform Proxy¶
This notebook demonstrates how to build a private chatbot with a Retrieval Augmented Generation (RAG) architecture powered by Stained Glass Transform Proxy, Qdrant, and Gradio.
Overview¶
Large language models (LLMs) now power many intelligent applications. However, most LLM providers require you to send your private data to their servers, which raises serious privacy concerns.
This tutorial shows you how to build a fully private RAG chatbot by integrating the Stained Glass Transform Proxy in place of any standard OpenAI-compatible endpoint. Because the Proxy is API-compatible with OpenAI, the integration requires almost no code changes to an existing application.
The high-level architecture:
- Document ingestion — upload files, split them into chunks, and store embeddings in an in-process Qdrant vector database.
- Retrieval — for every user message, embed the query, retrieve the top-k most relevant chunks from Qdrant, and inject them into the LLM prompt as context.
- Private inference — send the augmented prompt through the Stained Glass Transform Proxy so the upstream LLM provider never sees the plaintext.
- Interactive UI — serve everything through a Gradio web application that users can interact with in their browser.
Info
This notebook is a minimal viable RAG application. For production use, a more complete framework such as OpenWebUI might be preferable.
Install Dependencies¶
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["GRADIO_ANALYTICS_ENABLED"] = "false"
%uv pip install -q torch \
torchvision \
sentence-transformers==5.4.1 \
qdrant-client==1.16.1 \
openai==1.78.1 \
pypdf==5.9.0 \
gradio==5.29.0 \
gradio-client==1.10.0 \
einops==0.8.1
Note: you may need to restart the kernel to use updated packages.
Imports and Setup¶
import os
import pathlib
import uuid
from typing import Final
import gradio as gr
import openai
import pypdf
import qdrant_client
import qdrant_client.models
import requests
import sentence_transformers
import torch
Set Up the In-Memory Vector Database¶
We use Qdrant as the vector database for storing document
embeddings. The qdrant-client library ships with an in-process, in-memory backend
(":memory:"), so there is no Docker container or external service to manage.
Each user session gets its own collection, keyed by the Gradio session hash, so multiple users can upload different documents without interference.
Info
The in-memory backend stores all data in RAM and is discarded when the process
exits. For a production deployment you should switch to the Docker-based or
cloud-hosted Qdrant service instead — only the QdrantClient constructor argument
needs to change.
# In-memory Qdrant
_QDRANT_CLIENT = qdrant_client.QdrantClient(":memory:")
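As a rough sketch of how the switch to a hosted Qdrant could be made configurable, the helper below picks the constructor arguments from environment variables. The variable names `QDRANT_URL` and `QDRANT_API_KEY` and the helper `qdrant_client_kwargs` are assumptions made for this example; they are not defined by the notebook or by Qdrant. The sketch only builds the keyword arguments, so it runs without a Qdrant server.

```python
import os
from collections.abc import Mapping


def qdrant_client_kwargs(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Choose QdrantClient constructor arguments from environment variables.

    QDRANT_URL and QDRANT_API_KEY are hypothetical names used only in this sketch.
    """
    url = env.get("QDRANT_URL")
    if not url:
        # No server configured: fall back to the in-process, in-memory backend.
        return {"location": ":memory:"}
    kwargs = {"url": url}
    api_key = env.get("QDRANT_API_KEY")
    if api_key:
        kwargs["api_key"] = api_key
    return kwargs


print(qdrant_client_kwargs({}))
print(qdrant_client_kwargs({"QDRANT_URL": "http://localhost:6333"}))

# With qdrant-client installed, the result would be passed to the constructor:
# _QDRANT_CLIENT = qdrant_client.QdrantClient(**qdrant_client_kwargs())
```

With no variables set, the notebook keeps using the in-memory backend, so the rest of the code is unaffected.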
Embedding Model¶
We use Qwen/Qwen3-Embedding-0.6B as the sentence embedding model. It produces 1,024-dimensional embeddings and performs well on semantic search tasks out of the box.
The model is loaded once and reused for all embedding calls.
EMBEDDING_MODEL_ID = "Qwen/Qwen3-Embedding-0.6B"
_DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
_embedding_model = sentence_transformers.SentenceTransformer(
EMBEDDING_MODEL_ID,
device=_DEVICE,
trust_remote_code=True,
)
def embed_texts(texts: list[str]) -> list[list[float]]:
    """Embed a list of texts using the Qwen3 embedding model.

    Args:
        texts: Strings to embed.

    Returns:
        A list of embedding vectors (one per input text).
    """
    return _embedding_model.encode(texts, normalize_embeddings=True).tolist()  # type: ignore[return-value]
# Determine the embedding dimension by running a dummy call.
EMBEDDING_DIM: int = len(embed_texts(["test"])[0])
print(f"Embedding dimension: {EMBEDDING_DIM}")
Embedding dimension: 1024
Document Processing¶
Before we can retrieve relevant information from uploaded documents, we need to:
- Load the document (CSV, PDF, or plain text).
- Split it into overlapping chunks so each chunk fits comfortably within the model's context window.
- Embed each chunk and store it in Qdrant.
# Number of *characters* (not tokens) in each chunk.
CHUNK_SIZE = 2048
# Number of *characters* of overlap between consecutive chunks.
CHUNK_OVERLAP = 128
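To make the arithmetic concrete: with these settings the window advances by 2048 - 128 = 1920 characters per chunk. The sketch below lists the character spans a sliding window of this kind would produce for a text of a given length. The helper `chunk_spans` is a hypothetical name used only for this illustration.

```python
CHUNK_SIZE = 2048
CHUNK_OVERLAP = 128


def chunk_spans(
    text_length: int,
    size: int = CHUNK_SIZE,
    overlap: int = CHUNK_OVERLAP,
) -> list[tuple[int, int]]:
    """Return (start, end) character offsets for each sliding-window chunk."""
    step = max(1, size - overlap)  # 1920 with the defaults
    return [
        (start, min(start + size, text_length))
        for start in range(0, text_length, step)
    ]


print(chunk_spans(5000))  # [(0, 2048), (1920, 3968), (3840, 5000)]
```

Each span starts 128 characters before the previous one ends. Note that a text whose length falls just past a multiple of the step (for example 2000 characters) yields a short final span that lies entirely inside the previous chunk.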
Load and Split Documents¶
We support three file types:
- PDF: text is extracted page by page with pypdf, then split by the sliding-window chunker below.
- CSV: each row is treated as a separate chunk (no further splitting).
- Plain text (any other extension): read as UTF-8 and split by the same sliding-window chunker.
_FILE_IS_REQUIRED_ERROR_MSG = "File is required"
def split_documents(file_path: str) -> list[dict[str, str]]:
    """Load a file and split it into overlapping text chunks.

    Args:
        file_path: Path to the document to load.

    Returns:
        A list of dicts, each with `"text"` and `"source"` keys.

    Raises:
        ValueError: If `file_path` is empty or does not point to an existing file.
    """
    # Check the raw string first: pathlib.Path("") becomes Path("."), which is
    # always truthy and exists, so an empty path would otherwise slip through.
    if not file_path:
        raise ValueError(_FILE_IS_REQUIRED_ERROR_MSG)
    file = pathlib.Path(file_path)
    if not file.is_file():
        raise ValueError(_FILE_IS_REQUIRED_ERROR_MSG)

    suffix = file.suffix.lower()
    if suffix == ".csv":
        # Treat each line as its own chunk; no further splitting is needed.
        lines = file.read_text(encoding="utf-8").splitlines()
        return [
            {"text": line, "source": str(file)}
            for line in lines
            if line.strip()
        ]

    if suffix == ".pdf":
        reader = pypdf.PdfReader(file)
        full_text = "\n".join(
            page.extract_text() or "" for page in reader.pages
        )
    else:
        full_text = file.read_text(encoding="utf-8")

    # Sliding-window chunker.
    chunks: list[dict[str, str]] = []
    step = max(1, CHUNK_SIZE - CHUNK_OVERLAP)
    for start in range(0, len(full_text), step):
        chunk_text = full_text[start : start + CHUNK_SIZE]
        if chunk_text.strip():
            chunks.append({"text": chunk_text, "source": str(file)})
    return chunks
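The CSV branch splits on line breaks, so the header row becomes its own chunk and a quoted field that contains a newline is cut in two. If that matters for your data, one possible alternative is to parse rows with the standard-library `csv` module and prefix each value with its column name. This is only a sketch of that idea, not the notebook's method; `csv_rows_as_chunks` is a hypothetical helper.

```python
import csv
import io


def csv_rows_as_chunks(csv_text: str, source: str) -> list[dict[str, str]]:
    """Turn each CSV record into one chunk of "column: value" pairs."""
    reader = csv.DictReader(io.StringIO(csv_text, newline=""))
    chunks: list[dict[str, str]] = []
    for row in reader:
        # Skip surplus fields (stored under the key None) and empty values.
        text = "; ".join(
            f"{key}: {value}"
            for key, value in row.items()
            if key is not None and value
        )
        if text:
            chunks.append({"text": text, "source": source})
    return chunks


sample = 'ticker,note\nACME,"Revenue up\nmargins flat"\nBOLT,Guidance cut\n'
for chunk in csv_rows_as_chunks(sample, "sample.csv"):
    print(chunk)
```

The sample has three physical data lines but only two records, and each resulting chunk carries its column names, which can give the embedding model more context than a bare comma-separated line.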
Create or Load a Qdrant Collection¶
Before we can store document embeddings we need a Qdrant collection — roughly equivalent to a table in a relational database. We lazily create one if it doesn't exist yet.
def ensure_collection(collection_name: str) -> None:
    """Create a Qdrant collection for `collection_name` if one does not already exist.

    Args:
        collection_name: Name of the collection to create.
    """
    existing = {c.name for c in _QDRANT_CLIENT.get_collections().collections}
    if collection_name not in existing:
        _QDRANT_CLIENT.create_collection(
            collection_name,
            vectors_config=qdrant_client.models.VectorParams(
                size=EMBEDDING_DIM,
                distance=qdrant_client.models.Distance.COSINE,
            ),
        )
Add Documents to the Vector Database¶
This function ties everything together: for each uploaded file, it loads the file, splits it into chunks, embeds each chunk, and upserts the resulting vectors into Qdrant.
def upload_and_create_vector_store(
    file_paths: list[str] | None,
    collection_name: str,
) -> str:
    """Embed files and upsert their chunks into a Qdrant collection.

    Args:
        file_paths: Paths to the uploaded files (provided by Gradio), or `None`
            if nothing has been uploaded yet.
        collection_name: Name of the Qdrant collection to populate.

    Returns:
        A status message to display in the Gradio interface.
    """
    if not file_paths:
        return "Please upload at least one file."
    ensure_collection(collection_name)
    for file_path in file_paths:
        chunks = split_documents(file_path)
        if not chunks:
            # Nothing to embed (for example, an empty file).
            continue
        vectors = embed_texts([chunk["text"] for chunk in chunks])
        points = [
            qdrant_client.models.PointStruct(
                id=str(uuid.uuid4()),
                vector=vector,
                payload={"text": chunk["text"], "source": chunk["source"]},
            )
            for chunk, vector in zip(chunks, vectors, strict=True)
        ]
        _QDRANT_CLIENT.upsert(collection_name=collection_name, points=points)
    return "Vector store created."
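For large documents it may be preferable to embed and upsert chunks in fixed-size groups rather than all at once, which keeps the size of each upsert bounded. The sketch below shows only the grouping step. `iter_batches` and the batch size of 4 are assumptions for illustration, and the embedding and upsert calls appear as comments because they need the model and the Qdrant client.

```python
from collections.abc import Iterator, Sequence
from typing import TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


chunk_texts = [f"chunk {i}" for i in range(10)]
for batch in iter_batches(chunk_texts, 4):
    # In the notebook, each batch would be embedded and upserted here, e.g.:
    #   vectors = embed_texts(list(batch))
    #   _QDRANT_CLIENT.upsert(collection_name=..., points=...)
    print(len(batch), batch[0])
```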
RAG Chat¶
With documents in Qdrant, we can now implement the retrieval-augmented generation loop:
- Retrieve the top-k chunks most similar to the user's query.
- Format a prompt that includes those chunks as context.
- Call the LLM (via the Stained Glass Transform Proxy) and return the answer.
We also maintain a per-session message history so the chatbot can refer back to earlier turns in the conversation.
# Per-session chat histories. Keys are Gradio session hashes.
CHAT_HISTORY_STORE: dict[str, list[dict[str, str]]] = {}
def retrieve_context(
    collection_name: str,
    query: str,
    k: int = 3,
) -> list[str]:
    """Retrieve the top-`k` most relevant text chunks for `query`.

    Args:
        collection_name: Qdrant collection to search.
        query: The user's question.
        k: Number of chunks to retrieve.

    Returns:
        A list of text strings, most relevant first.
    """
    existing = {c.name for c in _QDRANT_CLIENT.get_collections().collections}
    if collection_name not in existing:
        return []
    query_vec = embed_texts([query])[0]
    response = _QDRANT_CLIENT.query_points(
        collection_name=collection_name,
        query=query_vec,
        limit=k,
    )
    return [
        str(hit.payload.get("text", ""))
        for hit in response.points
        if hit.payload
    ]
Create a Gradio Interface¶
Now that we have all the building blocks we can wire them together in a Gradio application.
The interface has two panels:
- Left: a ChatInterface for conversing with the RAG agent. Users can also customise the system prompt.
- Right: a file uploader and a button to build the vector store from the uploaded documents.
When show_reconstructed_prompt=True (used with the Stained Glass Transform
Proxy), an extra text box shows an attempted reconstruction of the protected
prompt — a great way to visualise the privacy guarantees.
# Per-session store for the most recently sent (protected) prompt.
RECENT_PROMPT_STORE: dict[str, str] = {}
def fetch_session_hash(request: gr.Request) -> str | None:
    """Return the Gradio session hash from the current request.

    Gradio uses the session hash as a unique identifier for each browser tab.
    We use it as the Qdrant collection name so that each session gets its own
    isolated document store.

    Args:
        request: The current Gradio request.

    Returns:
        The session hash string, or `None` if unavailable.
    """
    return request.session_hash
def chat_bot(
    message: str,
    history: list[list[str]],  # noqa: ARG001 (we maintain our own history)
    session: str,
    system_prompt: str,
    openai_url: str | None,
    openai_model: str,
    api_key: str,
) -> str:
    """Chat with the RAG conversation agent.

    Retrieves relevant context from Qdrant, formats the prompt, and calls the
    LLM via the Stained Glass Transform Proxy.

    Args:
        message: The latest message from the user.
        history: The Gradio chat history (unused; we maintain our own).
        session: The Gradio session hash, used as the Qdrant collection name.
        system_prompt: The system instruction, which may contain a `{context}` placeholder.
        openai_url: Base URL of the OpenAI-compatible endpoint.
        openai_model: Model identifier to pass to the endpoint.
        api_key: API key to pass to the OpenAI client.

    Returns:
        The assistant's reply.
    """
    context_chunks = retrieve_context(session, message)
    context = "\n\n".join(context_chunks)
    # Use str.replace rather than str.format: the system prompt is user-editable,
    # and any other braces in it would make str.format raise an error.
    system_with_context = system_prompt.replace("{context}", context)
    RECENT_PROMPT_STORE[session] = f"{system_with_context}\n\n{message}"

    history_msgs = CHAT_HISTORY_STORE.setdefault(session, [])
    messages: list[dict[str, str]] = [
        {"role": "system", "content": system_with_context},
        *history_msgs,
        {"role": "user", "content": message},
    ]

    llm_client = openai.OpenAI(base_url=openai_url, api_key=api_key)
    response = llm_client.chat.completions.create(
        model=openai_model,
        messages=messages,  # type: ignore[arg-type]
    )
    answer = response.choices[0].message.content or ""

    history_msgs.append({"role": "user", "content": message})
    history_msgs.append({"role": "assistant", "content": answer})
    return answer
def see_stainedglass_reconstruction(
    session: str,
    openai_url: str,
) -> str:
    """Reconstruct the most recent protected prompt via the StainedGlass endpoint.

    Args:
        session: The Gradio session hash, used to look up the most recent prompt.
        openai_url: Base URL of the Stained Glass Transform Proxy.

    Returns:
        The attempted plain-text reconstruction of the protected prompt.
    """
    prompt = RECENT_PROMPT_STORE.get(session)
    if prompt is None:
        return "No prompt to reconstruct"
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "return_transformed_embeddings": False,
        "return_reconstructed_prompt": True,
        "return_plain_text_embeddings": False,
    }
    resp = requests.post(
        f"{openai_url}/stainedglass",
        json=payload,
        timeout=30,
    )
    # Surface HTTP errors directly instead of failing later on a missing key.
    resp.raise_for_status()
    return resp.json()["reconstructed_prompt"]
def setup_chatbot_web_app(
    openai_url: str | None,
    openai_model: str,
    api_key: str,
    show_reconstructed_prompt: bool = False,
) -> gr.Blocks:
    """Build the Gradio interface for the RAG chatbot.

    Args:
        openai_url: Base URL of the OpenAI-compatible LLM endpoint.
        openai_model: Model identifier.
        api_key: API key for the OpenAI-compatible endpoint.
        show_reconstructed_prompt: When `True`, display the SGT prompt
            reconstruction panel (requires a Stained Glass Transform Proxy URL).

    Returns:
        The assembled `gr.Blocks` application (not yet launched).
    """
    with gr.Blocks() as demo, gr.Row():
        with gr.Column(scale=1, variant="panel"):
            gr.Markdown("## RAG Conversation Agent")
            system_prompt = gr.Textbox(
                label="System instruction",
                lines=3,
                value=(
                    "You are an assistant for question-answering tasks. "
                    "Use the following pieces of retrieved context to answer the question. "
                    "If you don't know the answer, just say that you don't know. "
                    "Keep the answer concise. {context}"
                ),
            )
            session = gr.Textbox(value=uuid.uuid1, label="Session")
            chat_interface = gr.ChatInterface(
                fn=lambda message, history, session, system_prompt: chat_bot(
                    message,
                    history,
                    session,
                    system_prompt,
                    openai_url,
                    openai_model,
                    api_key,
                ),
                additional_inputs=[session, system_prompt],
            )
            demo.load(fetch_session_hash, None, session)
            if show_reconstructed_prompt:
                gr.Markdown(
                    "### Attempted Text Reconstruction of Most Recent Prompt "
                    "Protected by Stained Glass Transform"
                )
                most_recent_prompt = gr.Textbox(
                    lines=3,
                    placeholder="Attempted text reconstruction...",
                    label="",
                )
                assert openai_url is not None
                chat_interface.textbox.submit(
                    see_stainedglass_reconstruction,
                    [session, gr.State(openai_url)],
                    most_recent_prompt,
                )
        with gr.Column(scale=1, variant="panel"):
            gr.Markdown("## Upload Document")
            file = gr.File(type="filepath", file_count="multiple")
            with gr.Row(equal_height=True), gr.Column(variant="compact"):
                create_vector_store_button = gr.Button(
                    "Create vector store", variant="primary", scale=1
                )
                vector_index_msg_out = gr.Textbox(
                    show_label=False,
                    lines=1,
                    scale=1,
                    placeholder="Please create vector store...",
                )
            create_vector_store_button.click(
                upload_and_create_vector_store,
                [file, session],
                vector_index_msg_out,
            )
    return demo
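The `chat_bot` function sends the entire per-session history with every request, so long conversations eventually grow the prompt past what the model can accept. One simple mitigation is to keep only the most recent turns. The sketch below shows how the message list could be assembled with such a cap; `build_messages` and `max_history_messages` are hypothetical names, not part of the notebook. It fills the `{context}` placeholder with `str.replace` so that other braces in a customised system prompt are left alone.

```python
def build_messages(
    system_prompt: str,
    context: str,
    history: list[dict[str, str]],
    message: str,
    max_history_messages: int = 6,
) -> list[dict[str, str]]:
    """Assemble chat messages, keeping only the most recent history entries."""
    system_with_context = system_prompt.replace("{context}", context)
    # A slice of [-0:] would return the whole list, so handle zero explicitly.
    recent = history[-max_history_messages:] if max_history_messages > 0 else []
    return [
        {"role": "system", "content": system_with_context},
        *recent,
        {"role": "user", "content": message},
    ]


history = [
    {"role": "user", "content": "What was revenue in 2023?"},
    {"role": "assistant", "content": "It was reported in the annual filing."},
    {"role": "user", "content": "And in 2022?"},
    {"role": "assistant", "content": "That figure is in the same filing."},
]
for msg in build_messages(
    "Answer from context. {context}", "Retrieved text.", history, "Compare them.",
    max_history_messages=2,
):
    print(msg["role"], "|", msg["content"])
```

Counting messages is a crude proxy for prompt length; a token-based budget would be more precise but needs the model's tokenizer.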
Launch Private RAG Chatbot with Stained Glass Transform Proxy¶
We now have all the pieces. The only change required to point the RAG chatbot at the Stained Glass Transform Proxy is to swap the endpoint URL.
Because the Proxy implements the OpenAI Chat Completions API, no other application code changes are needed.
Deploy the Stained Glass Transform Proxy¶
Before launching the Gradio app, make sure the proxy is running and accessible. See the Deployment Guides for step-by-step instructions.
By default the proxy listens on port 8601 of the local machine.
# Point the chatbot at the locally running Stained Glass Transform Proxy.
STAINED_GLASS_TRANSFORM_PROXY_URL = "http://localhost:8601/v1"
STAINED_GLASS_TRANSFORM_MODEL = "meta-llama/Llama-3.1-8B-Instruct"
STAINED_GLASS_API_KEY = "SomeMadeUpKeyForTesting"
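To illustrate what "OpenAI-compatible" means at the HTTP level, the sketch below builds (but does not send) a Chat Completions request against the proxy URL using only the standard library. It mirrors the public OpenAI API shape: a POST to `<base_url>/chat/completions` with a bearer token and a JSON body. `build_chat_request` is a hypothetical helper for this example, and the values simply repeat the settings used in this notebook.

```python
import json
import urllib.request


def build_chat_request(
    base_url: str,
    model: str,
    api_key: str,
    user_message: str,
) -> urllib.request.Request:
    """Build an OpenAI-style Chat Completions request without sending it."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


request = build_chat_request(
    "http://localhost:8601/v1",
    "meta-llama/Llama-3.1-8B-Instruct",
    "SomeMadeUpKeyForTesting",
    "Hello",
)
print(request.get_method(), request.full_url)
# POST http://localhost:8601/v1/chat/completions

# Sending the request requires a running proxy:
# with urllib.request.urlopen(request, timeout=30) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Pointing the same request at a different OpenAI-compatible server only requires changing the base URL, which is the property the chatbot relies on.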
Example
Try out the demo with some sample documents. After uploading the files and clicking the Create Vector Store button, the chatbot will be able to retrieve information from those documents.
Download a few public financial documents to try:
APP_PORT = 7860
protected_demo = setup_chatbot_web_app(
openai_url=STAINED_GLASS_TRANSFORM_PROXY_URL,
openai_model=STAINED_GLASS_TRANSFORM_MODEL,
api_key=STAINED_GLASS_API_KEY,
show_reconstructed_prompt=True,
)
protected_demo.launch(server_port=APP_PORT, inline=False, show_error=True)
* Running on local URL: http://127.0.0.1:7860
* To create a public link, set `share=True` in `launch()`.
Cleanup¶
Close the Gradio application when you are done.
protected_demo.close()
Closing server running on port: 7860