Environment Variables¶
ProxySettings (pydantic-model)¶
Settings for Stained Glass Proxy.
Note
Any of these can be set via environment variables with the prefix SGP_. For example, to set
inference_service_host="http://localhost", set the environment variable SGP_INFERENCE_SERVICE_HOST="http://localhost".
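A minimal environment for the proxy might look like the following sketch (the host, device, and SGT path values are illustrative placeholders, not recommendations):

```shell
# Every ProxySettings field maps to an environment variable with the SGP_ prefix.
export SGP_INFERENCE_SERVICE_HOST="http://localhost:8000"
export SGP_DEVICE="cpu"
export SGP_SGT_PATH="/models/example.sgt"  # placeholder path
```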
Config:
- env_prefix: SGP_

Fields:
- otel_hide_sensitive_attributes (bool)
- inference_service_host (str)
- sgt_path (str)
- sgt_name (str | None)
- min_new_tokens (int | None)
- seed (int | None)
- temperature (float)
- top_p (float)
- top_k (int)
- repetition_penalty (float)
- upstream_keep_alive_timeout (float)
- session_timeout (float)
- api_username (str | None)
- api_password (str | None)
- use_aiohttp_for_upstream (bool)
- sagemaker_endpoint_name (str | None)
- device (Literal['cpu', 'cuda', 'mps'])
- num_sgt_workers (int)
- compile_noise_layer_forward (bool | None)
- tensor_parallel_size (int | None)
- grace_period_seconds (float)
- worker_ready_timeout_seconds (float | None)
- sgt_torch_dtype (str | None)
- sgt_noise_layer_attention (SupportedAttentionImplementationsType)
- max_input_tokens (int | None)
- sgt_third_party_model_path (str | PathLike[str] | None)
- sgt_trust_remote_code (bool)
- output_decryption (bool)
- ephemeral_key_refresh_time_seconds (float)
- client_public_key_header_name (str)
- server_public_key_header_name (str)
- reconstruction_max_batch_size (int | None)
- reconstruction_max_sequence_length (int | None)
- reconstruction_max_num_embeddings (int | None)
- tool_parser (str | None)
- _debug_clean_embeds (bool)
- custom_chat_template_file (str | None)
- profile (bool)
- profile_data_folder (Path)
- logging_config_file (str)
- license_manager_configuration_file (Path)
- allowed_headers (list[str])
- cors_allow_origins (list[str])
- cors_allow_headers (list[str])
- request_headers_to_add (dict[str, str] | None)
- request_headers_to_log (list[str] | None)
- enable_opentelemetry (bool)
- otel_excluded_urls (list[str])

Validators:
- _validate_device → device
- _validate_tensor_parallel_size → tensor_parallel_size
- _validate_inference_service_host → inference_service_host
- _split_allowed_headers → allowed_headers
- _split_cors_allowed_origins → cors_allow_origins
- _split_cors_allow_headers → cors_allow_headers
- _split_request_headers_to_log → request_headers_to_log
- _validate_request_headers_to_add → request_headers_to_add
- _validate_compile_noise_layer_forward
- _set_default_sgt_noise_layer_attention
- _validate_all
- _check_sagemaker_incompatibilities
- _set_otel_excluded_urls
inference_service_host (pydantic-field)¶
inference_service_host: str
The hostname of the upstream service.
This should include the protocol (http or https) and the port if not the default.
Examples:
- http://localhost:8000
- https://example.com:443
- http://vllm:8080
sgt_path (pydantic-field)¶
sgt_path: str
Path to the Stained Glass Transform file.
This can be a path to a .sgt zipfile or a model name on the Hugging Face Hub
(such as Protopia/SGT-for-llama-3.1-8b-instruct-rare-rain-bfloat16). Passing a local directory is not supported.
sgt_name (pydantic-field)¶
sgt_name: str | None = None
Optional name to use for the SGT in the /models endpoint and as the model id for /v1/chat/completions and /v1/completions requests.
When this is set, the model can be referred to either as <base model name> or <base model name>/<sgt name>, and both options will be visible from the /models endpoint.
Note
If set, this overrides the SGT name stored in the SGT file.
min_new_tokens (pydantic-field)¶
min_new_tokens: int | None = None
The minimum number of new tokens to generate.
repetition_penalty (pydantic-field)¶
repetition_penalty: float = 1.0
The default repetition penalty for generation.
upstream_keep_alive_timeout (pydantic-field)¶
upstream_keep_alive_timeout: float = 5
Timeout for idle connections with the upstream inference server.
session_timeout (pydantic-field)¶
session_timeout: float = 60
Timeout for connections with the upstream inference server.
api_username (pydantic-field)¶
api_username: str | None = None
The username for upstream inference server authentication.
api_password (pydantic-field)¶
api_password: str | None = None
The password for upstream inference server authentication.
use_aiohttp_for_upstream (pydantic-field)¶
use_aiohttp_for_upstream: bool = False
Prefer aiohttp over httpx for managing HTTP connections with the upstream inference server. This may improve throughput when handling many concurrent requests.
sagemaker_endpoint_name (pydantic-field)¶
sagemaker_endpoint_name: str | None = None
Name of the AWS SageMaker AI endpoint to use as the upstream LLM inference server. Should be None unless using a SageMaker endpoint.
When set, inference_service_host must be set to the empty string; otherwise, the proxy will raise an error while loading.
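As a sketch, a SageMaker-backed configuration sets both variables together (the endpoint name below is a placeholder):

```shell
# Route inference through a SageMaker endpoint; the host must be empty.
export SGP_SAGEMAKER_ENDPOINT_NAME="my-sagemaker-endpoint"  # placeholder name
export SGP_INFERENCE_SERVICE_HOST=""
```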
device (pydantic-field)¶
device: Literal['cpu', 'cuda', 'mps'] = 'cpu'
Device that Stained Glass Transform will run on.
Note
When set to cuda, the cuda:0 device is used unless tensor_parallel_size is set, in which case the tensor parallel settings
will take precedence. To avoid using the cuda:0 device in non-tensor parallel environments, we recommend setting the
CUDA_VISIBLE_DEVICES environment variable.
For example, if you would like the proxy to use cuda:1 instead of cuda:0, you would also set the CUDA_VISIBLE_DEVICES=1
environment variable.
Warning
"mps" is only supported on Apple Silicon Macs. This support is considered experimental.
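The note above can be sketched as follows: to run the transform on the GPU that the driver enumerates as device 1, restrict visibility so that the proxy's cuda:0 maps to that physical device:

```shell
# Make physical GPU 1 the only visible device; it becomes cuda:0 in-process.
export CUDA_VISIBLE_DEVICES=1
export SGP_DEVICE="cuda"
```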
num_sgt_workers (pydantic-field)¶
num_sgt_workers: int = 1
The number of SGT workers.
Increasing this may improve throughput when handling many concurrent requests, but will also increase memory usage.
compile_noise_layer_forward (pydantic-field)¶
compile_noise_layer_forward: bool | None = None
Whether to compile the noise layer's forward function.
Compiling the forward function can provide around a 2x speedup when using CUDA. On MPS, this feature is still experimental, and enabling it is not recommended until it becomes more stable.
tensor_parallel_size (pydantic-field)¶
tensor_parallel_size: int | None = None
Tensor parallel size (must be a power of 2, at least 2 and at most the number of available GPUs). If None, tensor parallelism is disabled.
grace_period_seconds (pydantic-field)¶
grace_period_seconds: float = 5
The grace period in seconds to wait for workers to shut down.
worker_ready_timeout_seconds (pydantic-field)¶
worker_ready_timeout_seconds: float | None = None
The timeout in seconds to wait for workers to signal ready during startup. If None, waits indefinitely.
sgt_torch_dtype (pydantic-field)¶
sgt_torch_dtype: str | None = None
The data type for the Stained Glass Transform. If None, uses the dtype that the SGT's weights are saved in.
For most use cases, we recommend this be "torch.bfloat16". If you are seeing unexpectedly large memory consumption, try explicitly
setting this option.
sgt_noise_layer_attention (pydantic-field)¶
The attention mechanism for the noise layer. If None, uses the attention mechanism saved in the SGT file.
For most use cases, we recommend this be "flash_attention_2". If you are seeing unexpectedly large memory consumption, try explicitly setting this option.
Warning
Not all attention mechanisms may be available for all dtypes and devices.
max_input_tokens (pydantic-field)¶
max_input_tokens: int | None = None
The maximum number of input tokens to allow in a /v1/chat/completions or /v1/stainedglass request.
Requests with a token count greater than this value will be rejected with a 413 status code.
sgt_third_party_model_path (pydantic-field)¶
The path or Hugging Face reference to a third-party model to load. This is useful when loading SGTs whose internal structure depends on model classes that are not importable directly from the transformers library but are present on the Hugging Face Hub. Typically this should be None.
sgt_trust_remote_code (pydantic-field)¶
sgt_trust_remote_code: bool = False
Whether to trust remote code when loading from the Hugging Face Hub.
Warning
Enabling this allows execution of arbitrary code from the model repository on Hugging Face Hub. Only enable this for models from trusted sources.
output_decryption (pydantic-field)¶
output_decryption: bool = False
Whether to decrypt the output from the upstream service.
Warning
This should only be enabled when the upstream inference service uses output encryption.
ephemeral_key_refresh_time_seconds (pydantic-field)¶
ephemeral_key_refresh_time_seconds: float = 15 * 60
The interval in seconds at which the ephemeral key is refreshed, when output decryption is enabled.
client_public_key_header_name (pydantic-field)¶
client_public_key_header_name: str = 'x-client-public-key'
The name of the HTTP header to use for passing the client public key to the upstream inference server.
server_public_key_header_name (pydantic-field)¶
server_public_key_header_name: str = 'x-server-public-key'
The name of the HTTP header to use for parsing the server public key from the downstream client.
reconstruction_max_batch_size (pydantic-field)¶
reconstruction_max_batch_size: int | None = None
This can be used to limit the batch size for reconstruction tasks and thereby their memory usage.
Setting this can drastically reduce memory consumption during the attempted reconstruction when using the /v1/stainedglass endpoint,
but may also increase the processing time.
reconstruction_max_sequence_length (pydantic-field)¶
reconstruction_max_sequence_length: int | None = None
This can be used to limit the sequence length for reconstruction tasks and thereby their memory usage.
Setting this can drastically reduce memory consumption during the attempted reconstruction when using the /v1/stainedglass endpoint,
but may also increase the processing time.
reconstruction_max_num_embeddings (pydantic-field)¶
reconstruction_max_num_embeddings: int | None = None
This can be used to limit the number of embeddings for reconstruction tasks and thereby their memory usage.
Setting this can drastically reduce memory consumption during the attempted reconstruction when using the /v1/stainedglass endpoint,
but may also increase the processing time.
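A hedged example of bounding reconstruction memory via environment variables (the limits below are arbitrary illustrations, not recommended values):

```shell
# Cap reconstruction work to bound peak memory; all values are illustrative.
export SGP_RECONSTRUCTION_MAX_BATCH_SIZE=8
export SGP_RECONSTRUCTION_MAX_SEQUENCE_LENGTH=2048
export SGP_RECONSTRUCTION_MAX_NUM_EMBEDDINGS=16384
```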
tool_parser (pydantic-field)¶
tool_parser: str | None = None
The tool parser to use. Available parsers can be found with vllm serve -h | grep tool-call-parser. If None, tool calls are not supported.
custom_chat_template_file (pydantic-field)¶
custom_chat_template_file: str | None = None
Path to a file containing a custom Jinja chat template. If provided, this takes precedence over custom_chat_template.
profile (pydantic-field)¶
profile: bool = False
Enable profiling for the proxy. Enabling this will create a folder for profiling data if it does not already exist.
profile_data_folder (pydantic-field)¶
profile_data_folder: Path
The folder to store profiling data. This is only used if profiling is enabled.
logging_config_file (pydantic-field)¶
logging_config_file: str = 'logging.yaml'
The logging configuration file.
license_manager_configuration_file (pydantic-field)¶
license_manager_configuration_file: Path
The path to the license manager configuration file.
allowed_headers (pydantic-field)¶
List of allowed headers that will be forwarded on to the downstream service.
This is expected to be a comma-separated list, e.g. 'Modal-Key, Modal-Secret'.
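For example, to forward the two headers mentioned above, a single comma-separated value is used (the _split_allowed_headers validator splits it on commas):

```shell
export SGP_ALLOWED_HEADERS="Modal-Key, Modal-Secret"
```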
cors_allow_origins (pydantic-field)¶
List of allowed origins for CORS. This is expected to be a list of URLs. The default is to allow all origins.
cors_allow_headers (pydantic-field)¶
List of allowed headers for CORS. This is expected to be a list of http headers. The default is to allow all headers.
request_headers_to_add (pydantic-field)¶
A dictionary of headers to add to each request to the upstream inference service.
request_headers_to_log (pydantic-field)¶
List of request headers to include in log messages for each request.
This is expected to be a comma-separated list, e.g. 'X-Request-ID, User-Agent'. Only headers present in the request will be logged.
enable_opentelemetry (pydantic-field)¶
enable_opentelemetry: bool = False
Whether to enable OpenTelemetry instrumentation for the proxy.
otel_excluded_urls (pydantic-field)¶
Comma-separated list of URL paths to exclude from OpenTelemetry instrumentation.
Note
This setting is dynamically set based on the value of the environment variable
OTEL_PYTHON_FASTAPI_EXCLUDED_URLS. If not set, it defaults to an empty list.
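As a sketch based on the note above (the excluded paths below are hypothetical), exclusions are supplied through the standard OpenTelemetry variable rather than an SGP_-prefixed one:

```shell
export SGP_ENABLE_OPENTELEMETRY="true"
# Read by the FastAPI instrumentation; comma-separated URL paths (hypothetical).
export OTEL_PYTHON_FASTAPI_EXCLUDED_URLS="/health,/metrics"
```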