Environment Variables¶
ProxySettings (pydantic-model)¶
Settings for Stained Glass Proxy.
Note
Any of these can be set via environment variables with the prefix SGP_. For example, to set
inference_service_host="http://localhost", set the environment variable SGP_INFERENCE_SERVICE_HOST="http://localhost".
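A minimal environment for the proxy might look like the following sketch (the host, device, and SGT path values are illustrative placeholders, not recommendations):

```shell
# Every ProxySettings field maps to an environment variable with the SGP_ prefix.
export SGP_INFERENCE_SERVICE_HOST="http://localhost:8000"
export SGP_DEVICE="cpu"
export SGP_SGT_PATH="/models/example.sgt"  # placeholder path
```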
Config:
- env_prefix: SGP_

Fields:
- otel_hide_sensitive_attributes (bool)
- inference_service_host (str)
- sgt_path (str)
- sgt_name (str | None)
- min_new_tokens (int | None)
- seed (int | None)
- temperature (float)
- top_p (float)
- top_k (int)
- repetition_penalty (float)
- upstream_keep_alive_timeout (float)
- session_timeout (float)
- api_username (str | None)
- api_password (str | None)
- use_aiohttp_for_upstream (bool)
- sagemaker_endpoint_name (str | None)
- device (Literal['cpu', 'cuda', 'mps'])
- num_sgt_workers (int)
- compile_noise_layer_forward (bool | None)
- tensor_parallel_size (int | None)
- grace_period_seconds (float)
- worker_ready_timeout_seconds (float | None)
- sgt_torch_dtype (str | None)
- sgt_noise_layer_attention (SupportedAttentionImplementationsType)
- max_input_tokens (int | None)
- sgt_third_party_model_path (str | PathLike[str] | None)
- sgt_trust_remote_code (bool)
- output_decryption (bool)
- ephemeral_key_refresh_time_seconds (float)
- client_public_key_header_name (str)
- server_public_key_header_name (str)
- reconstruction_max_batch_size (int | None)
- reconstruction_max_sequence_length (int | None)
- reconstruction_max_num_embeddings (int | None)
- tool_parser (str | None)
- _debug_clean_embeds (bool)
- custom_chat_template_file (str | None)
- profile (bool)
- profile_data_folder (Path)
- logging_config_file (str)
- license_manager_configuration_file (Path)
- allowed_headers (list[str])
- cors_allow_origins (list[str])
- cors_allow_headers (list[str])
- request_headers_to_add (dict[str, str] | None)
- request_headers_to_log (list[str] | None)
- enable_opentelemetry (bool)
- otel_excluded_urls (list[str])

Validators:
- _validate_device → device
- _validate_tensor_parallel_size → tensor_parallel_size
- _validate_inference_service_host → inference_service_host
- _split_allowed_headers → allowed_headers
- _split_cors_allowed_origins → cors_allow_origins
- _split_cors_allow_headers → cors_allow_headers
- _split_request_headers_to_log → request_headers_to_log
- _validate_request_headers_to_add → request_headers_to_add
- _validate_compile_noise_layer_forward
- _set_default_sgt_noise_layer_attention
- _validate_all
- _check_sagemaker_incompatibilities
- _set_otel_excluded_urls
inference_service_host (pydantic-field)¶
inference_service_host: str
The hostname of the upstream service.
This should include the protocol (http or https) and the port if not the default.
Examples:
- http://localhost:8000
- https://example.com:443
- http://vllm:8080
sgt_path (pydantic-field)¶
sgt_path: str
Path to the Stained Glass Transform file.
This can be a path to a .sgt zipfile or a model name on the Hugging Face Hub
(such as Protopia/SGT-for-llama-3.1-8b-instruct-rare-rain-bfloat16). Passing a local directory is not supported.
sgt_name (pydantic-field)¶
sgt_name: str | None = None
Optional name to use for the SGT in the /models endpoint and as the model id for /v1/chat/completions and /v1/completions requests.
When this is set, the model can be referred to either as <base model name> or <base model name>/<sgt name>, and both options will be visible from the /models endpoint.
Note
If set, this overrides the SGT name stored in the SGT file.
min_new_tokens (pydantic-field)¶
min_new_tokens: int | None = None
The minimum number of new tokens to generate.
repetition_penalty (pydantic-field)¶
repetition_penalty: float = 1.0
The default repetition penalty for generation.
upstream_keep_alive_timeout (pydantic-field)¶
upstream_keep_alive_timeout: float = 5
Timeout for idle connections with the upstream inference server.
session_timeout (pydantic-field)¶
session_timeout: float = 60
Timeout for connections with the upstream inference server.
api_username (pydantic-field)¶
api_username: str | None = None
The username for upstream inference server authentication.
api_password (pydantic-field)¶
api_password: str | None = None
The password for upstream inference server authentication.
use_aiohttp_for_upstream (pydantic-field)¶
use_aiohttp_for_upstream: bool = False
Prefer aiohttp over httpx for managing HTTP connections with the upstream inference server. This may improve throughput when handling many concurrent requests.
sagemaker_endpoint_name (pydantic-field)¶
sagemaker_endpoint_name: str | None = None
Name of the AWS SageMaker AI endpoint to use as the upstream LLM inference server. Should be None unless using a SageMaker endpoint.
When set, inference_service_host must be set to the empty string; otherwise, the proxy will raise an error while loading.
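As a sketch, a SageMaker-backed configuration sets both variables together (the endpoint name below is a placeholder):

```shell
# Route inference through a SageMaker endpoint; the host must be empty.
export SGP_SAGEMAKER_ENDPOINT_NAME="my-sagemaker-endpoint"  # placeholder name
export SGP_INFERENCE_SERVICE_HOST=""
```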
device (pydantic-field)¶
device: Literal['cpu', 'cuda', 'mps'] = 'cpu'
Device that Stained Glass Transform will run on.
Note
When set to cuda, the cuda:0 device is used unless tensor_parallel_size is set, in which case the tensor parallel settings
will take precedence. To avoid using the cuda:0 device in non-tensor parallel environments, we recommend setting the
CUDA_VISIBLE_DEVICES environment variable.
For example, if you would like the proxy to use cuda:1 instead of cuda:0, you would also set the CUDA_VISIBLE_DEVICES=1
environment variable.
Warning
"mps" is only supported on Apple Silicon Macs. This support is considered experimental.
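The note above can be sketched as follows: to run the transform on the GPU that the driver enumerates as device 1, restrict visibility so that the proxy's cuda:0 maps to that physical device:

```shell
# Make physical GPU 1 the only visible device; it becomes cuda:0 in-process.
export CUDA_VISIBLE_DEVICES=1
export SGP_DEVICE="cuda"
```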
num_sgt_workers (pydantic-field)¶
num_sgt_workers: int = 1
The number of SGT workers.
Increasing this may improve throughput when handling many concurrent requests, but will also increase memory usage.
compile_noise_layer_forward (pydantic-field)¶
compile_noise_layer_forward: bool | None = None
Whether to compile the noise layer's forward function.
Compiling the forward function can provide around a 2x speedup when using CUDA. On MPS, this feature is still experimental, and enabling it is not recommended until it becomes more stable.
tensor_parallel_size (pydantic-field)¶
tensor_parallel_size: int | None = None
Tensor parallel size (must be a power of 2, at least 2 and at most the number of available GPUs). If None, tensor parallelism is disabled.
grace_period_seconds (pydantic-field)¶
grace_period_seconds: float = 5
The grace period in seconds to wait for workers to shut down.
worker_ready_timeout_seconds (pydantic-field)¶
worker_ready_timeout_seconds: float | None = None
The timeout in seconds to wait for workers to signal ready during startup. If None, waits indefinitely.
sgt_torch_dtype (pydantic-field)¶
sgt_torch_dtype: str | None = None
The data type for the Stained Glass Transform. If None, uses the dtype that the SGT's weights are saved in.
For most use cases, we recommend this be "torch.bfloat16". If you are seeing unexpectedly large memory consumption, try explicitly
setting this option.
sgt_noise_layer_attention (pydantic-field)¶
The attention mechanism for the noise layer. If None, uses the attention mechanism saved in the SGT file.
For most use cases, we recommend this be "flash_attention_2". If you are seeing unexpectedly large memory consumption, try explicitly setting this option.
Warning
Not all attention mechanisms may be available for all dtypes and devices.
max_input_tokens (pydantic-field)¶
max_input_tokens: int | None = None
The maximum number of input tokens to allow in a /v1/chat/completions or /v1/stainedglass request.
Requests with a token count greater than this value will be rejected with a 413 status code.
sgt_third_party_model_path (pydantic-field)¶
The path or Hugging Face reference to a third-party model to load. This is useful when loading SGTs whose internal structure depends on model classes that are not importable directly from the transformers library but are present on the Hugging Face Hub. Typically this should be None.
sgt_trust_remote_code (pydantic-field)¶
sgt_trust_remote_code: bool = False
Whether to trust remote code when loading from the Hugging Face Hub.
Warning
Enabling this allows execution of arbitrary code from the model repository on Hugging Face Hub. Only enable this for models from trusted sources.
output_decryption (pydantic-field)¶
output_decryption: bool = False
Whether to decrypt the output from the upstream service.
Warning
This should only be enabled when the upstream inference service uses output encryption.
ephemeral_key_refresh_time_seconds (pydantic-field)¶
ephemeral_key_refresh_time_seconds: float = 15 * 60
The interval in seconds at which the ephemeral key is refreshed, when output decryption is enabled.
client_public_key_header_name (pydantic-field)¶
client_public_key_header_name: str = 'x-client-public-key'
The name of the HTTP header to use for passing the client public key to the upstream inference server.
server_public_key_header_name (pydantic-field)¶
server_public_key_header_name: str = 'x-server-public-key'
The name of the HTTP header to use for parsing the server public key from the downstream client.
reconstruction_max_batch_size (pydantic-field)¶
reconstruction_max_batch_size: int | None = None
This can be used to limit the batch size for reconstruction tasks and thereby their memory usage.
Setting this can drastically reduce memory consumption during the attempted reconstruction when using the /v1/stainedglass endpoint,
but may also increase the processing time.
reconstruction_max_sequence_length (pydantic-field)¶
reconstruction_max_sequence_length: int | None = None
This can be used to limit the sequence length for reconstruction tasks and thereby their memory usage.
Setting this can drastically reduce memory consumption during the attempted reconstruction when using the /v1/stainedglass endpoint,
but may also increase the processing time.
reconstruction_max_num_embeddings (pydantic-field)¶
reconstruction_max_num_embeddings: int | None = None
This can be used to limit the number of embeddings for reconstruction tasks and thereby their memory usage.
Setting this can drastically reduce memory consumption during the attempted reconstruction when using the /v1/stainedglass endpoint,
but may also increase the processing time.
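A hedged example of bounding reconstruction memory via environment variables (the limits below are arbitrary illustrations, not recommended values):

```shell
# Cap reconstruction work to bound peak memory; all values are illustrative.
export SGP_RECONSTRUCTION_MAX_BATCH_SIZE=8
export SGP_RECONSTRUCTION_MAX_SEQUENCE_LENGTH=2048
export SGP_RECONSTRUCTION_MAX_NUM_EMBEDDINGS=16384
```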
tool_parser (pydantic-field)¶
tool_parser: str | None = None
The tool parser to use. Available parsers can be found with vllm serve -h | grep tool-call-parser. If None, tool calls are not supported.
custom_chat_template_file (pydantic-field)¶
custom_chat_template_file: str | None = None
Path to a file containing a custom Jinja chat template. If provided, this takes precedence over custom_chat_template.
profile (pydantic-field)¶
profile: bool = False
Enable profiling for the proxy. Enabling this will create a folder for profiling data if it does not already exist.
profile_data_folder (pydantic-field)¶
profile_data_folder: Path
The folder to store profiling data. This is only used if profiling is enabled.
logging_config_file (pydantic-field)¶
logging_config_file: str = 'logging.yaml'
The logging configuration file.
license_manager_configuration_file (pydantic-field)¶
license_manager_configuration_file: Path
The path to the license manager configuration file.
allowed_headers (pydantic-field)¶
List of allowed headers that will be forwarded on to the downstream service.
This is expected to be a comma-separated list, e.g. 'Modal-Key, Modal-Secret'.
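For example, to forward the two headers mentioned above, a single comma-separated value is used (the _split_allowed_headers validator splits it on commas):

```shell
export SGP_ALLOWED_HEADERS="Modal-Key, Modal-Secret"
```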
cors_allow_origins (pydantic-field)¶
List of allowed origins for CORS. This is expected to be a list of URLs. The default is to allow all origins.
cors_allow_headers (pydantic-field)¶
List of allowed headers for CORS. This is expected to be a list of http headers. The default is to allow all headers.
request_headers_to_add (pydantic-field)¶
A dictionary of headers to add to each request to the upstream inference service.
request_headers_to_log (pydantic-field)¶
List of request headers to include in log messages for each request.
This is expected to be a comma-separated list, e.g. 'X-Request-ID, User-Agent'. Only headers present in the request will be logged.
enable_opentelemetry (pydantic-field)¶
enable_opentelemetry: bool = False
Whether to enable OpenTelemetry instrumentation for the proxy.
otel_excluded_urls (pydantic-field)¶
Comma-separated list of URL paths to exclude from OpenTelemetry instrumentation.
Note
This setting is dynamically set based on the value of the environment variable
OTEL_PYTHON_FASTAPI_EXCLUDED_URLS. If not set, it defaults to an empty list.
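As a sketch based on the note above (the excluded paths below are hypothetical), exclusions are supplied through the standard OpenTelemetry variable rather than an SGP_-prefixed one:

```shell
export SGP_ENABLE_OPENTELEMETRY="true"
# Read by the FastAPI instrumentation; comma-separated URL paths (hypothetical).
export OTEL_PYTHON_FASTAPI_EXCLUDED_URLS="/health,/metrics"
```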