Environment Variables¶

Configuration can be set using environment variables. Stained Glass Proxy uses Pydantic Settings to parse environment variables.

Most settings are part of the root ProxySettings class, and those settings can be set using the environment variable names listed in the documentation for that attribute, prefixed by SGP_. For example, to set inference_service_host="http://localhost:9600", set the environment variable SGP_INFERENCE_SERVICE_HOST="http://localhost:9600".

Some attributes use a nested settings object, such as sgt_text. To set those nested settings, use the same prefixing strategy, but also include the name of the nested attribute after a double underscore (__). For example, to set sgt_text.path="<my path>", set the environment variable SGP_SGT_TEXT__PATH="<my path>".
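The naming rule above can be sketched as a small helper. `env_var_name` is a hypothetical function written for illustration and is not part of Stained Glass Proxy; it only mirrors the `SGP_` prefix and double-underscore convention described here.

```python
ENV_PREFIX = "SGP_"
NESTED_DELIMITER = "__"


def env_var_name(attribute_path: str) -> str:
    """Convert a dotted settings path into its environment variable name."""
    parts = attribute_path.split(".")
    return ENV_PREFIX + NESTED_DELIMITER.join(part.upper() for part in parts)


print(env_var_name("inference_service_host"))  # SGP_INFERENCE_SERVICE_HOST
print(env_var_name("sgt_text.path"))  # SGP_SGT_TEXT__PATH
```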

.env Files

Loading directly from a .env file is not currently supported, but you can first source a .env file to load the variables into your environment, and then run the proxy. If using a Docker Compose deployment, you can use env_file to load environment variables from a file.
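When the proxy is launched from a Python script rather than a shell, the same effect as sourcing a file can be approximated by reading the file into the child process environment. This is a minimal sketch, assuming simple `KEY=VALUE` lines with no shell expansion; `parse_dotenv` is a hypothetical helper, and the command used to start the proxy is deliberately left out.

```python
import os


def parse_dotenv(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    variables: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        if key.startswith("export "):
            key = key[len("export "):].strip()
        value = value.strip()
        # Remove one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        variables[key] = value
    return variables


example = '''
# Proxy configuration
SGP_INFERENCE_SERVICE_HOST="http://localhost:9600"
SGP_SGT_TEXT__PATH="<my path>"
'''
# Merge into a copy of the current environment; pass `child_env` as the
# `env=` argument of subprocess.run when starting the proxy.
child_env = {**os.environ, **parse_dotenv(example)}
```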

ProxySettings ¶

Settings for Stained Glass Proxy.

Methods:

Name	Description
`settings_customise_sources`	Define the sources and their order for loading the settings values.

Attributes:

Name	Type	Description
`inference_service_host`	`str`	The hostname of the upstream service.
`sgt_text`	`StainedGlassTransformSettings`	Settings for the Stained Glass Transform for text inputs.
`default_sampling_params`	`DefaultSamplingParams`	Default sampling parameters for generation when not specified in the request.
`min_new_tokens`	`int \| None`	The minimum number of new tokens to generate.
`seed`	`int \| None`	The seed for Stained Glass Transform and inference.
`temperature`	`float`	The default temperature for generation.
`top_p`	`float`	The default top-p value for generation.
`top_k`	`int`	The default top-k value for generation.
`repetition_penalty`	`float`	The default repetition penalty for generation.
`upstream_keep_alive_timeout`	`float`	Timeout for idle connections with the upstream inference server.
`upstream_connector_limit`	`int`	Maximum simultaneous upstream HTTP connections for aiohttp upstream manager.
`session_timeout`	`float`	Timeout for connections with the upstream inference server.
`api_username`	`str \| None`	The username for upstream inference server authentication.
`api_password`	`str \| None`	The password for upstream inference server authentication.
`use_aiohttp_for_upstream`	`bool`	Prefer using aiohttp over httpx to manage HTTP connections with the upstream inference server. This may improve throughput when handling many concurrent requests.
`upstream_use_chat_completions_for_prompt_embeds`	`bool`	Route transformed prompt embeddings through vLLM's `/v1/chat/completions` endpoint as `prompt_embeds` content parts, instead of converting to a legacy `/v1/completions` request.
`sagemaker_endpoint_name`	`str \| None`	Name of AWS SageMaker AI endpoint to use as the upstream LLM Inference server. Should be None unless using a SageMaker endpoint.
`max_input_tokens`	`int \| None`	The maximum number of input tokens to allow in a `/v1/chat/completions` or `/v1/stainedglass` request.
`transform_system_prompt`	`bool \| None`	Whether to apply Stained Glass Transform to system-role messages.
`output_decryption`	`bool`	Whether to decrypt the output from the upstream service.
`ephemeral_key_refresh_time_seconds`	`float`	The time in seconds to refresh the ephemeral key, when output decryption is enabled.
`client_public_key_header_name`	`str`	The name of the HTTP header to use for passing the client public key to the upstream inference server.
`server_public_key_header_name`	`str`	The name of the HTTP header to use for parsing the server public key from the downstream client.
`reconstruction_max_batch_size`	`int \| None`	This can be used to limit the batch size for reconstruction tasks and its memory usage.
`reconstruction_max_sequence_length`	`int \| None`	This can be used to limit the sequence length for reconstruction tasks and its memory usage.
`reconstruction_max_num_embeddings`	`int \| None`	This can be used to limit the number of embeddings for reconstruction tasks and its memory usage.
`tool_parser`	`str \| None`	The tool parser to use. Available parsers can be found with `vllm serve -h \| grep tool-call-parser`. If None, tool calls are not supported.
`custom_chat_template_file`	`str \| None`	Path to a file containing a custom Jinja chat template. If provided, this takes precedence over custom_chat_template.
`use_fastokens_tokenizer`	`bool`	Use `fastokens` as an alternative tokenizer by patching `transformers` during startup lifespan.
`default_chat_template_kwargs`	`Annotated[dict[str, Any], NoDecode]`	Default kwargs to render into the chat template. This can be used to inject custom variables into the chat template for use in the SGT.
`profile`	`bool`	Enable profiling for the proxy. Enabling this will create a folder for profiling data if it does not already exist.
`profile_data_folder`	`Path`	The folder to store profiling data. This is only used if profiling is enabled.
`logging_config_file`	`str`	The logging configuration file.
`license_manager_configuration_file`	`Path`	The path to the license manager configuration file.
`allowed_headers`	`Annotated[list[str], NoDecode]`	List of allowed headers that will be forwarded on to the downstream service.
`cors_allow_origins`	`Annotated[list[str], NoDecode]`	List of allowed origins for CORS. This is expected to be a list of URLs.
`cors_allow_headers`	`Annotated[list[str], NoDecode]`	List of allowed headers for CORS. This is expected to be a list of http headers.
`request_headers_to_add`	`Annotated[dict[str, str] \| None, NoDecode]`	A dictionary of headers to add to each request to the upstream inference service.
`request_headers_to_log`	`Annotated[list[str] \| None, NoDecode]`	List of request headers to include in log messages for each request.
`enable_opentelemetry`	`bool`	Whether to enable OpenTelemetry instrumentation for the proxy.
`otel_excluded_urls`	`list[str]`	Comma-separated list of URL paths to exclude from OpenTelemetry instrumentation.

inference_service_host `instance-attribute` ¶

inference_service_host: str

The hostname of the upstream service.

This should include the protocol (http or https) and the port if not the default.

Examples:

http://localhost:8000
https://example.com:443
http://vllm:8080
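A quick way to catch a value that is missing its protocol before starting the proxy is to parse it with the standard library. This check is an illustration only; `has_protocol_and_host` is a hypothetical helper, and the proxy's own validation (if any) may differ.

```python
from urllib.parse import urlsplit


def has_protocol_and_host(url: str) -> bool:
    """Return True when the URL has an http(s) scheme and a hostname."""
    parts = urlsplit(url)
    return parts.scheme in {"http", "https"} and bool(parts.hostname)


for candidate in ("http://localhost:8000", "https://example.com:443", "localhost:8000"):
    print(candidate, has_protocol_and_host(candidate))
```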

sgt_text `class-attribute` `instance-attribute` ¶

sgt_text: StainedGlassTransformSettings = Field(
    default_factory=StainedGlassTransformSettings
)

Settings for the Stained Glass Transform for text inputs.

default_sampling_params `class-attribute` `instance-attribute` ¶

default_sampling_params: DefaultSamplingParams = (
    DefaultSamplingParams()
)

Default sampling parameters for generation when not specified in the request.

min_new_tokens `class-attribute` `instance-attribute` ¶

min_new_tokens: int | None = None

The minimum number of new tokens to generate.

seed `class-attribute` `instance-attribute` ¶

seed: int | None = None

The seed for Stained Glass Transform and inference.

temperature `class-attribute` `instance-attribute` ¶

temperature: float = 0.3

The default temperature for generation.

top_p `class-attribute` `instance-attribute` ¶

top_p: float = 0.2

The default top-p value for generation.

top_k `class-attribute` `instance-attribute` ¶

top_k: int = 5000

The default top-k value for generation.

repetition_penalty `class-attribute` `instance-attribute` ¶

repetition_penalty: float = 1.0

The default repetition penalty for generation.

upstream_keep_alive_timeout `class-attribute` `instance-attribute` ¶

upstream_keep_alive_timeout: float = 5

Timeout for idle connections with the upstream inference server.

upstream_connector_limit `class-attribute` `instance-attribute` ¶

upstream_connector_limit: int = Field(default=100, ge=0)

Maximum simultaneous upstream HTTP connections for aiohttp upstream manager.

This maps to the limit parameter of aiohttp's TCPConnector. The default (100) matches aiohttp's default. This value must be greater than or equal to 0. A value of 0 disables the connection limit (i.e., no limit is enforced).

Insufficient connector limits cause two problems:

Queueing latency: Requests wait in the proxy's internal queue before acquiring a connection.
Broken pipe risk: Queued requests may reuse pooled connections that are stale or right at the keep-alive timeout boundary. If the upstream closes the connection while it sits in the queue, the request fails with connection errors.

Safety Constraints

Do not exceed the OS file descriptor limit (check ulimit -n).
Do not significantly exceed the upstream's own concurrency budget.
Memory per connection: Each active connection consumes memory. Monitor and adjust based on available resources and traffic patterns.
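The file descriptor constraint can also be checked from Python on Unix-like systems. The sketch below is an assumption-laden illustration: `connector_limit_fits` and the `reserved_fds` headroom of 64 (for log files, listening sockets, and similar) are hypothetical and should be tuned to the deployment.

```python
import resource  # Unix-only standard library module


def connector_limit_fits(
    connector_limit: int, fd_soft_limit: int, reserved_fds: int = 64
) -> bool:
    """Return True when the connector limit leaves headroom under the fd limit."""
    if fd_soft_limit == resource.RLIM_INFINITY:
        return True
    if connector_limit == 0:
        # 0 means "no limit", so it cannot be guaranteed to fit a finite fd limit.
        return False
    return connector_limit + reserved_fds <= fd_soft_limit


# The soft limit is the value reported by `ulimit -n`.
soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
print(connector_limit_fits(100, soft_limit))
```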

session_timeout `class-attribute` `instance-attribute` ¶

session_timeout: float = 60

Timeout for connections with the upstream inference server.

api_username `class-attribute` `instance-attribute` ¶

api_username: str | None = None

The username for upstream inference server authentication.

api_password `class-attribute` `instance-attribute` ¶

api_password: str | None = None

The password for upstream inference server authentication.

use_aiohttp_for_upstream `class-attribute` `instance-attribute` ¶

use_aiohttp_for_upstream: bool = False

Prefer using aiohttp over httpx to manage HTTP connections with the upstream inference server. This may improve throughput when handling many concurrent requests.

upstream_use_chat_completions_for_prompt_embeds `class-attribute` `instance-attribute` ¶

upstream_use_chat_completions_for_prompt_embeds: bool = (
    False
)

Route transformed prompt embeddings through vLLM's /v1/chat/completions endpoint as prompt_embeds content parts, instead of converting to a legacy /v1/completions request.

Requires a vLLM upstream that supports prompt_embeds in chat messages (>= v0.21.0). Leave False for backwards compatibility with older vLLM versions.

sagemaker_endpoint_name `class-attribute` `instance-attribute` ¶

sagemaker_endpoint_name: str | None = None

Name of AWS SageMaker AI endpoint to use as the upstream LLM Inference server. Should be None unless using a SageMaker endpoint.

When set, inference_service_host must be set to the empty string; otherwise, the proxy will throw an error while loading.
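The constraint between the two settings can be expressed as a small pre-flight check. This is a sketch only: `check_upstream_target` is hypothetical, the endpoint name is a made-up example, and the actual exception type and message raised by the proxy are not documented here, so `ValueError` is an assumption.

```python
from __future__ import annotations


def check_upstream_target(
    inference_service_host: str, sagemaker_endpoint_name: str | None
) -> None:
    """Raise when a SageMaker endpoint and a host are both configured."""
    if sagemaker_endpoint_name is not None and inference_service_host != "":
        raise ValueError(
            "inference_service_host must be empty when "
            "sagemaker_endpoint_name is set"
        )


check_upstream_target("", "my-endpoint")  # SageMaker upstream: accepted
check_upstream_target("http://vllm:8080", None)  # HTTP upstream: accepted
```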

max_input_tokens `class-attribute` `instance-attribute` ¶

max_input_tokens: int | None = None

The maximum number of input tokens to allow in a /v1/chat/completions or /v1/stainedglass request.

Requests with token count greater than this value will be rejected with a 413 error code.

transform_system_prompt `class-attribute` `instance-attribute` ¶

transform_system_prompt: bool | None = None

Whether to apply Stained Glass Transform to system-role messages.

If set to True or False, this setting overrides any per-request value provided in chat completion request extra body. If None, per-request value is used when provided.
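The precedence between the proxy-level setting and the per-request value can be sketched as follows. `resolve_transform_system_prompt` is a hypothetical function for illustration; what the proxy does when neither value is provided is not stated here, so the sketch simply returns None in that case.

```python
from __future__ import annotations


def resolve_transform_system_prompt(
    setting: bool | None, request_value: bool | None
) -> bool | None:
    """A True/False setting wins; otherwise fall back to the request value."""
    if setting is not None:
        return setting
    return request_value


print(resolve_transform_system_prompt(False, True))  # False
print(resolve_transform_system_prompt(None, True))  # True
```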

output_decryption `class-attribute` `instance-attribute` ¶

output_decryption: bool = False

Whether to decrypt the output from the upstream service.

Warning

This should only be enabled when the upstream inference service uses output encryption.

ephemeral_key_refresh_time_seconds `class-attribute` `instance-attribute` ¶

ephemeral_key_refresh_time_seconds: float = 15 * 60

The time in seconds to refresh the ephemeral key, when output decryption is enabled.

client_public_key_header_name `class-attribute` `instance-attribute` ¶

client_public_key_header_name: str = 'x-client-public-key'

The name of the HTTP header to use for passing the client public key to the upstream inference server.

server_public_key_header_name `class-attribute` `instance-attribute` ¶

server_public_key_header_name: str = 'x-server-public-key'

The name of the HTTP header to use for parsing the server public key from the downstream client.

reconstruction_max_batch_size `class-attribute` `instance-attribute` ¶

reconstruction_max_batch_size: int | None = None

This can be used to limit the batch size for reconstruction tasks and its memory usage.

Setting this can drastically reduce memory consumption during the attempted reconstruction when using the /v1/stainedglass endpoint, but may also increase the processing time.

reconstruction_max_sequence_length `class-attribute` `instance-attribute` ¶

reconstruction_max_sequence_length: int | None = None

This can be used to limit the sequence length for reconstruction tasks and its memory usage.

Setting this can drastically reduce memory consumption during the attempted reconstruction when using the /v1/stainedglass endpoint, but may also increase the processing time.

reconstruction_max_num_embeddings `class-attribute` `instance-attribute` ¶

reconstruction_max_num_embeddings: int | None = None

This can be used to limit the number of embeddings for reconstruction tasks and its memory usage.

Setting this can drastically reduce memory consumption during the attempted reconstruction when using the /v1/stainedglass endpoint, but may also increase the processing time.

tool_parser `class-attribute` `instance-attribute` ¶

tool_parser: str | None = None

The tool parser to use. Available parsers can be found with vllm serve -h | grep tool-call-parser. If None, tool calls are not supported.

custom_chat_template_file `class-attribute` `instance-attribute` ¶

custom_chat_template_file: str | None = None

Path to a file containing a custom Jinja chat template. If provided, this takes precedence over custom_chat_template.

use_fastokens_tokenizer `class-attribute` `instance-attribute` ¶

use_fastokens_tokenizer: bool = False

Use fastokens as an alternative tokenizer by patching transformers during startup lifespan.

default_chat_template_kwargs `class-attribute` `instance-attribute` ¶

default_chat_template_kwargs: Annotated[
    dict[str, Any], NoDecode
] = Field(default_factory=dict)

Default kwargs to render into the chat template. This can be used to inject custom variables into the chat template for use in the SGT. This is especially useful to enable or disable thinking for chat templates which support the feature.

Examples:

>>> SGP_DEFAULT_CHAT_TEMPLATE_KWARGS='{"enable_thinking": true, "conversation_id": "abc123"}'
>>> SGP_DEFAULT_CHAT_TEMPLATE_KWARGS='{"enable_thinking": false, "conversation_id": "def456"}'
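The examples above are JSON objects, so booleans must be written as lowercase `true`/`false` rather than Python's `True`/`False`. When the value is assembled in Python, `json.dumps` produces the right spelling. This sketch assumes, based on the examples, that the proxy parses the variable as JSON.

```python
import json
import os

chat_template_kwargs = {"enable_thinking": False}

# json.dumps emits JSON booleans, matching the format shown in the examples.
os.environ["SGP_DEFAULT_CHAT_TEMPLATE_KWARGS"] = json.dumps(chat_template_kwargs)
print(os.environ["SGP_DEFAULT_CHAT_TEMPLATE_KWARGS"])  # {"enable_thinking": false}
```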

profile `class-attribute` `instance-attribute` ¶

profile: bool = False

Enable profiling for the proxy. Enabling this will create a folder for profiling data if it does not already exist.

profile_data_folder `class-attribute` `instance-attribute` ¶

profile_data_folder: Path = Path('profile')

The folder to store profiling data. This is only used if profiling is enabled.

logging_config_file `class-attribute` `instance-attribute` ¶

logging_config_file: str = 'logging.yaml'

The logging configuration file.

license_manager_configuration_file `class-attribute` `instance-attribute` ¶

license_manager_configuration_file: Path = Path(
    "license_manager_configuration.json"
)

The path to the license manager configuration file.

allowed_headers `class-attribute` `instance-attribute` ¶

allowed_headers: Annotated[list[str], NoDecode] = []

List of allowed headers that will be forwarded on to the downstream service.

This is expected to be a comma-separated list, e.g. 'Modal-Key, Modal-Secret'.
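A comma-separated value such as the one above can be turned into a list along these lines. `split_header_list` is a hypothetical helper; trimming whitespace and dropping empty items are assumptions about reasonable behavior, not a description of the proxy's parser.

```python
def split_header_list(raw: str) -> list[str]:
    """Split a comma-separated string into trimmed, non-empty items."""
    return [item.strip() for item in raw.split(",") if item.strip()]


print(split_header_list("Modal-Key, Modal-Secret"))  # ['Modal-Key', 'Modal-Secret']
```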

cors_allow_origins `class-attribute` `instance-attribute` ¶

cors_allow_origins: Annotated[list[str], NoDecode] = ['*']

List of allowed origins for CORS. This is expected to be a list of URLs. The default is to allow all origins.

cors_allow_headers `class-attribute` `instance-attribute` ¶

cors_allow_headers: Annotated[list[str], NoDecode] = ['*']

List of allowed headers for CORS. This is expected to be a list of http headers. The default is to allow all headers.

request_headers_to_add `class-attribute` `instance-attribute` ¶

request_headers_to_add: Annotated[
    dict[str, str] | None, NoDecode
] = None

A dictionary of headers to add to each request to the upstream inference service.

request_headers_to_log `class-attribute` `instance-attribute` ¶

request_headers_to_log: Annotated[
    list[str] | None, NoDecode
] = None

List of request headers to include in log messages for each request.

This is expected to be a comma-separated list, e.g. 'X-Request-ID, User-Agent'. Only headers present in the request will be logged.

enable_opentelemetry `class-attribute` `instance-attribute` ¶

enable_opentelemetry: bool = False

Whether to enable OpenTelemetry instrumentation for the proxy.

otel_excluded_urls `class-attribute` `instance-attribute` ¶

otel_excluded_urls: list[str] = Field(default_factory=list)

Comma-separated list of URL paths to exclude from OpenTelemetry instrumentation.

Note

This setting is dynamically set based on the value of the environment variable OTEL_PYTHON_FASTAPI_EXCLUDED_URLS. If not set, it defaults to an empty list.

settings_customise_sources `classmethod` ¶

settings_customise_sources(
    settings_cls: type[BaseSettings],
    init_settings: PydanticBaseSettingsSource,
    env_settings: PydanticBaseSettingsSource,
    dotenv_settings: PydanticBaseSettingsSource,
    file_secret_settings: PydanticBaseSettingsSource,
) -> tuple[PydanticBaseSettingsSource, ...]

Define the sources and their order for loading the settings values.

Parameters:

Name	Type	Description	Default
`settings_cls`	`type[BaseSettings]`	The Settings class.	required
`init_settings`	`PydanticBaseSettingsSource`	The `InitSettingsSource` instance.	required
`env_settings`	`PydanticBaseSettingsSource`	The `EnvSettingsSource` instance.	required
`dotenv_settings`	`PydanticBaseSettingsSource`	The `DotEnvSettingsSource` instance.	required
`file_secret_settings`	`PydanticBaseSettingsSource`	The `SecretsSettingsSource` instance.	required

Returns:

Type	Description
`tuple[PydanticBaseSettingsSource, ...]`	A tuple containing the sources and their order for loading the settings values.

StainedGlassTransformSettings ¶

Settings related to the Stained Glass Transform itself.

Returned by:

Environment Variables ProxySettings sgt_text

Used by:

Environment Variables ProxySettings sgt_text

Attributes:

Name	Type	Description
`path`	`str`	Path to the Stained Glass Transform file.
`name`	`str \| None`	Optional name to use for the SGT in the `/models` endpoint and as the model id for `/v1/chat/completions` and `/v1/completions` requests.
`device`	`Literal['cpu', 'cuda', 'mps']`	Device that Stained Glass Transform will run on.
`num_workers`	`int`	The number of SGT workers.
`use_cache`	`bool \| None`	The value of `use_cache` to pass to the modeling blocks underlying the SGT mean and std estimator modules.
`tensor_parallel_size`	`int \| None`	Tensor parallel size (must be a power of 2, min 2, max num_gpus). If None, disables tensor parallelism.
`grace_period_seconds`	`float`	The grace period in seconds to wait for workers to shut down.
`worker_ready_timeout_seconds`	`float \| None`	The timeout in seconds to wait for workers to signal ready during startup. If None, waits indefinitely.
`torch_dtype`	`str \| None`	The data type for the Stained Glass Transform. If None, uses the dtype that the SGT's weights are saved in.
`noise_layer_attention`	`SupportedAttentionImplementationsType`	The attention mechanism for the noise layer. If None, uses the attention mechanism saved in the SGT file.
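`third_party_model_path`	`str \| PathLike[str] \| None`	The path or Hugging Face reference to a third-party model to load. This is useful when loading SGTs whose internal structure depends on transformers which are not importable directly through transformers, but are present on the Hugging Face Hub.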
`trust_remote_code`	`bool`	Whether to trust remote code when loading from HuggingFace Hub.
`compile_noise_layer_forward`	`bool \| None`	Whether to compile the noise layer's forward function.
`override_transform_all_tokens`	`bool \| None`	Override the `transform_all_tokens` setting from the SGT file.
`embedding_compression`	`EmbeddingCompressionStrategy`	Compression algorithm for SGT embeddings sent to the upstream vLLM inference server.
`embedding_compression_bits`	`Annotated[Literal[1, 2, 3, 4, 8], BeforeValidator(int)]`	Quantization bit-width used when `embedding_compression` is `turboquant`.

path `instance-attribute` ¶

path: str

Path to the Stained Glass Transform file.

This can be a path to a .sgt zipfile or a model name on the Hugging Face Hub (such as Protopia/SGT-for-llama-3.1-8b-instruct-rare-rain-bfloat16). Passing a local directory is not supported.

name `class-attribute` `instance-attribute` ¶

name: str | None = None

Optional name to use for the SGT in the /models endpoint and as the model id for /v1/chat/completions and /v1/completions requests.

When this is set, the model can be referred to either as <base model name> or <base model name>/<sgt name>, and both options will be visible from the /models endpoint.

device `class-attribute` `instance-attribute` ¶

device: Literal['cpu', 'cuda', 'mps'] = 'cpu'

Device that Stained Glass Transform will run on.

num_workers `class-attribute` `instance-attribute` ¶

num_workers: int = 1

The number of SGT workers.

Increasing this may improve throughput when handling many concurrent requests, but will also increase memory usage.

use_cache `class-attribute` `instance-attribute` ¶

use_cache: bool | None = None

The value of use_cache to pass to the modeling blocks underlying the SGT mean and std estimator modules.

tensor_parallel_size `class-attribute` `instance-attribute` ¶

tensor_parallel_size: int | None = None

Tensor parallel size (must be a power of 2, min 2, max num_gpus). If None, disables tensor parallelism.
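The stated constraints (power of 2, at least 2, at most the number of GPUs) can be checked with a few lines. `is_valid_tensor_parallel_size` is a hypothetical helper, and `num_gpus` is supplied by the caller rather than detected.

```python
from __future__ import annotations


def is_valid_tensor_parallel_size(size: int | None, num_gpus: int) -> bool:
    """Check the documented constraints; None disables tensor parallelism."""
    if size is None:
        return True
    # A positive integer is a power of two when it has exactly one bit set.
    is_power_of_two = size > 0 and (size & (size - 1)) == 0
    return is_power_of_two and 2 <= size <= num_gpus


print(is_valid_tensor_parallel_size(4, num_gpus=8))  # True
print(is_valid_tensor_parallel_size(3, num_gpus=8))  # False
```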

grace_period_seconds `class-attribute` `instance-attribute` ¶

grace_period_seconds: float = 5

The grace period in seconds to wait for workers to shut down.

worker_ready_timeout_seconds `class-attribute` `instance-attribute` ¶

worker_ready_timeout_seconds: float | None = None

The timeout in seconds to wait for workers to signal ready during startup. If None, waits indefinitely.

torch_dtype `class-attribute` `instance-attribute` ¶

torch_dtype: str | None = None

The data type for the Stained Glass Transform. If None, uses the dtype that the SGT's weights are saved in.

For most use cases, we recommend this be "torch.bfloat16". If you are seeing unexpectedly large memory consumption, try explicitly setting this option.

noise_layer_attention `class-attribute` `instance-attribute` ¶

noise_layer_attention: SupportedAttentionImplementationsType = None

The attention mechanism for the noise layer. If None, uses the attention mechanism saved in the SGT file.

For most use cases, we recommend this to be "flash_attention_2". If you are seeing unexpectedly large memory consumption, try explicitly setting this option.

Warning

Not all attention mechanisms may be available for all dtypes and devices.

third_party_model_path `class-attribute` `instance-attribute` ¶

third_party_model_path: str | PathLike[str] | None = None

The path or Hugging Face reference to a third-party model to load. This is useful when loading SGTs whose internal structure depends on transformers that are not importable directly through the transformers library, but are present on the Hugging Face Hub. Typically this should be None.

trust_remote_code `class-attribute` `instance-attribute` ¶

trust_remote_code: bool = False

Whether to trust remote code when loading from HuggingFace Hub.

Warning

Enabling this allows execution of arbitrary code from the model repository on Hugging Face Hub. Only enable this for models from trusted sources.

compile_noise_layer_forward `class-attribute` `instance-attribute` ¶

compile_noise_layer_forward: bool | None = None

Whether to compile the noise layer's forward function.

Compiling the forward function can provide around a 2x speedup when using CUDA. On MPS, this feature is still experimental, and enabling it is not recommended until it becomes more stable.

override_transform_all_tokens `class-attribute` `instance-attribute` ¶

override_transform_all_tokens: bool | None = None

Override the transform_all_tokens setting from the SGT file.

"Transform all tokens" refers to whether or not the SGT is applied to all tokens in the input, including the chat template itself, or if it is just applied to user-provided inputs (such as message contents).

When upstream_use_chat_completions_for_prompt_embeds and transform_all_tokens are both True, although the SGT is in fact applied to all tokens, only the user-provided inputs are extracted and sent to the upstream as prompt_embeds content parts. When the chat template is reassembled in the upstream provider, the chat template will effectively be untransformed (but the messages will be still sent as prompt_embeds). Consequently, the transform_all_tokens setting does not really have much of an effect.

This setting is helpful when using the /v1/chat/completions endpoint with upstream_use_chat_completions_for_prompt_embeds=False. In that case, the SGT is applied to the entire prompt, including the chat template, and the entire transformed prompt_embeds are sent to the upstream's /v1/completions endpoint. The upstream then does not do any chat template manipulation, so its inputs are truly transformed before hitting the model.

This setting is also particularly useful when using the /v1/completions endpoint. In that case, the SGT has no way of calculating a noise mask to determine which tokens are user-provided and which are from the chat template, so the SGT will simply transform everything. If the SGT file settings dictate that transform_all_tokens=False, SGT Proxy will throw an error on the request to avoid transforming all of the tokens in an undefined way (since the SGT was trained without transforming chat template tokens). This override setting allows you to still use that SGT with the /v1/completions endpoint, albeit with the caveat that all tokens will indeed be transformed, which may result in worse generation quality if the SGT was not trained with transforming all tokens.

If None, the transform_all_tokens setting from the SGT file will be used. If True or False, this setting will take precedence over the SGT file's setting.

embedding_compression `class-attribute` `instance-attribute` ¶

embedding_compression: EmbeddingCompressionStrategy = NONE

Compression algorithm for SGT embeddings sent to the upstream vLLM inference server.

When set to anything other than none, the proxy encodes embeddings before sending them over the wire. The vLLM server must have the stainedglass_output_protection TurboQuant plugin active (stainedglass_turboquant registered under vllm.general_plugins) to decode them.

embedding_compression_bits `class-attribute` `instance-attribute` ¶

embedding_compression_bits: Annotated[
    Literal[1, 2, 3, 4, 8], BeforeValidator(int)
] = 8

Quantization bit-width used when embedding_compression is turboquant.

Must be one of the bit-widths supported by TurboQuant ({1, 2, 3, 4, 8}). Has no effect when embedding_compression is none.
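Environment variables arrive as strings, and the `BeforeValidator(int)` in the type annotation suggests the value is converted to an integer before being checked against the allowed bit-widths. The following plain-Python sketch mirrors that two-step check; `parse_compression_bits` is hypothetical and is not the Pydantic implementation.

```python
ALLOWED_BITS = {1, 2, 3, 4, 8}


def parse_compression_bits(raw: str) -> int:
    """Convert an environment string to int, then check the allowed widths."""
    bits = int(raw)  # raises ValueError for non-numeric input
    if bits not in ALLOWED_BITS:
        raise ValueError(f"bit-width must be one of {sorted(ALLOWED_BITS)}, got {bits}")
    return bits


print(parse_compression_bits("4"))  # 4
```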

DefaultSamplingParams ¶

Default sampling parameters for generation when not specified in the request.

Returned by:

Environment Variables ProxySettings default_sampling_params

Used by:

Environment Variables ProxySettings default_sampling_params

Attributes:

Name	Type	Description
`override_max_tokens`	`int \| None`	Hard cap on max_tokens that takes precedence over any value in the request.
`ignore_request_null_max_tokens`	`bool`	Ignore `null` values for `max_tokens`/`max_completion_tokens` in the request, and apply the next inferred value following precedence rules.
`default_max_tokens`	`int \| None`	When a request does not specify max_tokens or max_completion_tokens, this value will be injected into the request sent to the upstream inference service.
`allow_determining_default_max_tokens_from_upstream`	`bool`	Whether to try determining the max_tokens from the Upstream Provider.
`allow_determining_default_max_tokens_from_sgt`	`bool`	Whether to try determining the max_tokens from the SGT's tokenizer.

override_max_tokens `class-attribute` `instance-attribute` ¶

override_max_tokens: int | None = None

Hard cap on max_tokens that takes precedence over any value in the request.

This sits at the top of the precedence order and is applied even when the client explicitly provides a max_tokens or max_completion_tokens value, acting as a true operator override rather than a default.

None indicates that no override should be applied. -1 indicates that the overridden max tokens should be unlimited (i.e. "max_tokens" will have a value of null when serialized to json and sent to the upstream inference service).

ignore_request_null_max_tokens `class-attribute` `instance-attribute` ¶

ignore_request_null_max_tokens: bool = False

Ignore null values for max_tokens/max_completion_tokens in the request, and apply the next inferred value following precedence rules.

This can be useful for versions of vLLM that raise an error when max_tokens is explicitly set to null when sent to the /v1/completions endpoint.

default_max_tokens `class-attribute` `instance-attribute` ¶

default_max_tokens: int | None = None

When a request does not specify max_tokens or max_completion_tokens, this value will be injected into the request sent to the upstream inference service.

None indicates that no default should be applied. -1 indicates that the default max tokens should be unlimited (i.e. "max_tokens" will have a value of null when serialized to json and sent to the upstream inference service).

allow_determining_default_max_tokens_from_upstream `class-attribute` `instance-attribute` ¶

allow_determining_default_max_tokens_from_upstream: bool = (
    True
)

Whether to try determining the max_tokens from the Upstream Provider.

If neither the override setting, the request, nor the default specifies the max tokens value, allow determining the default max tokens value by querying the upstream provider's /v1/models endpoint. If this request succeeds and the provider exposes the model's maximum context window, the default max tokens for a request will be set to (model context window - input token count).

vLLM is the only tested upstream provider for this feature at the moment, but other providers may expose similar functionality.

allow_determining_default_max_tokens_from_sgt `class-attribute` `instance-attribute` ¶

allow_determining_default_max_tokens_from_sgt: bool = True

Whether to try determining the max_tokens from the SGT's tokenizer.

If neither the override setting, the request, nor the default specifies the max tokens value, and determining default max tokens from upstream is disabled or fails, allow determining the default max tokens value based on the SGT's maximum context window. If this is enabled, and the SGT file specifies a maximum context window, the default max tokens for a request will be set to (SGT max context window - input token count). Note that this is usually an upper bound on the maximum context window of the hosted model. Upstream providers may host the model with a smaller context window.
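Taken together, the DefaultSamplingParams attributes describe a precedence order: override, then request, then default, then upstream, then SGT. The sketch below is one reading of that order, not the proxy's implementation. `resolve_max_tokens`, the `UNSET` sentinel, and the two `*_context_window` arguments are hypothetical; a context window of None stands for "disabled, failed, or not available", and returning None when nothing applies is an assumption.

```python
from __future__ import annotations

UNSET = object()  # the request did not mention max_tokens at all


def resolve_max_tokens(
    *,
    override_max_tokens: int | None = None,
    request_max_tokens: object = UNSET,
    ignore_request_null_max_tokens: bool = False,
    default_max_tokens: int | None = None,
    upstream_context_window: int | None = None,
    sgt_context_window: int | None = None,
    input_token_count: int = 0,
):
    """Return the max_tokens to send upstream; None means unlimited (JSON null)."""
    # 1. Operator override always wins; -1 means unlimited.
    if override_max_tokens is not None:
        return None if override_max_tokens == -1 else override_max_tokens
    # 2. An explicit null in the request is honored unless configured otherwise.
    if request_max_tokens is None and not ignore_request_null_max_tokens:
        return None
    if request_max_tokens is not UNSET and request_max_tokens is not None:
        return request_max_tokens
    # 3. Configured default; -1 means unlimited.
    if default_max_tokens is not None:
        return None if default_max_tokens == -1 else default_max_tokens
    # 4. Upstream context window first, then the SGT's context window.
    for window in (upstream_context_window, sgt_context_window):
        if window is not None:
            return window - input_token_count
    return None


print(resolve_max_tokens(override_max_tokens=256, request_max_tokens=1024))  # 256
print(resolve_max_tokens(upstream_context_window=8192, input_token_count=192))  # 8000
```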

BACKWARDS_COMPATIBLE_ENV_VARS `module-attribute` ¶

BACKWARDS_COMPATIBLE_ENV_VARS: Final[Mapping[str, str]] = {
    "SGP_SGT_PATH": "SGP_SGT_TEXT__PATH",
    "SGP_SGT_NAME": "SGP_SGT_TEXT__NAME",
    "SGP_DEVICE": "SGP_SGT_TEXT__DEVICE",
    "SGP_NUM_SGT_WORKERS": "SGP_SGT_TEXT__NUM_WORKERS",
    "SGP_SGT_USE_CACHE": "SGP_SGT_TEXT__USE_CACHE",
    "SGP_TENSOR_PARALLEL_SIZE": "SGP_SGT_TEXT__TENSOR_PARALLEL_SIZE",
    "SGP_GRACE_PERIOD_SECONDS": "SGP_SGT_TEXT__GRACE_PERIOD_SECONDS",
    "SGP_WORKER_READY_TIMEOUT_SECONDS": "SGP_SGT_TEXT__WORKER_READY_TIMEOUT_SECONDS",
    "SGP_SGT_TORCH_DTYPE": "SGP_SGT_TEXT__TORCH_DTYPE",
    "SGP_SGT_NOISE_LAYER_ATTENTION": "SGP_SGT_TEXT__NOISE_LAYER_ATTENTION",
    "SGP_SGT_THIRD_PARTY_MODEL_PATH": "SGP_SGT_TEXT__THIRD_PARTY_MODEL_PATH",
    "SGP_SGT_TRUST_REMOTE_CODE": "SGP_SGT_TEXT__TRUST_REMOTE_CODE",
    "SGP_COMPILE_NOISE_LAYER_FORWARD": "SGP_SGT_TEXT__COMPILE_NOISE_LAYER_FORWARD",
}

Mapping of backwards compatible environment variables to their new equivalents.

If an old environment variable is set and the new environment variable is not set, the value from the old environment variable will be used, and a warning will be logged.

If both the old and new environment variables are set, the value from the new environment variable will be used, a warning will be logged, and the old environment variable will be ignored.

If your deployment relies on any of the old environment variables, update it to use the new names. The old environment variables may be removed in a future major release.
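The precedence rules above can be illustrated with a short sketch. The function `apply_legacy_env_vars` and the two-entry `LEGACY_TO_NEW` mapping are hypothetical stand-ins for the proxy's own handling, which may differ in detail (for instance, in how and where the warning is emitted).

```python
import logging
from typing import Dict, Mapping

logger = logging.getLogger(__name__)

# A subset of the documented mapping, copied here for illustration.
LEGACY_TO_NEW: Mapping[str, str] = {
    "SGP_SGT_PATH": "SGP_SGT_TEXT__PATH",
    "SGP_DEVICE": "SGP_SGT_TEXT__DEVICE",
}


def apply_legacy_env_vars(
    environ: Mapping[str, str],
    legacy_to_new: Mapping[str, str] = LEGACY_TO_NEW,
) -> Dict[str, str]:
    """Return a copy of environ with legacy names resolved to the new names."""
    resolved = dict(environ)
    for old, new in legacy_to_new.items():
        if old not in environ:
            continue
        if new in environ:
            # Both set: the new variable wins and the old one is ignored.
            logger.warning("%s is ignored because %s is set.", old, new)
        else:
            # Only the old one is set: carry its value over to the new name.
            logger.warning("%s is deprecated; use %s instead.", old, new)
            resolved[new] = environ[old]
        # Drop the legacy name so only the new one remains.
        resolved.pop(old)
    return resolved


example = apply_legacy_env_vars(
    {
        "SGP_SGT_PATH": "/models/old.sgt",
        "SGP_DEVICE": "cpu",
        "SGP_SGT_TEXT__DEVICE": "cuda",
    }
)
```

Here `SGP_SGT_PATH` is carried over to `SGP_SGT_TEXT__PATH` because the new name is unset, while `SGP_DEVICE` is dropped in favor of the existing `SGP_SGT_TEXT__DEVICE` value.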
