Environment Variables

Configuration can be set using environment variables. Stained Glass Proxy uses Pydantic Settings to manage parsing environment variables.

Most settings are part of the root ProxySettings class, and those settings can be set using the environment variable names listed in the documentation for that attribute, prefixed by SGP_. For example, to set inference_service_host="http://localhost:9600", set the environment variable SGP_INFERENCE_SERVICE_HOST="http://localhost:9600".

Some attributes use a nested settings object, such as sgt_text. To set those nested settings, use the same prefixing strategy, but also include the name of the nested attribute after a double underscore (__). For example, to set sgt_text.path="<my path>", set the environment variable SGP_SGT_TEXT__PATH="<my path>".
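The naming rule above can be sketched as a small helper. This is only an illustration of the convention described here, not code from Stained Glass Proxy; the function name env_var_name is hypothetical.

```python
ENV_PREFIX = "SGP_"       # prefix applied to every setting
NESTED_DELIMITER = "__"   # double underscore between nesting levels


def env_var_name(attribute_path: str) -> str:
    """Convert a dotted settings path into its environment variable name."""
    parts = attribute_path.split(".")
    return ENV_PREFIX + NESTED_DELIMITER.join(part.upper() for part in parts)


print(env_var_name("inference_service_host"))  # SGP_INFERENCE_SERVICE_HOST
print(env_var_name("sgt_text.path"))           # SGP_SGT_TEXT__PATH
```

The same rule applies to the other nested settings object on this page, for example default_sampling_params.default_max_tokens.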

.env Files

Loading directly from a .env file is not currently supported, but you can first source a .env file to load the variables into your environment, and then run the proxy. If using a Docker Compose deployment, you can use env_file to load environment variables from a file.

ProxySettings

Settings for Stained Glass Proxy.

Methods:

Name Description
settings_customise_sources

Define the sources and their order for loading the settings values.

Attributes:

Name Type Description
inference_service_host str

The hostname of the upstream service.

sgt_text StainedGlassTransformSettings

Settings for the Stained Glass Transform for text inputs.

default_sampling_params DefaultSamplingParams

Default sampling parameters for generation when not specified in the request.

min_new_tokens int | None

The minimum number of new tokens to generate.

seed int | None

The seed for Stained Glass Transform and inference.

temperature float

The default temperature for generation.

top_p float

The default top-p value for generation.

top_k int

The default top-k value for generation.

repetition_penalty float

The default repetition penalty for generation.

upstream_keep_alive_timeout float

Timeout for idle connections with the upstream inference server.

upstream_connector_limit int

Maximum simultaneous upstream HTTP connections for aiohttp upstream manager.

session_timeout float

Timeout for connections with the upstream inference server.

api_username str | None

The username for upstream inference server authentication.

api_password str | None

The password for upstream inference server authentication.

use_aiohttp_for_upstream bool

Prefer using aiohttp over httpx to manage HTTP connections with the upstream inference server. This may improve throughput when handling many concurrent requests.

upstream_use_chat_completions_for_prompt_embeds bool

Route transformed prompt embeddings through vLLM's /v1/chat/completions endpoint as prompt_embeds content parts, instead of converting to a legacy /v1/completions request.

sagemaker_endpoint_name str | None

Name of AWS SageMaker AI endpoint to use as the upstream LLM Inference server. Should be None unless using a SageMaker endpoint.

max_input_tokens int | None

The maximum number of input tokens to allow in a /v1/chat/completions or /v1/stainedglass request.

transform_system_prompt bool | None

Whether to apply Stained Glass Transform to system-role messages.

output_decryption bool

Whether to decrypt the output from the upstream service.

ephemeral_key_refresh_time_seconds float

The time in seconds to refresh the ephemeral key, when output decryption is enabled.

client_public_key_header_name str

The name of the HTTP header to use for passing the client public key to the upstream inference server.

server_public_key_header_name str

The name of the HTTP header to use for parsing the server public key from the downstream client.

reconstruction_max_batch_size int | None

This can be used to limit the batch size for reconstruction tasks and its memory usage.

reconstruction_max_sequence_length int | None

This can be used to limit the sequence length for reconstruction tasks and its memory usage.

reconstruction_max_num_embeddings int | None

This can be used to limit the number of embeddings for reconstruction tasks and its memory usage.

tool_parser str | None

The tool parser to use. Available parsers can be found with vllm serve -h | grep tool-call-parser. If None, tool calls are not supported.

custom_chat_template_file str | None

Path to a file containing a custom Jinja chat template. If provided, this takes precedence over custom_chat_template.

use_fastokens_tokenizer bool

Use fastokens as an alternative tokenizer by patching transformers during startup lifespan.

default_chat_template_kwargs Annotated[dict[str, Any], NoDecode]

Default kwargs to render into the chat template. This can be used to inject custom variables into the chat template for use in the SGT.

profile bool

Enable profiling for the proxy. Enabling this will create a folder for profiling data if it does not already exist.

profile_data_folder Path

The folder to store profiling data. This is only used if profiling is enabled.

logging_config_file str

The logging configuration file.

license_manager_configuration_file Path

The path to the license manager configuration file.

allowed_headers Annotated[list[str], NoDecode]

List of allowed headers that will be forwarded on to the downstream service.

cors_allow_origins Annotated[list[str], NoDecode]

List of allowed origins for CORS. This is expected to be a list of URLs.

cors_allow_headers Annotated[list[str], NoDecode]

List of allowed headers for CORS. This is expected to be a list of HTTP headers.

request_headers_to_add Annotated[dict[str, str] | None, NoDecode]

A dictionary of headers to add to each request to the upstream inference service.

request_headers_to_log Annotated[list[str] | None, NoDecode]

List of request headers to include in log messages for each request.

enable_opentelemetry bool

Whether to enable OpenTelemetry instrumentation for the proxy.

otel_excluded_urls list[str]

Comma separated list of URL paths to exclude from OpenTelemetry instrumentation.

inference_service_host instance-attribute

inference_service_host: str

The hostname of the upstream service.

This should include the protocol (http or https) and the port if not the default.

Examples:

  • http://localhost:8000
  • https://example.com:443
  • http://vllm:8080

sgt_text class-attribute instance-attribute

sgt_text: StainedGlassTransformSettings = Field(
    default_factory=StainedGlassTransformSettings
)

Settings for the Stained Glass Transform for text inputs.

default_sampling_params class-attribute instance-attribute

default_sampling_params: DefaultSamplingParams = (
    DefaultSamplingParams()
)

Default sampling parameters for generation when not specified in the request.

min_new_tokens class-attribute instance-attribute

min_new_tokens: int | None = None

The minimum number of new tokens to generate.

seed class-attribute instance-attribute

seed: int | None = None

The seed for Stained Glass Transform and inference.

temperature class-attribute instance-attribute

temperature: float = 0.3

The default temperature for generation.

top_p class-attribute instance-attribute

top_p: float = 0.2

The default top-p value for generation.

top_k class-attribute instance-attribute

top_k: int = 5000

The default top-k value for generation.

repetition_penalty class-attribute instance-attribute

repetition_penalty: float = 1.0

The default repetition penalty for generation.

upstream_keep_alive_timeout class-attribute instance-attribute

upstream_keep_alive_timeout: float = 5

Timeout for idle connections with the upstream inference server.

upstream_connector_limit class-attribute instance-attribute

upstream_connector_limit: int = Field(default=100, ge=0)

Maximum simultaneous upstream HTTP connections for aiohttp upstream manager.

This maps to aiohttp TCPConnector limit. The default (100) matches aiohttp's default. This value must be greater than or equal to 0. A value of 0 disables the connection limit (i.e., no limit is enforced).

Insufficient connector limits cause two problems:

  1. Queueing latency: Requests wait in the proxy's internal queue before acquiring a connection.
  2. Broken pipe risk: Queued requests may reuse pooled connections that are stale or right at the keep-alive timeout boundary. If the upstream closes the connection while the request sits in the queue, the request fails with connection errors.

Safety constraints:

  • Do not exceed the OS file descriptor limit (check ulimit -n).
  • Do not significantly exceed the upstream's own concurrency budget.
  • Memory per connection: Each active connection consumes memory. Monitor and adjust based on available resources and traffic patterns.
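As a rough way to check the first safety constraint from Python rather than with ulimit -n, the sketch below compares a candidate connector limit against the process's soft file descriptor limit. The helper name and the reserved_fds headroom are hypothetical, and the resource module is only available on Unix-like systems.

```python
import resource  # Unix-only standard library module


def connector_limit_fits(connector_limit: int, fd_soft_limit: int, reserved_fds: int = 64) -> bool:
    """Return True if the connector limit leaves headroom under the descriptor limit.

    reserved_fds is a hypothetical allowance for log files, listening sockets, etc.
    """
    if fd_soft_limit == resource.RLIM_INFINITY:
        return True  # no descriptor limit to exceed
    if connector_limit == 0:
        return False  # 0 disables the connection limit, so no bound can be guaranteed
    return connector_limit + reserved_fds <= fd_soft_limit


# getrlimit returns a (soft, hard) pair; the soft limit is what `ulimit -n` reports.
soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
print(connector_limit_fits(100, soft_limit))
```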

session_timeout class-attribute instance-attribute

session_timeout: float = 60

Timeout for connections with the upstream inference server.

api_username class-attribute instance-attribute

api_username: str | None = None

The username for upstream inference server authentication.

api_password class-attribute instance-attribute

api_password: str | None = None

The password for upstream inference server authentication.

use_aiohttp_for_upstream class-attribute instance-attribute

use_aiohttp_for_upstream: bool = False

Prefer using aiohttp over httpx to manage HTTP connections with the upstream inference server. This may improve throughput when handling many concurrent requests.

upstream_use_chat_completions_for_prompt_embeds class-attribute instance-attribute

upstream_use_chat_completions_for_prompt_embeds: bool = (
    False
)

Route transformed prompt embeddings through vLLM's /v1/chat/completions endpoint as prompt_embeds content parts, instead of converting to a legacy /v1/completions request.

Requires a vLLM upstream that supports prompt_embeds in chat messages (>= v0.21.0). Leave False for backwards compatibility with older vLLM versions.

sagemaker_endpoint_name class-attribute instance-attribute

sagemaker_endpoint_name: str | None = None

Name of AWS SageMaker AI endpoint to use as the upstream LLM Inference server. Should be None unless using a SageMaker endpoint.

When set, inference_service_host must be set to the empty string; otherwise, the proxy raises an error while loading.
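The constraint between these two settings can be expressed as a small check. This is a sketch of the documented rule only; the function name and the endpoint name "my-endpoint" are hypothetical, and the proxy's actual validation and error type may differ.

```python
from typing import Optional


def check_upstream_target(inference_service_host: str, sagemaker_endpoint_name: Optional[str]) -> None:
    """Raise ValueError if both a SageMaker endpoint and a host are configured."""
    if sagemaker_endpoint_name is not None and inference_service_host != "":
        raise ValueError(
            "inference_service_host must be the empty string when sagemaker_endpoint_name is set"
        )


check_upstream_target("http://vllm:8080", None)  # plain HTTP upstream: accepted
check_upstream_target("", "my-endpoint")         # SageMaker upstream: accepted
```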

max_input_tokens class-attribute instance-attribute

max_input_tokens: int | None = None

The maximum number of input tokens to allow in a /v1/chat/completions or /v1/stainedglass request.

Requests with token count greater than this value will be rejected with a 413 error code.

transform_system_prompt class-attribute instance-attribute

transform_system_prompt: bool | None = None

Whether to apply Stained Glass Transform to system-role messages.

If set to True or False, this setting overrides any per-request value provided in chat completion request extra body. If None, per-request value is used when provided.
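The precedence between the proxy-level setting and the per-request value can be sketched as follows. The function name is hypothetical, and what happens when neither value is provided is not specified here, so the sketch simply returns None in that case.

```python
from typing import Optional


def resolve_transform_system_prompt(
    setting: Optional[bool], request_value: Optional[bool]
) -> Optional[bool]:
    """Proxy setting wins when it is True or False; otherwise use the request value."""
    if setting is not None:
        return setting
    return request_value  # may be None if the request did not provide a value


print(resolve_transform_system_prompt(True, False))   # True: the setting overrides the request
print(resolve_transform_system_prompt(None, False))   # False: the request value is used
```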

output_decryption class-attribute instance-attribute

output_decryption: bool = False

Whether to decrypt the output from the upstream service.

Warning

This should only be enabled when the upstream inference service uses output encryption.

ephemeral_key_refresh_time_seconds class-attribute instance-attribute

ephemeral_key_refresh_time_seconds: float = 15 * 60

The time in seconds to refresh the ephemeral key, when output decryption is enabled.

client_public_key_header_name class-attribute instance-attribute

client_public_key_header_name: str = 'x-client-public-key'

The name of the HTTP header to use for passing the client public key to the upstream inference server.

server_public_key_header_name class-attribute instance-attribute

server_public_key_header_name: str = 'x-server-public-key'

The name of the HTTP header to use for parsing the server public key from the downstream client.

reconstruction_max_batch_size class-attribute instance-attribute

reconstruction_max_batch_size: int | None = None

This can be used to limit the batch size for reconstruction tasks and its memory usage.

Setting this can drastically reduce memory consumption during the attempted reconstruction when using the /v1/stainedglass endpoint, but may also increase the processing time.

reconstruction_max_sequence_length class-attribute instance-attribute

reconstruction_max_sequence_length: int | None = None

This can be used to limit the sequence length for reconstruction tasks and its memory usage.

Setting this can drastically reduce memory consumption during the attempted reconstruction when using the /v1/stainedglass endpoint, but may also increase the processing time.

reconstruction_max_num_embeddings class-attribute instance-attribute

reconstruction_max_num_embeddings: int | None = None

This can be used to limit the number of embeddings for reconstruction tasks and its memory usage.

Setting this can drastically reduce memory consumption during the attempted reconstruction when using the /v1/stainedglass endpoint, but may also increase the processing time.

tool_parser class-attribute instance-attribute

tool_parser: str | None = None

The tool parser to use. Available parsers can be found with vllm serve -h | grep tool-call-parser. If None, tool calls are not supported.

custom_chat_template_file class-attribute instance-attribute

custom_chat_template_file: str | None = None

Path to a file containing a custom Jinja chat template. If provided, this takes precedence over custom_chat_template.

use_fastokens_tokenizer class-attribute instance-attribute

use_fastokens_tokenizer: bool = False

Use fastokens as an alternative tokenizer by patching transformers during startup lifespan.

default_chat_template_kwargs class-attribute instance-attribute

default_chat_template_kwargs: Annotated[
    dict[str, Any], NoDecode
] = Field(default_factory=dict)

Default kwargs to render into the chat template. This can be used to inject custom variables into the chat template for use in the SGT. This is especially useful to enable or disable thinking for chat templates which support the feature.

Examples:

>>> SGP_DEFAULT_CHAT_TEMPLATE_KWARGS='{"enable_thinking": true, "conversation_id": "abc123"}'
>>> SGP_DEFAULT_CHAT_TEMPLATE_KWARGS='{"enable_thinking": false, "conversation_id": "def456"}'

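Before deploying a value for SGP_DEFAULT_CHAT_TEMPLATE_KWARGS, it can help to confirm that it parses as a JSON object. The sketch below assumes, based on the examples above, that the value is expected to be a JSON object; the helper name is hypothetical and the proxy's own parsing may differ.

```python
import json
from typing import Any


def parse_chat_template_kwargs(raw: str) -> dict[str, Any]:
    """Parse a JSON object string into a dict, rejecting other JSON types."""
    value = json.loads(raw)
    if not isinstance(value, dict):
        raise ValueError("expected a JSON object, for example '{\"enable_thinking\": true}'")
    return value


kwargs = parse_chat_template_kwargs('{"enable_thinking": true, "conversation_id": "abc123"}')
print(kwargs["enable_thinking"])  # True (JSON true becomes Python True)
```

Note that JSON requires lowercase true and false; the Python spellings True and False are not valid JSON.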
profile class-attribute instance-attribute

profile: bool = False

Enable profiling for the proxy. Enabling this will create a folder for profiling data if it does not already exist.

profile_data_folder class-attribute instance-attribute

profile_data_folder: Path = Path('profile')

The folder to store profiling data. This is only used if profiling is enabled.

logging_config_file class-attribute instance-attribute

logging_config_file: str = 'logging.yaml'

The logging configuration file.

license_manager_configuration_file class-attribute instance-attribute

license_manager_configuration_file: Path = Path(
    "license_manager_configuration.json"
)

The path to the license manager configuration file.

allowed_headers class-attribute instance-attribute

allowed_headers: Annotated[list[str], NoDecode] = []

List of allowed headers that will be forwarded on to the downstream service.

This is expected to be a comma-separated list, e.g. 'Modal-Key, Modal-Secret'.

cors_allow_origins class-attribute instance-attribute

cors_allow_origins: Annotated[list[str], NoDecode] = ['*']

List of allowed origins for CORS. This is expected to be a list of URLs. The default is to allow all origins.

cors_allow_headers class-attribute instance-attribute

cors_allow_headers: Annotated[list[str], NoDecode] = ['*']

List of allowed headers for CORS. This is expected to be a list of HTTP headers. The default is to allow all headers.

request_headers_to_add class-attribute instance-attribute

request_headers_to_add: Annotated[
    dict[str, str] | None, NoDecode
] = None

A dictionary of headers to add to each request to the upstream inference service.

request_headers_to_log class-attribute instance-attribute

request_headers_to_log: Annotated[
    list[str] | None, NoDecode
] = None

List of request headers to include in log messages for each request.

This is expected to be a comma-separated list, e.g. 'X-Request-ID, User-Agent'. Only headers present in the request will be logged.
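A minimal sketch of how a comma-separated value might be split and applied to a request is shown below. Both helper names are hypothetical and the proxy's own parsing may differ; the case-insensitive lookup reflects the fact that HTTP header names are case-insensitive.

```python
def parse_header_list(raw: str) -> list[str]:
    """Split a comma-separated string into header names, trimming whitespace."""
    return [name.strip() for name in raw.split(",") if name.strip()]


def select_headers_to_log(request_headers: dict[str, str], names: list[str]) -> dict[str, str]:
    """Keep only the configured headers that are actually present in the request."""
    lowered = {key.lower(): value for key, value in request_headers.items()}
    return {name: lowered[name.lower()] for name in names if name.lower() in lowered}


names = parse_header_list("X-Request-ID, User-Agent")
print(names)  # ['X-Request-ID', 'User-Agent']
print(select_headers_to_log({"x-request-id": "42", "Accept": "*/*"}, names))  # {'X-Request-ID': '42'}
```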

enable_opentelemetry class-attribute instance-attribute

enable_opentelemetry: bool = False

Whether to enable OpenTelemetry instrumentation for the proxy.

otel_excluded_urls class-attribute instance-attribute

otel_excluded_urls: list[str] = Field(default_factory=list)

Comma separated list of URL paths to exclude from OpenTelemetry instrumentation.

Note

This setting is dynamically set based on the value of the environment variable OTEL_PYTHON_FASTAPI_EXCLUDED_URLS. If not set, it defaults to an empty list.

settings_customise_sources classmethod

settings_customise_sources(
    settings_cls: type[BaseSettings],
    init_settings: PydanticBaseSettingsSource,
    env_settings: PydanticBaseSettingsSource,
    dotenv_settings: PydanticBaseSettingsSource,
    file_secret_settings: PydanticBaseSettingsSource,
) -> tuple[PydanticBaseSettingsSource, ...]

Define the sources and their order for loading the settings values.

Parameters:

Name Type Description Default
settings_cls type[BaseSettings]

The Settings class.

required
init_settings PydanticBaseSettingsSource

The InitSettingsSource instance.

required
env_settings PydanticBaseSettingsSource

The EnvSettingsSource instance.

required
dotenv_settings PydanticBaseSettingsSource

The DotEnvSettingsSource instance.

required
file_secret_settings PydanticBaseSettingsSource

The SecretsSettingsSource instance.

required

Returns:

Type Description
tuple[PydanticBaseSettingsSource, ...]

A tuple containing the sources and their order for loading the settings values.

StainedGlassTransformSettings

Settings related to the Stained Glass Transform itself.

Attributes:

Name Type Description
path str

Path to the Stained Glass Transform file.

name str | None

Optional name to use for the SGT in the /models endpoint and as the model id for /v1/chat/completions and /v1/completions requests.

device Literal['cpu', 'cuda', 'mps']

Device that Stained Glass Transform will run on.

num_workers int

The number of SGT workers.

use_cache bool | None

The value of use_cache to pass to the modeling blocks underlying the SGT mean and std estimator modules.

tensor_parallel_size int | None

Tensor parallel size (must be a power of 2, min 2, max num_gpus). If None, disables tensor parallelism.

grace_period_seconds float

The grace period in seconds to wait for workers to shut down.

worker_ready_timeout_seconds float | None

The timeout in seconds to wait for workers to signal ready during startup. If None, waits indefinitely.

torch_dtype str | None

The data type for the Stained Glass Transform. If None, uses the dtype that the SGT's weights are saved in.

noise_layer_attention SupportedAttentionImplementationsType

The attention mechanism for the noise layer. If None, uses the attention mechanism saved in the SGT file.

third_party_model_path str | PathLike[str] | None

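The path or Hugging Face reference to a third-party model to load. This is useful when loading SGTs whose internal structure depends on model implementations that are not importable directly through transformers, but are present on the Hugging Face Hub.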

trust_remote_code bool

Whether to trust remote code when loading from HuggingFace Hub.

compile_noise_layer_forward bool | None

Whether to compile the noise layer's forward function.

override_transform_all_tokens bool | None

Override the transform_all_tokens setting from the SGT file.

embedding_compression EmbeddingCompressionStrategy

Compression algorithm for SGT embeddings sent to the upstream vLLM inference server.

embedding_compression_bits Annotated[Literal[1, 2, 3, 4, 8], BeforeValidator(int)]

Quantization bit-width used when embedding_compression is turboquant.

path instance-attribute

path: str

Path to the Stained Glass Transform file.

This can be a path to a .sgt zipfile or a model name on the Hugging Face Hub (such as Protopia/SGT-for-llama-3.1-8b-instruct-rare-rain-bfloat16). Passing a local directory is not supported.

name class-attribute instance-attribute

name: str | None = None

Optional name to use for the SGT in the /models endpoint and as the model id for /v1/chat/completions and /v1/completions requests.

When this is set, the model can be referred to either as <base model name> or <base model name>/<sgt name>, and both options will be visible from the /models endpoint.

device class-attribute instance-attribute

device: Literal['cpu', 'cuda', 'mps'] = 'cpu'

Device that Stained Glass Transform will run on.

num_workers class-attribute instance-attribute

num_workers: int = 1

The number of SGT workers.

Increasing this may improve throughput when handling many concurrent requests, but will also increase memory usage.

use_cache class-attribute instance-attribute

use_cache: bool | None = None

The value of use_cache to pass to the modeling blocks underlying the SGT mean and std estimator modules.

tensor_parallel_size class-attribute instance-attribute

tensor_parallel_size: int | None = None

Tensor parallel size (must be a power of 2, min 2, max num_gpus). If None, disables tensor parallelism.
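The stated constraints (a power of 2, at least 2, at most the number of GPUs) can be checked with a short helper. This is a sketch of the documented rule; the function name is hypothetical and num_gpus is supplied by the caller rather than detected.

```python
from typing import Optional


def is_valid_tensor_parallel_size(size: Optional[int], num_gpus: int) -> bool:
    """Check the documented constraints; None means tensor parallelism is disabled."""
    if size is None:
        return True
    # A positive integer is a power of two when it has exactly one bit set.
    is_power_of_two = size > 0 and (size & (size - 1)) == 0
    return is_power_of_two and 2 <= size <= num_gpus


print(is_valid_tensor_parallel_size(4, num_gpus=8))  # True
print(is_valid_tensor_parallel_size(3, num_gpus=8))  # False: not a power of 2
```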

grace_period_seconds class-attribute instance-attribute

grace_period_seconds: float = 5

The grace period in seconds to wait for workers to shut down.

worker_ready_timeout_seconds class-attribute instance-attribute

worker_ready_timeout_seconds: float | None = None

The timeout in seconds to wait for workers to signal ready during startup. If None, waits indefinitely.

torch_dtype class-attribute instance-attribute

torch_dtype: str | None = None

The data type for the Stained Glass Transform. If None, uses the dtype that the SGT's weights are saved in.

For most use cases, we recommend this be "torch.bfloat16". If you are seeing unexpectedly large memory consumption, try explicitly setting this option.

noise_layer_attention class-attribute instance-attribute

noise_layer_attention: SupportedAttentionImplementationsType = None

The attention mechanism for the noise layer. If None, uses the attention mechanism saved in the SGT file.

For most use cases, we recommend this to be "flash_attention_2". If you are seeing unexpectedly large memory consumption, try explicitly setting this option.

Warning

Not all attention mechanisms may be available for all dtypes and devices.

third_party_model_path class-attribute instance-attribute

third_party_model_path: str | PathLike[str] | None = None

The path or Hugging Face reference to a third-party model to load. This is useful when loading SGTs whose internal structure depends on model implementations that are not importable directly through transformers, but are present on the Hugging Face Hub. Typically this should be None.

trust_remote_code class-attribute instance-attribute

trust_remote_code: bool = False

Whether to trust remote code when loading from HuggingFace Hub.

Warning

Enabling this allows execution of arbitrary code from the model repository on Hugging Face Hub. Only enable this for models from trusted sources.

compile_noise_layer_forward class-attribute instance-attribute

compile_noise_layer_forward: bool | None = None

Whether to compile the noise layer's forward function.

Compiling the forward function can provide around a 2x speedup when using CUDA. On MPS, this feature is still experimental, and enabling it is not recommended until it becomes more stable.

override_transform_all_tokens class-attribute instance-attribute

override_transform_all_tokens: bool | None = None

Override the transform_all_tokens setting from the SGT file.

"Transform all tokens" refers to whether or not the SGT is applied to all tokens in the input, including the chat template itself, or if it is just applied to user-provided inputs (such as message contents).

When upstream_use_chat_completions_for_prompt_embeds and transform_all_tokens are both True, the SGT is still applied to all tokens, but only the user-provided inputs are extracted and sent to the upstream as prompt_embeds content parts. When the upstream provider reassembles the chat template, the template tokens are effectively untransformed (the messages are still sent as prompt_embeds). Consequently, the transform_all_tokens setting has little effect in this configuration.

This setting is helpful when using the /v1/chat/completions endpoint with upstream_use_chat_completions_for_prompt_embeds=False. In that case, the SGT is applied to the entire prompt, including the chat template, and the entire transformed prompt_embeds are sent to the upstream's /v1/completions endpoint. In that case, the upstream does not do any chat template manipulation, so its inputs are truly transformed before hitting the model.

This setting is also particularly useful when using the /v1/completions endpoint. In that case, the SGT has no way of calculating a noise mask to determine which tokens are user-provided and which are from the chat template, so the SGT will simply transform everything. If the SGT file settings dictate that transform_all_tokens=False, SGT Proxy will throw an error on the request to avoid transforming all of the tokens in an undefined way (since the SGT was trained without transforming chat template tokens). This override setting allows you to still use that SGT with the /v1/completions endpoint, albeit with the caveat that all tokens will indeed be transformed, which may result in worse generation quality if the SGT was not trained with transforming all tokens.

If None, the transform_all_tokens setting from the SGT file will be used. If True or False, this setting will take precedence over the SGT file's setting.
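The resolution rule and the /v1/completions behavior described above can be sketched as follows. Both function names are hypothetical, and treating an explicit override of False like a file setting of False for /v1/completions is an assumption drawn from the precedence rule, not a documented guarantee.

```python
from typing import Optional


def effective_transform_all_tokens(override: Optional[bool], sgt_file_value: bool) -> bool:
    """The override wins when it is True or False; None falls back to the SGT file."""
    return sgt_file_value if override is None else override


def check_completions_request(override: Optional[bool], sgt_file_value: bool) -> None:
    """Reject a /v1/completions request when not all tokens may be transformed."""
    if not effective_transform_all_tokens(override, sgt_file_value):
        raise ValueError("/v1/completions requires transform_all_tokens to be True")


check_completions_request(override=True, sgt_file_value=False)  # accepted thanks to the override
print(effective_transform_all_tokens(None, False))               # False: the SGT file setting is used
```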

embedding_compression class-attribute instance-attribute

embedding_compression: EmbeddingCompressionStrategy = NONE

Compression algorithm for SGT embeddings sent to the upstream vLLM inference server.

When set to anything other than none, the proxy encodes embeddings before sending them over the wire. The vLLM server must have the stainedglass_output_protection TurboQuant plugin active (stainedglass_turboquant registered under vllm.general_plugins) to decode them.

embedding_compression_bits class-attribute instance-attribute

embedding_compression_bits: Annotated[
    Literal[1, 2, 3, 4, 8], BeforeValidator(int)
] = 8

Quantization bit-width used when embedding_compression is turboquant.

Must be one of the bit-widths supported by TurboQuant ({1, 2, 3, 4, 8}). Has no effect when embedding_compression is none.
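Because environment variables arrive as strings, the BeforeValidator(int) in the type annotation converts the value to an integer before the Literal check. The sketch below mimics that two-step behavior with the standard library only; the helper name is hypothetical and Pydantic's actual error messages differ.

```python
SUPPORTED_BITS = (1, 2, 3, 4, 8)  # bit-widths listed in the type annotation


def parse_compression_bits(raw: object) -> int:
    """Coerce to int first (like BeforeValidator(int)), then check the allowed set."""
    bits = int(raw)  # raises ValueError for non-numeric strings such as "eight"
    if bits not in SUPPORTED_BITS:
        raise ValueError(f"embedding_compression_bits must be one of {SUPPORTED_BITS}, got {bits}")
    return bits


print(parse_compression_bits("4"))  # 4
```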

DefaultSamplingParams

Default sampling parameters for generation when not specified in the request.

Attributes:

Name Type Description
override_max_tokens int | None

Hard cap on max_tokens that takes precedence over any value in the request.

ignore_request_null_max_tokens bool

Ignore null values for max_tokens/max_completion_tokens in the request, and apply the next inferred value following precedence rules.

default_max_tokens int | None

When a request does not specify max_tokens or max_completion_tokens, this value will be injected into the request sent to the upstream inference service.

allow_determining_default_max_tokens_from_upstream bool

Whether to try determining the max_tokens from the Upstream Provider.

allow_determining_default_max_tokens_from_sgt bool

Whether to try determining the max_tokens from the SGT's tokenizer.

override_max_tokens class-attribute instance-attribute

override_max_tokens: int | None = None

Hard cap on max_tokens that takes precedence over any value in the request.

This sits at the top of the precedence order and is applied even when the client explicitly provides a max_tokens or max_completion_tokens value, acting as a true operator override rather than a default.

None indicates that no override should be applied. -1 indicates that the overridden max tokens should be unlimited (i.e. "max_tokens" will have a value of null when serialized to json and sent to the upstream inference service).

ignore_request_null_max_tokens class-attribute instance-attribute

ignore_request_null_max_tokens: bool = False

Ignore null values for max_tokens/max_completion_tokens in the request, and apply the next inferred value following precedence rules.

This can be useful for versions of vLLM that raise an error when max_tokens is explicitly set to null when sent to the /v1/completions endpoint.

default_max_tokens class-attribute instance-attribute

default_max_tokens: int | None = None

When a request does not specify max_tokens or max_completion_tokens, this value will be injected into the request sent to the upstream inference service.

None indicates that no default should be applied. -1 indicates that the default max tokens should be unlimited (i.e. "max_tokens" will have a value of null when serialized to json and sent to the upstream inference service).

allow_determining_default_max_tokens_from_upstream class-attribute instance-attribute

allow_determining_default_max_tokens_from_upstream: bool = (
    True
)

Whether to try determining the max_tokens from the Upstream Provider.

If neither the override setting, the request, nor the default specifies a max tokens value, allow determining the default max tokens value by querying the upstream provider's /v1/models endpoint. If this request is successful, and the provider exposes the model's maximum context window, the default max tokens for a request will be set to (model context window - input token count).

vLLM is the only tested upstream provider for this feature at the moment, but other providers may expose similar functionality.

allow_determining_default_max_tokens_from_sgt class-attribute instance-attribute

allow_determining_default_max_tokens_from_sgt: bool = True

Whether to try determining the max_tokens from the SGT's tokenizer.

If neither the override setting, the request, nor the default specifies a max tokens value, and determining default max tokens from upstream is disabled or fails, allow determining the default max tokens value based on the SGT's maximum context window. If this is enabled, and the SGT file specifies a maximum context window, the default max tokens for a request will be set to (SGT max context window - input token count). Note that this is usually an upper bound on the maximum context window of the hosted model. Upstream providers may host the model with a smaller context window.
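Taken together, the DefaultSamplingParams attributes describe a precedence order: override, then request, then default, then upstream, then SGT. The sketch below is one reading of that order, not the proxy's implementation. The function name and the UNSET sentinel are hypothetical, the override is assumed to replace the request value outright, and a context window of None stands for a source that is disabled, unavailable, or failed.

```python
from typing import Any, Optional

UNSET = object()  # hypothetical sentinel: the request did not mention max_tokens at all


def resolve_max_tokens(
    *,
    override_max_tokens: Optional[int] = None,
    request_max_tokens: Any = UNSET,
    ignore_request_null_max_tokens: bool = False,
    default_max_tokens: Optional[int] = None,
    upstream_context_window: Optional[int] = None,
    sgt_context_window: Optional[int] = None,
    input_token_count: int = 0,
) -> Optional[int]:
    """Return the max_tokens value to send upstream; None means null (unlimited)."""
    # 1. The operator override wins over everything; -1 means unlimited.
    if override_max_tokens is not None:
        return None if override_max_tokens == -1 else override_max_tokens
    # 2. The request value, unless it is an explicit null that should be ignored.
    if request_max_tokens is not UNSET:
        if request_max_tokens is not None or not ignore_request_null_max_tokens:
            return request_max_tokens
    # 3. The configured default; -1 means unlimited.
    if default_max_tokens is not None:
        return None if default_max_tokens == -1 else default_max_tokens
    # 4. The context window reported by the upstream provider, minus the input tokens.
    if upstream_context_window is not None:
        return upstream_context_window - input_token_count
    # 5. The SGT's maximum context window, minus the input tokens.
    if sgt_context_window is not None:
        return sgt_context_window - input_token_count
    return None


print(resolve_max_tokens(override_max_tokens=50, request_max_tokens=200))         # 50
print(resolve_max_tokens(upstream_context_window=4096, input_token_count=96))     # 4000
```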

BACKWARDS_COMPATIBLE_ENV_VARS module-attribute

BACKWARDS_COMPATIBLE_ENV_VARS: Final[Mapping[str, str]] = {
    "SGP_SGT_PATH": "SGP_SGT_TEXT__PATH",
    "SGP_SGT_NAME": "SGP_SGT_TEXT__NAME",
    "SGP_DEVICE": "SGP_SGT_TEXT__DEVICE",
    "SGP_NUM_SGT_WORKERS": "SGP_SGT_TEXT__NUM_WORKERS",
    "SGP_SGT_USE_CACHE": "SGP_SGT_TEXT__USE_CACHE",
    "SGP_TENSOR_PARALLEL_SIZE": "SGP_SGT_TEXT__TENSOR_PARALLEL_SIZE",
    "SGP_GRACE_PERIOD_SECONDS": "SGP_SGT_TEXT__GRACE_PERIOD_SECONDS",
    "SGP_WORKER_READY_TIMEOUT_SECONDS": "SGP_SGT_TEXT__WORKER_READY_TIMEOUT_SECONDS",
    "SGP_SGT_TORCH_DTYPE": "SGP_SGT_TEXT__TORCH_DTYPE",
    "SGP_SGT_NOISE_LAYER_ATTENTION": "SGP_SGT_TEXT__NOISE_LAYER_ATTENTION",
    "SGP_SGT_THIRD_PARTY_MODEL_PATH": "SGP_SGT_TEXT__THIRD_PARTY_MODEL_PATH",
    "SGP_SGT_TRUST_REMOTE_CODE": "SGP_SGT_TEXT__TRUST_REMOTE_CODE",
    "SGP_COMPILE_NOISE_LAYER_FORWARD": "SGP_SGT_TEXT__COMPILE_NOISE_LAYER_FORWARD",
}

Mapping of backwards compatible environment variables to their new equivalents.

If an old environment variable is set and the new environment variable is not set, the value from the old environment variable will be used, and a warning will be logged.

If both the old and new environment variables are set, the value from the new environment variable will be used, a warning will be logged, and the old environment variable will be ignored.
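The two rules above can be sketched as a function over a copy of the environment. This is an illustration only: the function name is hypothetical, just two entries of the mapping are reproduced for brevity, and the proxy's actual log messages are not shown on this page.

```python
import logging
from typing import Mapping

logger = logging.getLogger(__name__)

# Two entries copied from BACKWARDS_COMPATIBLE_ENV_VARS, for brevity.
OLD_TO_NEW = {
    "SGP_SGT_PATH": "SGP_SGT_TEXT__PATH",
    "SGP_DEVICE": "SGP_SGT_TEXT__DEVICE",
}


def apply_backwards_compatible_env_vars(environ: Mapping[str, str]) -> dict[str, str]:
    """Return a copy of environ with old variable names resolved to new ones."""
    resolved = dict(environ)
    for old_name, new_name in OLD_TO_NEW.items():
        if old_name not in resolved:
            continue
        old_value = resolved.pop(old_name)
        if new_name in resolved:
            # Both are set: the new variable wins and the old one is ignored.
            logger.warning("%s is ignored because %s is set.", old_name, new_name)
        else:
            # Only the old one is set: carry its value over to the new name.
            logger.warning("%s is deprecated; use %s instead.", old_name, new_name)
            resolved[new_name] = old_value
    return resolved


print(apply_backwards_compatible_env_vars({"SGP_SGT_PATH": "a.sgt"}))
# {'SGP_SGT_TEXT__PATH': 'a.sgt'}
```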

If your deployment relies on any of the old environment variables, please update your deployment. The old environment variables are liable to be removed in a future major release.