Environment Variables
Configuration can be set using environment variables. Stained Glass Proxy uses Pydantic Settings to parse environment variables.
Most settings are part of the root ProxySettings class, and those settings can be set using the environment variable names listed in the documentation for that attribute, prefixed by SGP_. For example, to set inference_service_host="http://localhost:9600", set the environment variable SGP_INFERENCE_SERVICE_HOST="http://localhost:9600".
Some attributes use a nested settings object, such as sgt_text. To set those nested settings, use the same prefixing strategy, but also include the name of the nested attribute after a double underscore (__). For example, to set sgt_text.path="<my path>", set the environment variable SGP_SGT_TEXT__PATH="<my path>".
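The naming rule above can be sketched in a few lines of Python. The helper `env_var_name` is hypothetical and is not part of Stained Glass Proxy; it only illustrates how the `SGP_` prefix and the double-underscore delimiter combine.

```python
PREFIX = "SGP_"           # prefix described in the text above
NESTED_DELIMITER = "__"   # double underscore between nested attribute names


def env_var_name(*attribute_path: str) -> str:
    """Build the environment variable name for a (possibly nested) setting.

    Hypothetical helper for illustration only.
    """
    if not attribute_path:
        raise ValueError("at least one attribute name is required")
    return PREFIX + NESTED_DELIMITER.join(part.upper() for part in attribute_path)


print(env_var_name("inference_service_host"))  # SGP_INFERENCE_SERVICE_HOST
print(env_var_name("sgt_text", "path"))        # SGP_SGT_TEXT__PATH
```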
.env Files
Loading directly from a .env file is not currently supported, but you can first source a .env file to load the variables into your environment, and then run the proxy. If using a Docker Compose deployment, you can use env_file to load environment variables from a file.
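If sourcing the file from a shell is inconvenient, a small launcher script could load the variables before starting the proxy. The following is a minimal sketch, assuming a simple `KEY=VALUE` file format. `parse_env_file` is a hypothetical helper, and it deliberately ignores shell features such as variable expansion and multi-line values.

```python
import os


def parse_env_file(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines; blank lines and # comments are skipped."""
    variables: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        # Tolerate an optional leading "export ", as used in sourced files.
        line = line.removeprefix("export ").lstrip()
        key, separator, value = line.partition("=")
        if not separator:
            continue  # not an assignment; ignore the line
        value = value.strip()
        # Strip one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        variables[key.strip()] = value
    return variables


example = """
# upstream inference server
SGP_INFERENCE_SERVICE_HOST="http://localhost:9600"
export SGP_SGT_TEXT__PATH="<my path>"
"""

# Make the variables visible to anything started from this process.
os.environ.update(parse_env_file(example))
```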
ProxySettings
Settings for Stained Glass Proxy.
Methods:
| Name | Description |
|---|---|
| settings_customise_sources | Define the sources and their order for loading the settings values. |
Attributes:
| Name | Type | Description |
|---|---|---|
| inference_service_host | str | The hostname of the upstream service. |
| sgt_text | StainedGlassTransformSettings | Settings for the Stained Glass Transform for text inputs. |
| default_sampling_params | DefaultSamplingParams | Default sampling parameters for generation when not specified in the request. |
| min_new_tokens | int \| None | The minimum number of new tokens to generate. |
| seed | int \| None | The seed for Stained Glass Transform and inference. |
| temperature | float | The default temperature for generation. |
| top_p | float | The default top-p value for generation. |
| top_k | int | The default top-k value for generation. |
| repetition_penalty | float | The default repetition penalty for generation. |
| upstream_keep_alive_timeout | float | Timeout for idle connections with the upstream inference server. |
| upstream_connector_limit | int | Maximum simultaneous upstream HTTP connections for aiohttp upstream manager. |
| session_timeout | float | Timeout for connections with the upstream inference server. |
| api_username | str \| None | The username for upstream inference server authentication. |
| api_password | str \| None | The password for upstream inference server authentication. |
| use_aiohttp_for_upstream | bool | Prefer using aiohttp over httpx to manage HTTP connections with the upstream inference server. This may improve throughput when handling many concurrent requests. |
| upstream_use_chat_completions_for_prompt_embeds | bool | Route transformed prompt embeddings through vLLM's /v1/chat/completions endpoint as prompt_embeds content parts, instead of converting to a legacy /v1/completions request. |
| sagemaker_endpoint_name | str \| None | Name of AWS SageMaker AI endpoint to use as the upstream LLM Inference server. Should be None unless using a SageMaker endpoint. |
| max_input_tokens | int \| None | The maximum number of input tokens to allow in a /v1/chat/completions or /v1/stainedglass request. |
| transform_system_prompt | bool \| None | Whether to apply Stained Glass Transform to system-role messages. |
| output_decryption | bool | Whether to decrypt the output from the upstream service. |
| ephemeral_key_refresh_time_seconds | float | The time in seconds to refresh the ephemeral key, when output decryption is enabled. |
| client_public_key_header_name | str | The name of the HTTP header to use for passing the client public key to the upstream inference server. |
| server_public_key_header_name | str | The name of the HTTP header to use for parsing the server public key from the downstream client. |
| reconstruction_max_batch_size | int \| None | This can be used to limit the batch size for reconstruction tasks and its memory usage. |
| reconstruction_max_sequence_length | int \| None | This can be used to limit the sequence length for reconstruction tasks and its memory usage. |
| reconstruction_max_num_embeddings | int \| None | This can be used to limit the number of embeddings for reconstruction tasks and its memory usage. |
| tool_parser | str \| None | The tool parser to use. Available parsers can be found with vllm serve -h \| grep tool-call-parser. If None, tool calls are not supported. |
| custom_chat_template_file | str \| None | Path to a file containing a custom Jinja chat template. If provided, this takes precedence over custom_chat_template. |
| use_fastokens_tokenizer | bool | Use fastokens as an alternative tokenizer by patching transformers during startup lifespan. |
| default_chat_template_kwargs | Annotated[dict[str, Any], NoDecode] | Default kwargs to render into the chat template. This can be used to inject custom variables into the chat template for use in the SGT. |
| profile | bool | Enable profiling for the proxy. Enabling this will create a folder for profiling data if it does not already exist. |
| profile_data_folder | Path | The folder to store profiling data. This is only used if profiling is enabled. |
| logging_config_file | str | The logging configuration file. |
| license_manager_configuration_file | Path | The path to the license manager configuration file. |
| allowed_headers | Annotated[list[str], NoDecode] | List of allowed headers that will be forwarded on to the downstream service. |
| cors_allow_origins | Annotated[list[str], NoDecode] | List of allowed origins for CORS. This is expected to be a list of URLs. |
| cors_allow_headers | Annotated[list[str], NoDecode] | List of allowed headers for CORS. This is expected to be a list of HTTP headers. |
| request_headers_to_add | Annotated[dict[str, str] \| None, NoDecode] | A dictionary of headers to add to each request to the upstream inference service. |
| request_headers_to_log | Annotated[list[str] \| None, NoDecode] | List of request headers to include in log messages for each request. |
| enable_opentelemetry | bool | Whether to enable OpenTelemetry instrumentation for the proxy. |
| otel_excluded_urls | list[str] | Comma-separated list of URL paths to exclude from OpenTelemetry instrumentation. |
inference_service_host
instance-attribute
inference_service_host: str
The hostname of the upstream service.
This should include the protocol (http or https) and the port if not the default.
Examples:
- http://localhost:8000
- https://example.com:443
- http://vllm:8080
sgt_text
class-attribute
instance-attribute
sgt_text: StainedGlassTransformSettings = Field(
default_factory=StainedGlassTransformSettings
)
Settings for the Stained Glass Transform for text inputs.
default_sampling_params
class-attribute
instance-attribute
default_sampling_params: DefaultSamplingParams = (
DefaultSamplingParams()
)
Default sampling parameters for generation when not specified in the request.
min_new_tokens
class-attribute
instance-attribute
min_new_tokens: int | None = None
The minimum number of new tokens to generate.
seed
class-attribute
instance-attribute
seed: int | None = None
The seed for Stained Glass Transform and inference.
temperature
class-attribute
instance-attribute
temperature: float = 0.3
The default temperature for generation.
top_p
class-attribute
instance-attribute
top_p: float = 0.2
The default top-p value for generation.
top_k
class-attribute
instance-attribute
top_k: int = 5000
The default top-k value for generation.
repetition_penalty
class-attribute
instance-attribute
repetition_penalty: float = 1.0
The default repetition penalty for generation.
upstream_keep_alive_timeout
class-attribute
instance-attribute
upstream_keep_alive_timeout: float = 5
Timeout for idle connections with the upstream inference server.
upstream_connector_limit
class-attribute
instance-attribute
upstream_connector_limit: int = Field(default=100, ge=0)
Maximum simultaneous upstream HTTP connections for aiohttp upstream manager.
This maps to the limit parameter of aiohttp's TCPConnector. The default (100) matches aiohttp's default.
This value must be greater than or equal to 0. A value of 0 disables the connection limit (i.e., no limit is enforced).
Insufficient connector limits cause two problems:
- Queueing latency: Requests wait in the proxy's internal queue before acquiring a connection.
- Broken pipe risk: Queued requests may reuse pooled connections that are stale or right at the keep-alive timeout boundary. If the upstream closes the connection while it sits in the queue, the request fails with connection errors.
Safety Constraints
- Do not exceed the OS file descriptor limit (check ulimit -n).
- Do not significantly exceed the upstream's own concurrency budget.
- Memory per connection: Each active connection consumes memory. Monitor and adjust based on available resources and traffic patterns.
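The file descriptor constraint can be checked programmatically. This is a rough sketch for Unix-like systems using the standard library `resource` module. The function `connector_limit_is_safe` and the `reserved_fds` headroom of 64 are illustrative assumptions, not values taken from the proxy.

```python
import resource  # Unix-only standard library module


def connector_limit_is_safe(
    connector_limit: int, fd_soft_limit: int, reserved_fds: int = 64
) -> bool:
    """Check that the limit leaves `reserved_fds` descriptors for other uses.

    `reserved_fds` is an arbitrary headroom chosen for this sketch (log files,
    listening sockets, client connections, and so on).
    """
    if connector_limit < 0:
        raise ValueError("connector_limit must be >= 0")
    if fd_soft_limit == resource.RLIM_INFINITY:
        return True  # the OS imposes no descriptor limit
    if connector_limit == 0:
        return False  # 0 disables the connection limit, so nothing bounds it
    return connector_limit + reserved_fds <= fd_soft_limit


# Equivalent of `ulimit -n` for the current process.
soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
print(connector_limit_is_safe(100, soft_limit))
```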
session_timeout
class-attribute
instance-attribute
session_timeout: float = 60
Timeout for connections with the upstream inference server.
api_username
class-attribute
instance-attribute
api_username: str | None = None
The username for upstream inference server authentication.
api_password
class-attribute
instance-attribute
api_password: str | None = None
The password for upstream inference server authentication.
use_aiohttp_for_upstream
class-attribute
instance-attribute
use_aiohttp_for_upstream: bool = False
Prefer using aiohttp over httpx to manage HTTP connections with the upstream inference server. This may improve throughput when handling many concurrent requests.
upstream_use_chat_completions_for_prompt_embeds
class-attribute
instance-attribute
upstream_use_chat_completions_for_prompt_embeds: bool = (
False
)
Route transformed prompt embeddings through vLLM's /v1/chat/completions endpoint as prompt_embeds content parts, instead of
converting to a legacy /v1/completions request.
Requires a vLLM upstream that supports prompt_embeds in chat messages (>= v0.21.0). Leave False for backwards compatibility with
older vLLM versions.
sagemaker_endpoint_name
class-attribute
instance-attribute
sagemaker_endpoint_name: str | None = None
Name of AWS SageMaker AI endpoint to use as the upstream LLM Inference server. Should be None unless using a SageMaker endpoint.
When set, inference_service_host must be set to the empty string; otherwise, the proxy will raise an error while loading.
max_input_tokens
class-attribute
instance-attribute
max_input_tokens: int | None = None
The maximum number of input tokens to allow in a /v1/chat/completions or /v1/stainedglass request.
Requests whose token count exceeds this value are rejected with a 413 status code.
transform_system_prompt
class-attribute
instance-attribute
transform_system_prompt: bool | None = None
Whether to apply Stained Glass Transform to system-role messages.
If set to True or False, this setting overrides any per-request value provided in
the chat completion request's extra body. If None, the per-request value is used when provided.
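The precedence described above amounts to a simple two-level lookup. The function below is a hypothetical illustration, not the proxy's code. It returns None when neither the setting nor the request supplies a value, because the text does not say what happens in that case.

```python
from typing import Optional


def resolve_transform_system_prompt(
    setting: Optional[bool], request_value: Optional[bool]
) -> Optional[bool]:
    """Apply the documented precedence: an explicit setting wins over the request."""
    if setting is not None:
        return setting
    return request_value  # may be None if the request did not specify it


print(resolve_transform_system_prompt(False, True))  # False: the setting overrides
print(resolve_transform_system_prompt(None, True))   # True: per-request value is used
```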
output_decryption
class-attribute
instance-attribute
output_decryption: bool = False
Whether to decrypt the output from the upstream service.
Warning
This should only be enabled when the upstream inference service uses output encryption.
ephemeral_key_refresh_time_seconds
class-attribute
instance-attribute
ephemeral_key_refresh_time_seconds: float = 15 * 60
The time in seconds to refresh the ephemeral key, when output decryption is enabled.
client_public_key_header_name
class-attribute
instance-attribute
client_public_key_header_name: str = 'x-client-public-key'
The name of the HTTP header to use for passing the client public key to the upstream inference server.
server_public_key_header_name
class-attribute
instance-attribute
server_public_key_header_name: str = 'x-server-public-key'
The name of the HTTP header to use for parsing the server public key from the downstream client.
reconstruction_max_batch_size
class-attribute
instance-attribute
reconstruction_max_batch_size: int | None = None
This can be used to limit the batch size for reconstruction tasks and its memory usage.
Setting this can drastically reduce memory consumption during the attempted reconstruction when using the /v1/stainedglass endpoint,
but may also increase the processing time.
reconstruction_max_sequence_length
class-attribute
instance-attribute
reconstruction_max_sequence_length: int | None = None
This can be used to limit the sequence length for reconstruction tasks and its memory usage.
Setting this can drastically reduce memory consumption during the attempted reconstruction when using the /v1/stainedglass endpoint,
but may also increase the processing time.
reconstruction_max_num_embeddings
class-attribute
instance-attribute
reconstruction_max_num_embeddings: int | None = None
This can be used to limit the number of embeddings for reconstruction tasks and its memory usage.
Setting this can drastically reduce memory consumption during the attempted reconstruction when using the /v1/stainedglass endpoint,
but may also increase the processing time.
tool_parser
class-attribute
instance-attribute
tool_parser: str | None = None
The tool parser to use. Available parsers can be found with vllm serve -h | grep tool-call-parser. If None, tool calls are not supported.
custom_chat_template_file
class-attribute
instance-attribute
custom_chat_template_file: str | None = None
Path to a file containing a custom Jinja chat template. If provided, this takes precedence over custom_chat_template.
use_fastokens_tokenizer
class-attribute
instance-attribute
use_fastokens_tokenizer: bool = False
Use fastokens as an alternative tokenizer by patching transformers during startup lifespan.
default_chat_template_kwargs
class-attribute
instance-attribute
Default kwargs to render into the chat template. This can be used to inject custom variables into the chat template for use in the SGT. This is especially useful to enable or disable thinking for chat templates which support the feature.
profile
class-attribute
instance-attribute
profile: bool = False
Enable profiling for the proxy. Enabling this will create a folder for profiling data if it does not already exist.
profile_data_folder
class-attribute
instance-attribute
The folder to store profiling data. This is only used if profiling is enabled.
logging_config_file
class-attribute
instance-attribute
logging_config_file: str = 'logging.yaml'
The logging configuration file.
license_manager_configuration_file
class-attribute
instance-attribute
The path to the license manager configuration file.
allowed_headers
class-attribute
instance-attribute
List of allowed headers that will be forwarded on to the downstream service.
This is expected to be a comma-separated list, e.g. 'Modal-Key, Modal-Secret'.
cors_allow_origins
class-attribute
instance-attribute
List of allowed origins for CORS. This is expected to be a list of URLs. The default is to allow all origins.
cors_allow_headers
class-attribute
instance-attribute
List of allowed headers for CORS. This is expected to be a list of HTTP headers. The default is to allow all headers.
request_headers_to_add
class-attribute
instance-attribute
A dictionary of headers to add to each request to the upstream inference service.
request_headers_to_log
class-attribute
instance-attribute
List of request headers to include in log messages for each request.
This is expected to be a comma-separated list, e.g. 'X-Request-ID, User-Agent'. Only headers present in the request will be logged.
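As an illustration of the behavior described above, the sketch below splits a comma-separated value and keeps only the headers that actually appear in a request. Both functions are hypothetical and do not reflect the proxy's internal parsing. The case-insensitive comparison relies on HTTP header names being case-insensitive.

```python
def parse_header_list(raw: str) -> list[str]:
    """Split a comma-separated header list, dropping blanks and whitespace."""
    return [name.strip() for name in raw.split(",") if name.strip()]


def select_headers_to_log(
    request_headers: dict[str, str], names_to_log: list[str]
) -> dict[str, str]:
    """Return only the configured headers that are present in the request."""
    # HTTP header names are case-insensitive, so compare on lowercase.
    lowered = {name.lower(): value for name, value in request_headers.items()}
    return {
        name: lowered[name.lower()]
        for name in names_to_log
        if name.lower() in lowered
    }


names = parse_header_list("X-Request-ID, User-Agent")
# X-Request-ID is absent from this request, so only User-Agent is selected.
print(select_headers_to_log({"user-agent": "curl/8.5.0"}, names))
```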
enable_opentelemetry
class-attribute
instance-attribute
enable_opentelemetry: bool = False
Whether to enable OpenTelemetry instrumentation for the proxy.
otel_excluded_urls
class-attribute
instance-attribute
Comma-separated list of URL paths to exclude from OpenTelemetry instrumentation.
Note
This setting is dynamically set based on the value of the environment variable
OTEL_PYTHON_FASTAPI_EXCLUDED_URLS. If not set, it defaults to an empty list.
settings_customise_sources
classmethod
settings_customise_sources(
settings_cls: type[BaseSettings],
init_settings: PydanticBaseSettingsSource,
env_settings: PydanticBaseSettingsSource,
dotenv_settings: PydanticBaseSettingsSource,
file_secret_settings: PydanticBaseSettingsSource,
) -> tuple[PydanticBaseSettingsSource, ...]
Define the sources and their order for loading the settings values.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| settings_cls | type[BaseSettings] | The Settings class. | required |
| init_settings | PydanticBaseSettingsSource | The | required |
| env_settings | PydanticBaseSettingsSource | The | required |
| dotenv_settings | PydanticBaseSettingsSource | The | required |
| file_secret_settings | PydanticBaseSettingsSource | The | required |
Returns:
| Type | Description |
|---|---|
| tuple[PydanticBaseSettingsSource, ...] | A tuple containing the sources and their order for loading the settings values. |
StainedGlassTransformSettings
Settings related to the Stained Glass Transform itself.
Attributes:
| Name | Type | Description |
|---|---|---|
| path | str | Path to the Stained Glass Transform file. |
| name | str \| None | Optional name to use for the SGT in the /models endpoint and as the model id for /v1/chat/completions and /v1/completions requests. |
| device | Literal['cpu', 'cuda', 'mps'] | Device that Stained Glass Transform will run on. |
| num_workers | int | The number of SGT workers. |
| use_cache | bool \| None | The value of use_cache to pass to the modeling blocks underlying the SGT mean and std estimator modules. |
| tensor_parallel_size | int \| None | Tensor parallel size (must be a power of 2, min 2, max num_gpus). If None, disables tensor parallelism. |
| grace_period_seconds | float | The grace period in seconds to wait for workers to shut down. |
| worker_ready_timeout_seconds | float \| None | The timeout in seconds to wait for workers to signal ready during startup. If None, waits indefinitely. |
| torch_dtype | str \| None | The data type for the Stained Glass Transform. If None, uses the dtype that the SGT's weights are saved in. |
| noise_layer_attention | SupportedAttentionImplementationsType | The attention mechanism for the noise layer. If None, uses the attention mechanism saved in the SGT file. |
| third_party_model_path | str \| PathLike[str] \| None | The path or Hugging Face reference to a third-party model to load. This is useful when loading SGTs whose internal structure depends on transformers which are not importable directly through transformers, but are present on the Hugging Face Hub. Typically this should be None or False. |
| trust_remote_code | bool | Whether to trust remote code when loading from HuggingFace Hub. |
| compile_noise_layer_forward | bool \| None | Whether to compile the noise layer's forward function. |
| override_transform_all_tokens | bool \| None | Override the transform_all_tokens setting from the SGT file. |
| embedding_compression | EmbeddingCompressionStrategy | Compression algorithm for SGT embeddings sent to the upstream vLLM inference server. |
| embedding_compression_bits | Annotated[Literal[1, 2, 3, 4, 8], BeforeValidator(int)] | Quantization bit-width used when embedding_compression is turboquant. |
path
instance-attribute
path: str
Path to the Stained Glass Transform file.
This can be a path to a .sgt zipfile or a model name on the Hugging Face Hub
(such as Protopia/SGT-for-llama-3.1-8b-instruct-rare-rain-bfloat16). Passing a local directory is not supported.
name
class-attribute
instance-attribute
name: str | None = None
Optional name to use for the SGT in the /models endpoint and as the model id for /v1/chat/completions and /v1/completions requests.
When this is set, the model can be referred to either as <base model name> or <base model name>/<sgt name>, and both options will be visible from the /models endpoint.
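To make the naming scheme concrete, the sketch below lists the model ids a client could use under the rule above. `visible_model_ids` is a hypothetical helper, the example names are placeholders, and the single-id result when no SGT name is set is an assumption inferred from the text.

```python
from typing import Optional


def visible_model_ids(base_model_name: str, sgt_name: Optional[str]) -> list[str]:
    """List the ids that would refer to the model, following the rule above."""
    ids = [base_model_name]
    if sgt_name is not None:
        # With a name set, both the plain and the qualified form are accepted.
        ids.append(f"{base_model_name}/{sgt_name}")
    return ids


# Placeholder names, chosen only for this example.
print(visible_model_ids("example-base-model", "my-sgt"))
```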
device
class-attribute
instance-attribute
device: Literal['cpu', 'cuda', 'mps'] = 'cpu'
Device that Stained Glass Transform will run on.
num_workers
class-attribute
instance-attribute
num_workers: int = 1
The number of SGT workers.
Increasing this may improve throughput when handling many concurrent requests, but will also increase memory usage.
use_cache
class-attribute
instance-attribute
use_cache: bool | None = None
The value of use_cache to pass to the modeling blocks underlying the SGT mean and std estimator modules.
tensor_parallel_size
class-attribute
instance-attribute
tensor_parallel_size: int | None = None
Tensor parallel size (must be a power of 2, min 2, max num_gpus). If None, disables tensor parallelism.
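The constraints in the parenthetical can be expressed as a short validation routine. This is a sketch under the stated rules only. `validate_tensor_parallel_size` is a hypothetical name, and the proxy's own validation and error messages may differ.

```python
from typing import Optional


def validate_tensor_parallel_size(size: Optional[int], num_gpus: int) -> Optional[int]:
    """Check the documented constraints: power of 2, at least 2, at most num_gpus."""
    if size is None:
        return None  # tensor parallelism is disabled
    # A positive integer is a power of two when it has exactly one bit set.
    is_power_of_two = size > 0 and (size & (size - 1)) == 0
    if size < 2 or not is_power_of_two:
        raise ValueError(f"tensor_parallel_size must be a power of 2 and >= 2, got {size}")
    if size > num_gpus:
        raise ValueError(f"tensor_parallel_size {size} exceeds the {num_gpus} available GPUs")
    return size


print(validate_tensor_parallel_size(4, num_gpus=8))  # 4
```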
grace_period_seconds
class-attribute
instance-attribute
grace_period_seconds: float = 5
The grace period in seconds to wait for workers to shut down.
worker_ready_timeout_seconds
class-attribute
instance-attribute
worker_ready_timeout_seconds: float | None = None
The timeout in seconds to wait for workers to signal ready during startup. If None, waits indefinitely.
torch_dtype
class-attribute
instance-attribute
torch_dtype: str | None = None
The data type for the Stained Glass Transform. If None, uses the dtype that the SGT's weights are saved in.
For most use cases, we recommend this be "torch.bfloat16". If you are seeing unexpectedly large memory consumption, try explicitly
setting this option.
noise_layer_attention
class-attribute
instance-attribute
The attention mechanism for the noise layer. If None, uses the attention mechanism saved in the SGT file.
For most use cases, we recommend this to be "flash_attention_2". If you are seeing unexpectedly large memory consumption, try explicitly
setting this option.
Warning
Not all attention mechanisms may be available for all dtypes and devices.
third_party_model_path
class-attribute
instance-attribute
The path or Hugging Face reference to a third-party model to load. This is useful when loading SGTs whose internal structure depends
on transformers which are not importable directly through transformers, but are present on the Hugging Face Hub. Typically this should
be None or False.
trust_remote_code
class-attribute
instance-attribute
trust_remote_code: bool = False
Whether to trust remote code when loading from HuggingFace Hub.
Warning
Enabling this allows execution of arbitrary code from the model repository on Hugging Face Hub. Only enable this for models from trusted sources.
compile_noise_layer_forward
class-attribute
instance-attribute
compile_noise_layer_forward: bool | None = None
Whether to compile the noise layer's forward function.
Compiling the forward function can provide around a 2x speedup when using CUDA. On MPS, this feature is still experimental, and enabling it is not recommended until it becomes more stable.
override_transform_all_tokens
class-attribute
instance-attribute
override_transform_all_tokens: bool | None = None
Override the transform_all_tokens setting from the SGT file.
"Transform all tokens" refers to whether or not the SGT is applied to all tokens in the input, including the chat template itself, or if it is just applied to user-provided inputs (such as message contents).
When upstream_use_chat_completions_for_prompt_embeds and transform_all_tokens are both True, although the SGT is in fact
applied to all tokens, only the user-provided inputs are extracted and sent to the upstream as prompt_embeds content parts. When
the chat template is reassembled in the upstream provider, the chat template will effectively be untransformed (but the messages will
still be sent as prompt_embeds). Consequently, the transform_all_tokens setting has little effect in this configuration.
This setting is helpful when using the /v1/chat/completions endpoint with upstream_use_chat_completions_for_prompt_embeds=False.
In that case, the SGT is applied to the entire prompt, including the chat template, and the entire transformed prompt_embeds are sent
to the upstream's /v1/completions endpoint. In that case, the upstream does not do any chat template manipulation, so its inputs are
truly transformed before hitting the model.
This setting is also particularly useful when using the /v1/completions endpoint. In that case, the SGT has no way of calculating
a noise mask to determine which tokens are user-provided and which are from the chat template, so the SGT will simply transform
everything. If the SGT file settings dictate that transform_all_tokens=False, SGT Proxy will throw an error on the request to avoid
transforming all of the tokens in an undefined way (since the SGT was trained without transforming chat template tokens). This
override setting allows you to still use that SGT with the /v1/completions endpoint, albeit with the caveat that all tokens will
indeed be transformed, which may result in worse generation quality if the SGT was not trained with transforming all tokens.
If None, the transform_all_tokens setting from the SGT file will be used.
If True or False, this setting will take precedence over the SGT file's setting.
embedding_compression
class-attribute
instance-attribute
Compression algorithm for SGT embeddings sent to the upstream vLLM inference server.
When set to anything other than none, the proxy encodes embeddings before sending them
over the wire. The vLLM server must have the stainedglass_output_protection TurboQuant
plugin active (stainedglass_turboquant registered under vllm.general_plugins) to
decode them.
embedding_compression_bits
class-attribute
instance-attribute
Quantization bit-width used when embedding_compression is turboquant.
Must be one of the bit-widths supported by TurboQuant ({1, 2, 3, 4, 8}). Has no effect
when embedding_compression is none.
DefaultSamplingParams
Default sampling parameters for generation when not specified in the request.
Attributes:
| Name | Type | Description |
|---|---|---|
| override_max_tokens | int \| None | Hard cap on max_tokens that takes precedence over any value in the request. |
| ignore_request_null_max_tokens | bool | Ignore null values for max_tokens/max_completion_tokens in the request, and apply the next inferred value following precedence rules. |
| default_max_tokens | int \| None | When a request does not specify max_tokens or max_completion_tokens, this value will be injected into the request sent to the upstream inference service. |
| allow_determining_default_max_tokens_from_upstream | bool | Whether to try determining the max_tokens from the Upstream Provider. |
| allow_determining_default_max_tokens_from_sgt | bool | Whether to try determining the max_tokens from the SGT's tokenizer. |
override_max_tokens
class-attribute
instance-attribute
override_max_tokens: int | None = None
Hard cap on max_tokens that takes precedence over any value in the request.
This sits at the top of the precedence order and is applied even when the client explicitly provides a max_tokens or max_completion_tokens value, acting as a true operator override rather than a default.
None indicates that no override should be applied. -1 indicates that the overridden max tokens should be unlimited (i.e. "max_tokens" will have a value of null when serialized to json and sent to the upstream inference service).
ignore_request_null_max_tokens
class-attribute
instance-attribute
ignore_request_null_max_tokens: bool = False
Ignore null values for max_tokens/max_completion_tokens in the request, and apply the next inferred value following precedence rules.
This can be useful for versions of vLLM that raise an error when max_tokens is explicitly set to null when sent to the
/v1/completions endpoint.
default_max_tokens
class-attribute
instance-attribute
default_max_tokens: int | None = None
When a request does not specify max_tokens or max_completion_tokens, this value will be injected into the request sent to the upstream inference service.
None indicates that no default should be applied. -1 indicates that the default max tokens should be unlimited (i.e. "max_tokens" will have a value of null when serialized to json and sent to the upstream inference service).
allow_determining_default_max_tokens_from_upstream
class-attribute
instance-attribute
allow_determining_default_max_tokens_from_upstream: bool = (
True
)
Whether to try determining the max_tokens from the Upstream Provider.
If none of the override setting, the request, or the default specifies the max tokens value, allow determining the default max tokens
value by querying the upstream provider's /v1/models endpoint. If this request is successful, and the provider exposes the model's
maximum context window, the default max tokens for a request will be set to (model context window - input token count).
vLLM is the only tested upstream provider for this feature at the moment, but other providers may expose similar functionality.
allow_determining_default_max_tokens_from_sgt
class-attribute
instance-attribute
allow_determining_default_max_tokens_from_sgt: bool = True
Whether to try determining the max_tokens from the SGT's tokenizer.
If none of the override setting, the request, or the default specifies the max tokens value, and determining default max tokens from upstream is disabled or fails, allow determining the default max tokens value based on the SGT's maximum context window. If this is enabled, and the SGT file specifies a maximum context window, the default max tokens for a request will be set to (SGT max context window - input token count). Note that this is usually an upper bound on the maximum context window of the hosted model. Upstream providers may host the model with a smaller context window.
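Taken together, the DefaultSamplingParams attributes describe a precedence chain for max_tokens. The sketch below is one possible reading of that chain and not the proxy's implementation. `SamplingDefaults`, `resolve_max_tokens`, and the `UNSET` sentinel are hypothetical names. The sketch treats override_max_tokens as replacing the request value, and it models a failed upstream lookup as an unknown (None) context window.

```python
from dataclasses import dataclass
from typing import Any, Optional

UNSET: Any = object()  # marks "the request did not mention max_tokens at all"


@dataclass(frozen=True)
class SamplingDefaults:
    """Hypothetical stand-in mirroring the DefaultSamplingParams attributes."""

    override_max_tokens: Optional[int] = None
    ignore_request_null_max_tokens: bool = False
    default_max_tokens: Optional[int] = None
    allow_determining_default_max_tokens_from_upstream: bool = True
    allow_determining_default_max_tokens_from_sgt: bool = True


def _unlimited_or(value: int) -> Optional[int]:
    """Map the documented -1 marker to None (serialized as JSON null)."""
    return None if value == -1 else value


def resolve_max_tokens(
    settings: SamplingDefaults,
    *,
    input_token_count: int,
    request_value: Any = UNSET,
    upstream_context_window: Optional[int] = None,
    sgt_context_window: Optional[int] = None,
) -> tuple[str, Optional[int]]:
    """Return (source, max_tokens); a max_tokens of None means null/unlimited."""
    if settings.override_max_tokens is not None:
        return "override", _unlimited_or(settings.override_max_tokens)
    if request_value is not UNSET:
        if request_value is not None:
            return "request", request_value
        if not settings.ignore_request_null_max_tokens:
            return "request", None  # an explicit null is passed through
    if settings.default_max_tokens is not None:
        return "default", _unlimited_or(settings.default_max_tokens)
    if (
        settings.allow_determining_default_max_tokens_from_upstream
        and upstream_context_window is not None
    ):
        return "upstream", upstream_context_window - input_token_count
    if (
        settings.allow_determining_default_max_tokens_from_sgt
        and sgt_context_window is not None
    ):
        return "sgt", sgt_context_window - input_token_count
    return "undetermined", None


print(resolve_max_tokens(SamplingDefaults(default_max_tokens=256), input_token_count=100))
```

Returning the source alongside the value keeps the two meanings of None apart: an explicit "unlimited" versus nothing could be determined.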
BACKWARDS_COMPATIBLE_ENV_VARS
module-attribute
BACKWARDS_COMPATIBLE_ENV_VARS: Final[Mapping[str, str]] = {
"SGP_SGT_PATH": "SGP_SGT_TEXT__PATH",
"SGP_SGT_NAME": "SGP_SGT_TEXT__NAME",
"SGP_DEVICE": "SGP_SGT_TEXT__DEVICE",
"SGP_NUM_SGT_WORKERS": "SGP_SGT_TEXT__NUM_WORKERS",
"SGP_SGT_USE_CACHE": "SGP_SGT_TEXT__USE_CACHE",
"SGP_TENSOR_PARALLEL_SIZE": "SGP_SGT_TEXT__TENSOR_PARALLEL_SIZE",
"SGP_GRACE_PERIOD_SECONDS": "SGP_SGT_TEXT__GRACE_PERIOD_SECONDS",
"SGP_WORKER_READY_TIMEOUT_SECONDS": "SGP_SGT_TEXT__WORKER_READY_TIMEOUT_SECONDS",
"SGP_SGT_TORCH_DTYPE": "SGP_SGT_TEXT__TORCH_DTYPE",
"SGP_SGT_NOISE_LAYER_ATTENTION": "SGP_SGT_TEXT__NOISE_LAYER_ATTENTION",
"SGP_SGT_THIRD_PARTY_MODEL_PATH": "SGP_SGT_TEXT__THIRD_PARTY_MODEL_PATH",
"SGP_SGT_TRUST_REMOTE_CODE": "SGP_SGT_TEXT__TRUST_REMOTE_CODE",
"SGP_COMPILE_NOISE_LAYER_FORWARD": "SGP_SGT_TEXT__COMPILE_NOISE_LAYER_FORWARD",
}
Mapping of backwards compatible environment variables to their new equivalents.
If an old environment variable is set and the new environment variable is not set, the value from the old environment variable will be used, and a warning will be logged.
If both the old and new environment variables are set, the value from the new environment variable will be used, a warning will be logged, and the old environment variable will be ignored.
If your deployment relies on any of the old environment variables, please update your deployment. The old environment variables are liable to be removed in a future major release.
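The resolution rules above can be sketched as a pure function over an environment mapping. This is an illustrative sketch, not the proxy's code. `apply_backwards_compatible_env_vars` is a hypothetical name, only two entries of the mapping are reproduced, the warning messages are invented for the example, and dropping the old name from the returned copy is a choice made for this sketch.

```python
import logging
from typing import Mapping

logger = logging.getLogger(__name__)

# Two entries copied from the mapping above; the full mapping has more.
BACKWARDS_COMPATIBLE_ENV_VARS = {
    "SGP_SGT_PATH": "SGP_SGT_TEXT__PATH",
    "SGP_DEVICE": "SGP_SGT_TEXT__DEVICE",
}


def apply_backwards_compatible_env_vars(environ: Mapping[str, str]) -> dict[str, str]:
    """Return a copy of `environ` with old variable names resolved to new ones."""
    resolved = dict(environ)
    for old_name, new_name in BACKWARDS_COMPATIBLE_ENV_VARS.items():
        if old_name not in resolved:
            continue
        old_value = resolved.pop(old_name)
        if new_name in resolved:
            # Both are set: the new variable wins and the old one is ignored.
            logger.warning("%s is ignored because %s is set", old_name, new_name)
        else:
            # Only the old variable is set: carry its value over.
            logger.warning("%s is deprecated; use %s instead", old_name, new_name)
            resolved[new_name] = old_value
    return resolved


print(apply_backwards_compatible_env_vars({"SGP_SGT_PATH": "<my path>"}))
```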