Environment Variables

ProxySettings pydantic-model

Settings for Stained Glass Proxy.

Note

Any of these can be set via environment variables with the prefix SGP_. For example, to set inference_service_host="http://localhost", set the environment variable SGP_INFERENCE_SERVICE_HOST="http://localhost".

Config:

  • env_prefix: SGP_
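As a rough illustration of the naming rule in the note above, the sketch below maps field names to prefixed environment variable names. The helper name to_env_vars is hypothetical and not part of the proxy, and the upper-casing simply follows the SGP_INFERENCE_SERVICE_HOST example.

```python
ENV_PREFIX = "SGP_"  # taken from the Config section above


def to_env_vars(settings: dict) -> dict:
    """Map settings field names to prefixed environment variable names.

    Hypothetical helper for illustration only; fields set to None are skipped
    so that the proxy falls back to its documented defaults.
    """
    env = {}
    for field_name, value in settings.items():
        if value is None:
            continue
        env[ENV_PREFIX + field_name.upper()] = str(value)
    return env


env = to_env_vars({"inference_service_host": "http://localhost", "seed": None})
print(env)
```

The resulting dictionary can be merged into the environment of whatever process launches the proxy.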

inference_service_host pydantic-field

inference_service_host: str

The hostname of the upstream service.

This should include the protocol (http or https) and the port if not the default.

Examples:

  • http://localhost:8000
  • https://example.com:443
  • http://vllm:8080
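A value can be sanity-checked before it is exported. The following is a minimal sketch of such a pre-flight check using only the standard library; check_inference_service_host is a hypothetical helper, not the proxy's own validator, and the proxy may accept or reject values differently.

```python
from urllib.parse import urlsplit


def check_inference_service_host(value: str) -> str:
    """Return value unchanged if it has an http(s) scheme and a hostname.

    Hypothetical pre-flight check; it does not reproduce the proxy's validation.
    """
    parts = urlsplit(value)
    if parts.scheme not in ("http", "https"):
        raise ValueError(f"expected an http or https URL, got {value!r}")
    if not parts.hostname:
        raise ValueError(f"no hostname found in {value!r}")
    return value


for example in ("http://localhost:8000", "https://example.com:443", "http://vllm:8080"):
    check_inference_service_host(example)
```

A value such as localhost:8000, which omits the protocol, fails this check.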

sgt_path pydantic-field

sgt_path: str

Path to the Stained Glass Transform file.

This can be a path to a .sgt zipfile or a model name on the Hugging Face Hub (such as Protopia/SGT-for-llama-3.1-8b-instruct-rare-rain-bfloat16). Passing a local directory is not supported.

sgt_name pydantic-field

sgt_name: str | None = None

Optional name to use for the SGT in the /models endpoint and as the model id for /v1/chat/completions and /v1/completions requests.

When this is set, the model can be referred to either as <base model name> or <base model name>/<sgt name>, and both options will be visible from the /models endpoint.

Note

If set, this overrides the SGT name stored in the SGT file.
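To make the two accepted model ids concrete, here is a small sketch that builds both forms from a base model name and an SGT name. The function name and the sample values are hypothetical; the actual ids are whatever the /models endpoint reports.

```python
def model_ids(base_model_name: str, sgt_name: str) -> list:
    """Return the two ids described above: the base name, and base name / SGT name."""
    return [base_model_name, f"{base_model_name}/{sgt_name}"]


# "base-model" and "my-sgt" are placeholder values for illustration.
print(model_ids("base-model", "my-sgt"))
```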

min_new_tokens pydantic-field

min_new_tokens: int | None = None

The minimum number of new tokens to generate.

seed pydantic-field

seed: int | None = None

The seed for Stained Glass Transform and inference.

temperature pydantic-field

temperature: float = 0.3

The default temperature for generation.

top_p pydantic-field

top_p: float = 0.2

The default top-p value for generation.

top_k pydantic-field

top_k: int = 5000

The default top-k value for generation.

repetition_penalty pydantic-field

repetition_penalty: float = 1.0

The default repetition penalty for generation.
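The four generation defaults above can be pictured as a base dictionary that a request's own parameters are layered over. This sketch assumes that values supplied in a request take precedence over the configured defaults, which the word "default" suggests but the text does not state; with_defaults is a hypothetical helper.

```python
# Defaults copied from the field signatures above.
GENERATION_DEFAULTS = {
    "temperature": 0.3,
    "top_p": 0.2,
    "top_k": 5000,
    "repetition_penalty": 1.0,
}


def with_defaults(request_params: dict) -> dict:
    """Fill in any generation parameter the request left out.

    Assumption: request values override the configured defaults.
    """
    return {**GENERATION_DEFAULTS, **request_params}


print(with_defaults({"temperature": 0.7}))
```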

upstream_keep_alive_timeout pydantic-field

upstream_keep_alive_timeout: float = 5

Timeout for idle connections with the upstream inference server.

session_timeout pydantic-field

session_timeout: float = 60

Timeout for connections with the upstream inference server.

api_username pydantic-field

api_username: str | None = None

The username for upstream inference server authentication.

api_password pydantic-field

api_password: str | None = None

The password for upstream inference server authentication.

use_aiohttp_for_upstream pydantic-field

use_aiohttp_for_upstream: bool = False

Prefer aiohttp over httpx for managing HTTP connections with the upstream inference server. This may improve throughput when handling many concurrent requests.

sagemaker_endpoint_name pydantic-field

sagemaker_endpoint_name: str | None = None

Name of the AWS SageMaker AI endpoint to use as the upstream LLM inference server. Should be None unless using a SageMaker endpoint.

When set, inference_service_host must be set to the empty string; otherwise, the proxy raises an error while loading.
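The constraint between these two settings can be checked before starting the proxy. The sketch below is a hypothetical pre-flight check written from the rule stated above; it is not the proxy's validator, and the endpoint name in the example is a placeholder.

```python
from __future__ import annotations


def check_upstream(inference_service_host: str, sagemaker_endpoint_name: str | None) -> None:
    """Raise if a SageMaker endpoint is configured alongside a non-empty host."""
    if sagemaker_endpoint_name is not None and inference_service_host != "":
        raise ValueError(
            "inference_service_host must be the empty string when "
            "sagemaker_endpoint_name is set"
        )


check_upstream("", "my-endpoint")              # SageMaker upstream: host is empty
check_upstream("http://localhost:8000", None)  # regular upstream: no endpoint name
```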

device pydantic-field

device: Literal['cpu', 'cuda', 'mps'] = 'cpu'

Device that Stained Glass Transform will run on.

Note

When set to cuda, the cuda:0 device is used unless tensor_parallel_size is set, in which case the tensor parallel settings will take precedence. To avoid using the cuda:0 device in non-tensor parallel environments, we recommend setting the CUDA_VISIBLE_DEVICES environment variable. For example, if you would like the proxy to use cuda:1 instead of cuda:0, you would also set the CUDA_VISIBLE_DEVICES=1 environment variable.
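The CUDA_VISIBLE_DEVICES recommendation can be sketched as an environment dictionary prepared for the proxy process. The helper name proxy_env is hypothetical, and the command that actually launches the proxy is omitted because it is not given here.

```python
from __future__ import annotations

import os


def proxy_env(gpu_index: int, base_env: dict | None = None) -> dict:
    """Build an environment that pins the proxy to one physical GPU.

    With CUDA_VISIBLE_DEVICES set to a single index, that GPU is the only one
    the process can see, so it appears to the process as cuda:0.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["SGP_DEVICE"] = "cuda"
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


env = proxy_env(1)
print(env["SGP_DEVICE"], env["CUDA_VISIBLE_DEVICES"])
```

The returned dictionary can be passed as the env argument of subprocess.run or a similar launcher.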

Warning

"mps" is only supported on Apple Silicon Macs. This support is considered experimental.

num_sgt_workers pydantic-field

num_sgt_workers: int = 1

The number of SGT workers.

Increasing this may improve throughput when handling many concurrent requests, but will also increase memory usage.

compile_noise_layer_forward pydantic-field

compile_noise_layer_forward: bool | None = None

Whether to compile the noise layer's forward function.

Compiling the forward function can provide around a 2x speedup when using CUDA. On MPS, this feature is still experimental, and enabling it is not recommended until it becomes more stable.

tensor_parallel_size pydantic-field

tensor_parallel_size: int | None = None

Tensor parallel size (must be a power of 2, min 2, max num_gpus). If None, disables tensor parallelism.
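The stated constraints (power of 2, at least 2, at most the number of GPUs) can be expressed as a short check. This is a sketch of the rule as written, not the proxy's validator; the function name is hypothetical and num_gpus must be supplied by the caller.

```python
from __future__ import annotations


def is_valid_tensor_parallel_size(size: int | None, num_gpus: int) -> bool:
    """Check the documented constraints; None means tensor parallelism is disabled."""
    if size is None:
        return True
    # A positive integer is a power of two when exactly one bit is set.
    is_power_of_two = size > 0 and (size & (size - 1)) == 0
    return is_power_of_two and 2 <= size <= num_gpus


for candidate in (None, 1, 2, 3, 4, 8):
    print(candidate, is_valid_tensor_parallel_size(candidate, num_gpus=4))
```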

grace_period_seconds pydantic-field

grace_period_seconds: float = 5

The grace period in seconds to wait for workers to shut down.

worker_ready_timeout_seconds pydantic-field

worker_ready_timeout_seconds: float | None = None

The timeout in seconds to wait for workers to signal ready during startup. If None, waits indefinitely.

sgt_torch_dtype pydantic-field

sgt_torch_dtype: str | None = None

The data type for the Stained Glass Transform. If None, uses the dtype that the SGT's weights are saved in.

For most use cases, we recommend this be "torch.bfloat16". If you are seeing unexpectedly large memory consumption, try explicitly setting this option.

sgt_noise_layer_attention pydantic-field

sgt_noise_layer_attention: SupportedAttentionImplementationsType = None

The attention mechanism for the noise layer. If None, uses the attention mechanism saved in the SGT file.

For most use cases, we recommend this to be "flash_attention_2". If you are seeing unexpectedly large memory consumption, try explicitly setting this option.

Warning

Not all attention mechanisms may be available for all dtypes and devices.

max_input_tokens pydantic-field

max_input_tokens: int | None = None

The maximum number of input tokens to allow in a /v1/chat/completions or /v1/stainedglass request.

Requests with token count greater than this value will be rejected with a 413 error code.
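The rejection rule can be sketched as a simple comparison. The function below is hypothetical and assumes that None means no limit is enforced; how the proxy counts tokens is not shown here.

```python
from __future__ import annotations


def rejection_status(token_count: int, max_input_tokens: int | None) -> int | None:
    """Return 413 when the request exceeds the limit, otherwise None.

    Assumption: max_input_tokens=None disables the check.
    """
    if max_input_tokens is not None and token_count > max_input_tokens:
        return 413
    return None


print(rejection_status(4097, 4096))
```

A request whose token count equals the limit is accepted, since only counts greater than the limit are rejected.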

sgt_third_party_model_path pydantic-field

sgt_third_party_model_path: str | PathLike[str] | None = None

The path or Hugging Face Hub reference to a third-party model to load. This is useful when loading SGTs whose internal structure depends on model classes that are not importable directly through transformers but are present on the Hugging Face Hub. Typically this should be None.

sgt_trust_remote_code pydantic-field

sgt_trust_remote_code: bool = False

Whether to trust remote code when loading from the Hugging Face Hub.

Warning

Enabling this allows execution of arbitrary code from the model repository on Hugging Face Hub. Only enable this for models from trusted sources.

output_decryption pydantic-field

output_decryption: bool = False

Whether to decrypt the output from the upstream service.

Warning

This should only be enabled when the upstream inference service uses output encryption.

ephemeral_key_refresh_time_seconds pydantic-field

ephemeral_key_refresh_time_seconds: float = 15 * 60

The time in seconds to refresh the ephemeral key, when output decryption is enabled.

client_public_key_header_name pydantic-field

client_public_key_header_name: str = 'x-client-public-key'

The name of the HTTP header to use for passing the client public key to the upstream inference server.

server_public_key_header_name pydantic-field

server_public_key_header_name: str = 'x-server-public-key'

The name of the HTTP header to use for parsing the server public key from the downstream client.

reconstruction_max_batch_size pydantic-field

reconstruction_max_batch_size: int | None = None

Limits the batch size for reconstruction tasks, and with it their memory usage.

Setting this can drastically reduce memory consumption during the attempted reconstruction when using the /v1/stainedglass endpoint, but may also increase the processing time.

reconstruction_max_sequence_length pydantic-field

reconstruction_max_sequence_length: int | None = None

Limits the sequence length for reconstruction tasks, and with it their memory usage.

Setting this can drastically reduce memory consumption during the attempted reconstruction when using the /v1/stainedglass endpoint, but may also increase the processing time.

reconstruction_max_num_embeddings pydantic-field

reconstruction_max_num_embeddings: int | None = None

Limits the number of embeddings for reconstruction tasks, and with it their memory usage.

Setting this can drastically reduce memory consumption during the attempted reconstruction when using the /v1/stainedglass endpoint, but may also increase the processing time.

tool_parser pydantic-field

tool_parser: str | None = None

The tool parser to use. Available parsers can be found with vllm serve -h | grep tool-call-parser. If None, tool calls are not supported.

custom_chat_template_file pydantic-field

custom_chat_template_file: str | None = None

Path to a file containing a custom Jinja chat template. If provided, this takes precedence over custom_chat_template.

profile pydantic-field

profile: bool = False

Enable profiling for the proxy. Enabling this will create a folder for profiling data if it does not already exist.

profile_data_folder pydantic-field

profile_data_folder: Path

The folder to store profiling data. This is only used if profiling is enabled.

logging_config_file pydantic-field

logging_config_file: str = 'logging.yaml'

The logging configuration file.

license_manager_configuration_file pydantic-field

license_manager_configuration_file: Path

The path to the license manager configuration file.

allowed_headers pydantic-field

allowed_headers: list[str] = []

List of allowed headers that will be forwarded on to the downstream service.

This is expected to be a comma-separated list, e.g. 'Modal-Key, Modal-Secret'.
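For illustration, a comma-separated value like the one above could be split into a list as follows. This is a sketch of one plausible parsing, with whitespace trimmed and empty items dropped; the proxy's own parsing may differ, and parse_comma_separated is a hypothetical name.

```python
def parse_comma_separated(raw: str) -> list:
    """Split a comma-separated string, trimming whitespace and dropping empty items."""
    return [item.strip() for item in raw.split(",") if item.strip()]


print(parse_comma_separated("Modal-Key, Modal-Secret"))
```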

cors_allow_origins pydantic-field

cors_allow_origins: list[str] = ['*']

List of allowed origins for CORS. This is expected to be a list of URLs. The default is to allow all origins.

cors_allow_headers pydantic-field

cors_allow_headers: list[str] = ['*']

List of allowed headers for CORS. This is expected to be a list of http headers. The default is to allow all headers.

request_headers_to_add pydantic-field

request_headers_to_add: dict[str, str] | None = None

A dictionary of headers to add to each request to the upstream inference service.
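How a dictionary is passed through an environment variable is not stated here. The sketch below assumes the value is given as a JSON object, which is how pydantic-settings parses complex types by default; confirm this against your deployment before relying on it. The header name and value are placeholders.

```python
import json


def encode_headers_for_env(headers: dict) -> str:
    """Serialize a header dictionary to a JSON string for an environment variable.

    Assumption: SGP_REQUEST_HEADERS_TO_ADD is parsed as JSON.
    """
    return json.dumps(headers)


value = encode_headers_for_env({"X-Example-Header": "example-value"})
print(f"SGP_REQUEST_HEADERS_TO_ADD={value}")
```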

request_headers_to_log pydantic-field

request_headers_to_log: list[str] | None = None

List of request headers to include in log messages for each request.

This is expected to be a comma-separated list, e.g. 'X-Request-ID, User-Agent'. Only headers present in the request will be logged.
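The "only headers present in the request" behavior can be sketched as a filter over the incoming headers. The helper below is hypothetical; it matches names case-insensitively because HTTP header names are case-insensitive, which may or may not mirror the proxy's implementation.

```python
def headers_to_log(request_headers: dict, names: list) -> dict:
    """Select the configured headers that actually appear in the request."""
    lowered = {key.lower(): value for key, value in request_headers.items()}
    return {name: lowered[name.lower()] for name in names if name.lower() in lowered}


request_headers = {"x-request-id": "abc123", "Accept": "application/json"}
print(headers_to_log(request_headers, ["X-Request-ID", "User-Agent"]))
```

In this example User-Agent is configured but absent from the request, so it is left out of the result.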

enable_opentelemetry pydantic-field

enable_opentelemetry: bool = False

Whether to enable OpenTelemetry instrumentation for the proxy.

otel_excluded_urls pydantic-field

otel_excluded_urls: list[str]

Comma-separated list of URL paths to exclude from OpenTelemetry instrumentation.

Note

This setting is dynamically set based on the value of the environment variable OTEL_PYTHON_FASTAPI_EXCLUDED_URLS. If not set, it defaults to an empty list.
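The fallback described in the note can be sketched as a read of the environment variable with an empty-list default. The function name is hypothetical, and splitting on commas follows the field description above; the proxy's own handling may differ in detail.

```python
from __future__ import annotations

import os


def read_otel_excluded_urls(environ: dict | None = None) -> list:
    """Read OTEL_PYTHON_FASTAPI_EXCLUDED_URLS, defaulting to an empty list."""
    source = os.environ if environ is None else environ
    raw = source.get("OTEL_PYTHON_FASTAPI_EXCLUDED_URLS", "")
    return [path.strip() for path in raw.split(",") if path.strip()]


# "healthz" and "metrics" are placeholder paths for illustration.
print(read_otel_excluded_urls({"OTEL_PYTHON_FASTAPI_EXCLUDED_URLS": "healthz,metrics"}))
```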