Environment Variables

ProxySettings

Settings for Stained Glass Proxy.

Note

Any of these can be set via environment variables with the prefix SGP_. For example, to set inference_service_host="http://localhost", set the environment variable SGP_INFERENCE_SERVICE_HOST="http://localhost".
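The prefix rule can be sketched in a few lines (an illustration only; `env_var_for` is a hypothetical helper, not part of the proxy):

```python
import os

# Hypothetical sketch of the documented mapping: a settings field maps to an
# environment variable by upper-casing the field name and adding "SGP_".
def env_var_for(field_name: str) -> str:
    return "SGP_" + field_name.upper()

os.environ[env_var_for("inference_service_host")] = "http://localhost"
print(os.environ["SGP_INFERENCE_SERVICE_HOST"])  # prints "http://localhost"
```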

Attributes:

  • inference_service_host (str): The hostname of the upstream service.
  • sgt_path (str): Path to the Stained Glass Transform file.
  • min_new_tokens (int | None): The minimum number of new tokens to generate.
  • seed (int | None): The seed for Stained Glass Transform and inference.
  • temperature (float): The default temperature for generation.
  • top_p (float): The default top-p value for generation.
  • top_k (int): The default top-k value for generation.
  • repetition_penalty (float): The default repetition penalty for generation.
  • upstream_keep_alive_timeout (float): Timeout for idle connections with the upstream inference server.
  • session_timeout (float): Timeout for connections with the upstream inference server.
  • api_username (str | None): The username for upstream inference server authentication.
  • api_password (str | None): The password for upstream inference server authentication.
  • use_aiohttp_for_upstream (bool): Prefer aiohttp over httpx for managing HTTP connections with the upstream inference server. This may improve throughput when handling many concurrent requests.
  • sagemaker_endpoint_name (str | None): Name of the AWS SageMaker AI endpoint to use as the upstream LLM inference server. Should be None unless using a SageMaker endpoint.
  • device (Literal['cpu', 'cuda']): Device that Stained Glass Transform will run on.
  • num_sgt_workers (int): The number of SGT workers.
  • tensor_parallel_size (int | None): Tensor parallel size (must be a power of 2, min 2, max num_gpus). If None, disables tensor parallelism.
  • grace_period_seconds (int): The grace period in seconds to wait for workers to shut down.
  • worker_ready_timeout_seconds (float | None): The timeout in seconds to wait for workers to signal ready during startup. If None, waits indefinitely.
  • sgt_torch_dtype (str | None): The data type for the Stained Glass Transform. If None, uses the dtype that the SGT's weights are saved in.
  • sgt_noise_layer_attention (Literal['sdpa', 'flash_attention_2', 'flash_attention_3', 'flex_attention'] | None): The attention mechanism for the noise layer. If None, uses the attention mechanism saved in the SGT file.
  • max_input_tokens (int | None): The maximum number of input tokens to allow in a /v1/chat/completions or /v1/stainedglass request.
  • output_decryption (bool): Whether to decrypt the output from the upstream service.
  • ephemeral_key_refresh_time_seconds (float): The time in seconds between ephemeral key refreshes when output decryption is enabled.
  • reconstruction_max_batch_size (int | None): Limits the batch size for reconstruction tasks, bounding their memory usage.
  • reconstruction_max_sequence_length (int | None): Limits the sequence length for reconstruction tasks, bounding their memory usage.
  • reconstruction_max_num_embeddings (int | None): Limits the number of embeddings for reconstruction tasks, bounding their memory usage.
  • profile (bool): Enable profiling for the proxy. Enabling this will create a folder for profiling data if it does not already exist.
  • profile_data_folder (Path): The folder to store profiling data. This is only used if profiling is enabled.
  • logging_config_file (str): The logging configuration file.
  • license_manager_configuration_file (Path): The path to the license manager configuration file.
  • allowed_headers (set[str] | None): The set of allowed headers that will be forwarded to the downstream service.

inference_service_host instance-attribute

inference_service_host: str

The hostname of the upstream service.

This should include the protocol (http or https) and the port if not the default.

Examples:

  • http://localhost:8000
  • https://example.com:443
  • http://vllm:8080

sgt_path instance-attribute

sgt_path: str

Path to the Stained Glass Transform file.

This should be a file created with Stained Glass Engine using stainedglass_core.transform.StainedGlassTransformForText.save_pretrained.

min_new_tokens class-attribute instance-attribute

min_new_tokens: int | None = None

The minimum number of new tokens to generate.

seed class-attribute instance-attribute

seed: int | None = None

The seed for Stained Glass Transform and inference.

temperature class-attribute instance-attribute

temperature: float = 0.3

The default temperature for generation.

top_p class-attribute instance-attribute

top_p: float = 0.2

The default top-p value for generation.

top_k class-attribute instance-attribute

top_k: int = 5000

The default top-k value for generation.

repetition_penalty class-attribute instance-attribute

repetition_penalty: float = 1.0

The default repetition penalty for generation.

upstream_keep_alive_timeout class-attribute instance-attribute

upstream_keep_alive_timeout: float = 5

Timeout for idle connections with the upstream inference server.

session_timeout class-attribute instance-attribute

session_timeout: float = 60

Timeout for connections with the upstream inference server.

api_username class-attribute instance-attribute

api_username: str | None = None

The username for upstream inference server authentication.

api_password class-attribute instance-attribute

api_password: str | None = None

The password for upstream inference server authentication.

use_aiohttp_for_upstream class-attribute instance-attribute

use_aiohttp_for_upstream: bool = False

Prefer aiohttp over httpx for managing HTTP connections with the upstream inference server. This may improve throughput when handling many concurrent requests.

sagemaker_endpoint_name class-attribute instance-attribute

sagemaker_endpoint_name: str | None = None

Name of the AWS SageMaker AI endpoint to use as the upstream LLM inference server. Should be None unless using a SageMaker endpoint.

When set, inference_service_host must be set to the empty string; otherwise, the proxy will raise an error while loading.

device class-attribute instance-attribute

device: Literal['cpu', 'cuda'] = 'cpu'

Device that Stained Glass Transform will run on.

Note

When set to cuda, the cuda:0 device is used unless tensor_parallel_size is set, in which case the tensor parallel settings will take precedence. To avoid using the cuda:0 device in non-tensor parallel environments, we recommend setting the CUDA_VISIBLE_DEVICES environment variable. For example, if you would like the proxy to use cuda:1 instead of cuda:0, you would also set the CUDA_VISIBLE_DEVICES=1 environment variable.
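A launcher script might set both variables before starting the proxy, as in this sketch (SGP_DEVICE follows the SGP_ prefix rule from the note at the top of this page):

```python
import os

# Make physical GPU 1 the only visible device, so it appears as cuda:0
# inside the proxy process. Must be set before any CUDA library initializes.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
os.environ["SGP_DEVICE"] = "cuda"
```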

num_sgt_workers class-attribute instance-attribute

num_sgt_workers: int = 1

The number of SGT workers.

Increasing this may improve throughput when handling many concurrent requests, but will also increase memory usage.

tensor_parallel_size class-attribute instance-attribute

tensor_parallel_size: int | None = None

Tensor parallel size (must be a power of 2, min 2, max num_gpus). If None, disables tensor parallelism.
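The documented constraint can be expressed as a small check (illustrative only; `is_valid_tensor_parallel_size` is not part of the proxy):

```python
def is_valid_tensor_parallel_size(size: int, num_gpus: int) -> bool:
    """Check the documented rule: a power of 2, at least 2, at most num_gpus."""
    # size & (size - 1) == 0 holds exactly when size is a power of 2.
    return 2 <= size <= num_gpus and size & (size - 1) == 0

print(is_valid_tensor_parallel_size(4, 8))   # True
print(is_valid_tensor_parallel_size(3, 8))   # False: not a power of 2
print(is_valid_tensor_parallel_size(16, 8))  # False: exceeds num_gpus
```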

grace_period_seconds class-attribute instance-attribute

grace_period_seconds: int = 5

The grace period in seconds to wait for workers to shut down.

worker_ready_timeout_seconds class-attribute instance-attribute

worker_ready_timeout_seconds: float | None = None

The timeout in seconds to wait for workers to signal ready during startup. If None, waits indefinitely.

sgt_torch_dtype class-attribute instance-attribute

sgt_torch_dtype: str | None = None

The data type for the Stained Glass Transform. If None, uses the dtype that the SGT's weights are saved in.

For most use cases, we recommend this be "torch.bfloat16". If you are seeing unexpectedly large memory consumption, try explicitly setting this option.

sgt_noise_layer_attention class-attribute instance-attribute

sgt_noise_layer_attention: (
    Literal[
        "sdpa",
        "flash_attention_2",
        "flash_attention_3",
        "flex_attention",
    ]
    | None
) = None

The attention mechanism for the noise layer. If None, uses the attention mechanism saved in the SGT file.

For most use cases, we recommend setting this to "flash_attention_2". If you are seeing unexpectedly large memory consumption, try explicitly setting this option.

Warning

Not all attention mechanisms may be available for all dtypes and devices.

max_input_tokens class-attribute instance-attribute

max_input_tokens: int | None = None

The maximum number of input tokens to allow in a /v1/chat/completions or /v1/stainedglass request.

Requests with a token count greater than this value will be rejected with a 413 status code.

output_decryption class-attribute instance-attribute

output_decryption: bool = False

Whether to decrypt the output from the upstream service.

Warning

This should only be enabled when the upstream inference service uses output encryption.

ephemeral_key_refresh_time_seconds class-attribute instance-attribute

ephemeral_key_refresh_time_seconds: float = 15 * 60

The time in seconds between ephemeral key refreshes when output decryption is enabled.

reconstruction_max_batch_size class-attribute instance-attribute

reconstruction_max_batch_size: int | None = None

This can be used to limit the batch size for reconstruction tasks, bounding their memory usage.

Setting this can drastically reduce memory consumption during the attempted reconstruction when using the /v1/stainedglass endpoint, but may also increase the processing time.

reconstruction_max_sequence_length class-attribute instance-attribute

reconstruction_max_sequence_length: int | None = None

This can be used to limit the sequence length for reconstruction tasks, bounding their memory usage.

Setting this can drastically reduce memory consumption during the attempted reconstruction when using the /v1/stainedglass endpoint, but may also increase the processing time.

reconstruction_max_num_embeddings class-attribute instance-attribute

reconstruction_max_num_embeddings: int | None = None

This can be used to limit the number of embeddings for reconstruction tasks, bounding their memory usage.

Setting this can drastically reduce memory consumption during the attempted reconstruction when using the /v1/stainedglass endpoint, but may also increase the processing time.

profile class-attribute instance-attribute

profile: bool = False

Enable profiling for the proxy. Enabling this will create a folder for profiling data if it does not already exist.

profile_data_folder class-attribute instance-attribute

profile_data_folder: Path = Path('profile')

The folder to store profiling data. This is only used if profiling is enabled.

logging_config_file class-attribute instance-attribute

logging_config_file: str = 'logging.yaml'

The logging configuration file.

license_manager_configuration_file class-attribute instance-attribute

license_manager_configuration_file: Path = Path(
    "license_manager_configuration.json"
)

The path to the license manager configuration file.

allowed_headers class-attribute instance-attribute

allowed_headers: set[str] | None = None

The set of allowed headers that will be forwarded to the downstream service.

This is expected to be a comma-separated list, e.g. 'Modal-Key, Modal-Secret'.