# Environment Variables

## ProxySettings

Settings for Stained Glass Proxy.
**Note:** Any of these settings can be set via environment variables with the prefix `SGP_`. For example, to set `inference_service_host="http://localhost"`, set the environment variable `SGP_INFERENCE_SERVICE_HOST="http://localhost"`.
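For instance, a minimal configuration via environment variables might look like the following sketch (the host, path, and temperature values here are illustrative, not defaults):

```shell
# Illustrative values only; adjust for your deployment.
export SGP_INFERENCE_SERVICE_HOST="http://localhost:8000"  # upstream service, with protocol and port
export SGP_SGT_PATH="/models/my_transform"                 # path to the Stained Glass Transform file
export SGP_TEMPERATURE="0.7"                               # overrides the default sampling temperature
```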
Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `inference_service_host` | `str` | The hostname of the upstream service. |
| `sgt_path` | `str` | Path to the Stained Glass Transform file. |
| `min_new_tokens` | `int \| None` | The minimum number of new tokens to generate. |
| `seed` | `int \| None` | The seed for Stained Glass Transform and inference. |
| `temperature` | `float` | The default temperature for generation. |
| `top_p` | `float` | The default top-p value for generation. |
| `top_k` | `int` | The default top-k value for generation. |
| `repetition_penalty` | `float` | The default repetition penalty for generation. |
| `upstream_keep_alive_timeout` | `float` | Timeout for idle connections with the upstream inference server. |
| `session_timeout` | `float` | Timeout for connections with the upstream inference server. |
| `api_username` | `str \| None` | The username for upstream inference server authentication. |
| `api_password` | `str \| None` | The password for upstream inference server authentication. |
| `use_aiohttp_for_upstream` | `bool` | Prefer aiohttp over httpx for HTTP connections with the upstream inference server. This may improve throughput when handling many concurrent requests. |
| `sagemaker_endpoint_name` | `str \| None` | Name of the AWS SageMaker AI endpoint to use as the upstream LLM inference server. Should be None unless using a SageMaker endpoint. |
| `device` | `Literal['cpu', 'cuda']` | Device that Stained Glass Transform will run on. |
| `num_sgt_workers` | `int` | The number of SGT workers. |
| `tensor_parallel_size` | `int \| None` | Tensor parallel size (must be a power of 2, min 2, max num_gpus). If None, disables tensor parallelism. |
| `grace_period_seconds` | `int` | The grace period in seconds to wait for workers to shut down. |
| `worker_ready_timeout_seconds` | `float \| None` | The timeout in seconds to wait for workers to signal ready during startup. If None, waits indefinitely. |
| `sgt_torch_dtype` | `str \| None` | The data type for the Stained Glass Transform. If None, uses the dtype that the SGT's weights are saved in. |
| `sgt_noise_layer_attention` | `Literal['sdpa', 'flash_attention_2', 'flash_attention_3', 'flex_attention'] \| None` | The attention mechanism for the noise layer. If None, uses the attention mechanism saved in the SGT file. |
| `max_input_tokens` | `int \| None` | The maximum number of input tokens to allow in a `/v1/chat/completions` or `/v1/stainedglass` request. |
| `output_decryption` | `bool` | Whether to decrypt the output from the upstream service. |
| `ephemeral_key_refresh_time_seconds` | `float` | The time in seconds between ephemeral key refreshes when output decryption is enabled. |
| `reconstruction_max_batch_size` | `int \| None` | Limits the batch size for reconstruction tasks and its memory usage. |
| `reconstruction_max_sequence_length` | `int \| None` | Limits the sequence length for reconstruction tasks and its memory usage. |
| `reconstruction_max_num_embeddings` | `int \| None` | Limits the number of embeddings for reconstruction tasks and its memory usage. |
| `profile` | `bool` | Enable profiling for the proxy. Enabling this will create a folder for profiling data if it does not already exist. |
| `profile_data_folder` | `Path` | The folder to store profiling data. This is only used if profiling is enabled. |
| `logging_config_file` | `str` | The logging configuration file. |
| `license_manager_configuration_file` | `Path` | The path to the license manager configuration file. |
| `allowed_headers` | `set[str] \| None` | List of allowed headers that will be forwarded to the downstream service. |
### inference_service_host

*instance-attribute*

`inference_service_host: str`

The hostname of the upstream service.
This should include the protocol (http or https) and the port if not the default.

Examples:

- `http://localhost:8000`
- `https://example.com:443`
- `http://vllm:8080`
### sgt_path

*instance-attribute*

`sgt_path: str`

Path to the Stained Glass Transform file.
This should be a file created with Stained Glass Engine using `stainedglass_core.transform.StainedGlassTransformForText.save_pretrained`.
### min_new_tokens

*class-attribute, instance-attribute*

`min_new_tokens: int | None = None`
The minimum number of new tokens to generate.
### seed

*class-attribute, instance-attribute*

`seed: int | None = None`
The seed for Stained Glass Transform and inference.
### temperature

*class-attribute, instance-attribute*

`temperature: float = 0.3`
The default temperature for generation.
### top_p

*class-attribute, instance-attribute*

`top_p: float = 0.2`
The default top-p value for generation.
### top_k

*class-attribute, instance-attribute*

`top_k: int = 5000`
The default top-k value for generation.
### repetition_penalty

*class-attribute, instance-attribute*

`repetition_penalty: float = 1.0`
The default repetition penalty for generation.
### upstream_keep_alive_timeout

*class-attribute, instance-attribute*

`upstream_keep_alive_timeout: float = 5`
Timeout for idle connections with the upstream inference server.
### session_timeout

*class-attribute, instance-attribute*

`session_timeout: float = 60`
Timeout for connections with the upstream inference server.
### api_username

*class-attribute, instance-attribute*

`api_username: str | None = None`
The username for upstream inference server authentication.
### api_password

*class-attribute, instance-attribute*

`api_password: str | None = None`
The password for upstream inference server authentication.
### use_aiohttp_for_upstream

*class-attribute, instance-attribute*

`use_aiohttp_for_upstream: bool = False`

Prefer aiohttp over httpx for HTTP connections with the upstream inference server. This may improve throughput when handling many concurrent requests.
### sagemaker_endpoint_name

*class-attribute, instance-attribute*

`sagemaker_endpoint_name: str | None = None`

Name of the AWS SageMaker AI endpoint to use as the upstream LLM inference server. Should be None unless using a SageMaker endpoint.
When set, `inference_service_host` must be set to the empty string; otherwise, the proxy will raise an error while loading.
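As a sketch, a SageMaker-backed deployment would therefore set both variables together (the endpoint name below is hypothetical):

```shell
# Hypothetical endpoint name; inference_service_host must be empty when a SageMaker endpoint is used.
export SGP_SAGEMAKER_ENDPOINT_NAME="my-llm-endpoint"
export SGP_INFERENCE_SERVICE_HOST=""
```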
### device

*class-attribute, instance-attribute*

`device: Literal['cpu', 'cuda'] = 'cpu'`

Device that Stained Glass Transform will run on.

**Note:** When set to `cuda`, the `cuda:0` device is used unless `tensor_parallel_size` is set, in which case the tensor parallel settings take precedence. To avoid using the `cuda:0` device in non-tensor-parallel environments, we recommend setting the `CUDA_VISIBLE_DEVICES` environment variable. For example, if you would like the proxy to use `cuda:1` instead of `cuda:0`, also set `CUDA_VISIBLE_DEVICES=1`.
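Concretely, steering the transform onto the second physical GPU could look like this sketch:

```shell
# Expose only physical GPU 1 to the process, so the proxy's cuda:0 maps to GPU 1.
export CUDA_VISIBLE_DEVICES=1
export SGP_DEVICE="cuda"
```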
### num_sgt_workers

*class-attribute, instance-attribute*

`num_sgt_workers: int = 1`
The number of SGT workers.
Increasing this may improve throughput when handling many concurrent requests, but will also increase memory usage.
### tensor_parallel_size

*class-attribute, instance-attribute*

`tensor_parallel_size: int | None = None`
Tensor parallel size (must be a power of 2, min 2, max num_gpus). If None, disables tensor parallelism.
### grace_period_seconds

*class-attribute, instance-attribute*

`grace_period_seconds: int = 5`

The grace period in seconds to wait for workers to shut down.
### worker_ready_timeout_seconds

*class-attribute, instance-attribute*

`worker_ready_timeout_seconds: float | None = None`
The timeout in seconds to wait for workers to signal ready during startup. If None, waits indefinitely.
### sgt_torch_dtype

*class-attribute, instance-attribute*

`sgt_torch_dtype: str | None = None`

The data type for the Stained Glass Transform. If None, uses the dtype that the SGT's weights are saved in.
For most use cases, we recommend `"torch.bfloat16"`. If you are seeing unexpectedly large memory consumption, try explicitly setting this option.
### sgt_noise_layer_attention

*class-attribute, instance-attribute*

`sgt_noise_layer_attention: Literal['sdpa', 'flash_attention_2', 'flash_attention_3', 'flex_attention'] | None = None`

The attention mechanism for the noise layer. If None, uses the attention mechanism saved in the SGT file.
For most use cases, we recommend `"flash_attention_2"`. If you are seeing unexpectedly large memory consumption, try explicitly setting this option.

**Warning:** Not all attention mechanisms may be available for all dtypes and devices.
### max_input_tokens

*class-attribute, instance-attribute*

`max_input_tokens: int | None = None`

The maximum number of input tokens to allow in a `/v1/chat/completions` or `/v1/stainedglass` request.
Requests with a token count greater than this value will be rejected with a `413` error code.
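For example, to cap requests at 8192 input tokens (an illustrative limit, not a default), set:

```shell
# Requests exceeding this token count receive an HTTP 413 response.
export SGP_MAX_INPUT_TOKENS="8192"
```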
### output_decryption

*class-attribute, instance-attribute*

`output_decryption: bool = False`

Whether to decrypt the output from the upstream service.

**Warning:** This should only be enabled when the upstream inference service uses output encryption.
### ephemeral_key_refresh_time_seconds

*class-attribute, instance-attribute*

`ephemeral_key_refresh_time_seconds: float = 15 * 60`

The time in seconds between ephemeral key refreshes when output decryption is enabled.
### reconstruction_max_batch_size

*class-attribute, instance-attribute*

`reconstruction_max_batch_size: int | None = None`

Limits the batch size for reconstruction tasks and its memory usage.
Setting this can drastically reduce memory consumption during attempted reconstruction when using the `/v1/stainedglass` endpoint, but may also increase processing time.
### reconstruction_max_sequence_length

*class-attribute, instance-attribute*

`reconstruction_max_sequence_length: int | None = None`

Limits the sequence length for reconstruction tasks and its memory usage.
Setting this can drastically reduce memory consumption during attempted reconstruction when using the `/v1/stainedglass` endpoint, but may also increase processing time.
### reconstruction_max_num_embeddings

*class-attribute, instance-attribute*

`reconstruction_max_num_embeddings: int | None = None`

Limits the number of embeddings for reconstruction tasks and its memory usage.
Setting this can drastically reduce memory consumption during attempted reconstruction when using the `/v1/stainedglass` endpoint, but may also increase processing time.
### profile

*class-attribute, instance-attribute*

`profile: bool = False`
Enable profiling for the proxy. Enabling this will create a folder for profiling data if it does not already exist.
### profile_data_folder

*class-attribute, instance-attribute*
The folder to store profiling data. This is only used if profiling is enabled.
### logging_config_file

*class-attribute, instance-attribute*

`logging_config_file: str = 'logging.yaml'`
The logging configuration file.
### license_manager_configuration_file

*class-attribute, instance-attribute*
The path to the license manager configuration file.