turboquant_plugin
vLLM general plugin that patches prompt-embed loading to decode TurboQuant-compressed payloads.
Registered via the vllm.general_plugins entry point (declared in pyproject.toml).
As of vLLM 0.21.0, load_general_plugins() is called in every process: in the API server via
EngineArgs.create_engine_config (vllm/engine/arg_utils.py:731), and in the engine subprocess
via EngineCore.__init__ (vllm/v1/engine/core.py:105). The plugin is therefore active in all
processes without any explicit call from entrypoint.py.
The wire format sent by the proxy is a torch.save-ed dict:
Plain float tensors (the existing uncompressed path) are passed through unchanged, so deployment is fully backwards-compatible: old proxy → patched vLLM works, and new proxy → unpatched vLLM fails gracefully, because vLLM rejects the dict payload with a clear validation error instead of silently producing a wrong result.
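The pass-through behaviour can be pictured as a type check on the deserialized payload. The sketch below is illustrative only and is not the plugin's code: load_with_fallback and decode_compressed are hypothetical names, and the decoder is supplied by the caller because the dict layout is not reproduced here. Plain Python values stand in for tensors so the example runs without torch.

```python
from typing import Any, Callable


def load_with_fallback(
    payload: Any,
    decode_compressed: Callable[[dict], Any],
) -> Any:
    """Decode dict payloads; return anything else unchanged."""
    if isinstance(payload, dict):
        # Compressed path: decoding is delegated to the caller-supplied
        # function, since the dict layout is defined by the proxy.
        return decode_compressed(payload)
    # Uncompressed path: a plain tensor is passed through as-is.
    return payload


if __name__ == "__main__":
    plain = [0.1, 0.2, 0.3]  # stand-in for an uncompressed float tensor
    print(load_with_fallback(plain, lambda d: "decoded") is plain)
```

Because the uncompressed branch returns its input untouched, an older proxy that still sends plain tensors sees no change in behaviour.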
Warning
Under almost no circumstances should you need to import this module directly.
If stainedglass_output_protection is installed, vLLM will load the plugin automatically.
Functions:
| Name | Description |
|---|---|
| register | Patch vLLM's prompt-embed loader to handle TurboQuant-compressed payloads. |
register
Patch vLLM's prompt-embed loader to handle TurboQuant-compressed payloads.
Called automatically by vLLM's plugin system at startup in all processes:
both the main OpenAI API server process (via EngineArgs.create_engine_config,
vllm/engine/arg_utils.py:731) and the engine subprocess (via EngineCore.__init__,
vllm/v1/engine/core.py:105). Safe to call multiple times (idempotent).
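One common way to make a patching function safe to call repeatedly is to tag the wrapper it installs and skip the patch when the tag is already present. The following is a sketch of that general technique under stated assumptions, not the plugin's actual mechanism: register_once and the marker attribute _demo_already_patched are hypothetical.

```python
import types

_PATCHED_FLAG = "_demo_already_patched"  # hypothetical marker attribute


def register_once(module, attr, make_wrapper) -> bool:
    """Wrap `module.attr` once; return False if it was already wrapped."""
    current = getattr(module, attr)
    if getattr(current, _PATCHED_FLAG, False):
        return False  # a previous call already installed the wrapper
    wrapped = make_wrapper(current)
    setattr(wrapped, _PATCHED_FLAG, True)
    setattr(module, attr, wrapped)
    return True


# Demo on a throwaway module so nothing real is modified.
demo = types.ModuleType("demo")
demo.load = lambda x: x
calls = []


def make_wrapper(original):
    def wrapper(x):
        calls.append(x)  # record each call that goes through the wrapper
        return original(x)
    return wrapper


first = register_once(demo, "load", make_wrapper)
second = register_once(demo, "load", make_wrapper)
result = demo.load(3)
```

Without the guard, the second call would wrap the wrapper again and each payload would be processed twice.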
Three patch points are required because vLLM imports the function with
from .embed_utils import safe_load_prompt_embeds, creating local bindings
in each module that won't be updated by patching the source module alone.
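The binding behaviour described above is ordinary Python import semantics and can be reproduced with two throwaway modules. In this sketch the module names embed_utils_demo and consumer_demo are hypothetical stand-ins; no vLLM module is touched.

```python
import sys
import types

# Stand-in for the module that defines the loader (hypothetical name).
source = types.ModuleType("embed_utils_demo")
exec(
    "def safe_load_prompt_embeds(data):\n"
    "    return ('original', data)\n",
    source.__dict__,
)
sys.modules["embed_utils_demo"] = source

# Stand-in for a consumer that imports the function by name, which copies
# the reference into the consumer's own namespace.
consumer = types.ModuleType("consumer_demo")
exec(
    "from embed_utils_demo import safe_load_prompt_embeds\n"
    "def handle(data):\n"
    "    return safe_load_prompt_embeds(data)\n",
    consumer.__dict__,
)


def patched_loader(data):
    return ("patched", data)


# Patching only the source module leaves the consumer's binding untouched.
source.safe_load_prompt_embeds = patched_loader
before = consumer.handle(1)

# Patching the consumer's binding as well makes the new loader take effect.
consumer.safe_load_prompt_embeds = patched_loader
after = consumer.handle(1)

print(before, after)
```

The consumer keeps calling the original function until its own name is rebound, which is why every module that imported the loader by name needs its own patch.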