
turboquant_plugin

vLLM general plugin that patches prompt-embed loading to decode TurboQuant-compressed payloads.

Registered via the vllm.general_plugins entry point (declared in pyproject.toml). As of vLLM 0.21.0, load_general_plugins() is called in every process: in the API server via EngineArgs.create_engine_config (vllm/engine/arg_utils.py:731), and in the engine subprocess via EngineCore.__init__ (vllm/v1/engine/core.py:105). The plugin is therefore active in all processes without any explicit call from entrypoint.py.

The wire format sent by the proxy is a torch.save-ed dict:

{"packed": <uint8 tensor>, "norms": <float32 tensor>, "bits": <int>, "d": <int>}

Plain float tensors (the existing uncompressed path) are passed through unchanged, so the two sides can be upgraded independently: an old proxy with a patched vLLM works, and a new proxy with an unpatched vLLM fails gracefully (vLLM rejects the dict payload with a clear validation error rather than silently producing a wrong result).
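The pass-through behaviour can be pictured as a small dispatch on the payload's shape. The sketch below is an illustration only, not the plugin's actual code: is_compressed_payload, load_embeds, and the decode callback are hypothetical names, and tensors are treated as opaque objects so the example runs without torch.

```python
from typing import Any, Callable

# Keys of the dict wire format described above.
REQUIRED_KEYS = frozenset({"packed", "norms", "bits", "d"})


def is_compressed_payload(obj: Any) -> bool:
    """True if obj looks like the compressed dict format (assumed check)."""
    return (
        isinstance(obj, dict)
        and REQUIRED_KEYS.issubset(obj.keys())
        and isinstance(obj["bits"], int)
        and isinstance(obj["d"], int)
    )


def load_embeds(obj: Any, decode: Callable[[dict], Any]) -> Any:
    """Decode compressed payloads; return anything else unchanged."""
    if is_compressed_payload(obj):
        return decode(obj)
    return obj  # existing uncompressed path
```

In this sketch a plain tensor never reaches decode, which is what keeps the old proxy working against a patched server.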

Warning

Under almost no circumstances should you need to import this module directly. If stainedglass_output_protection is installed, vLLM will load the plugin automatically.

Functions:

Name      Description
register  Patch vLLM's prompt-embed loader to handle TurboQuant-compressed payloads.

register

register() -> None

Patch vLLM's prompt-embed loader to handle TurboQuant-compressed payloads.

Called automatically by vLLM's plugin system at startup in all processes: both the main OpenAI API server process (via EngineArgs.create_engine_config, vllm/engine/arg_utils.py:731) and the engine subprocess (via EngineCore.__init__, vllm/v1/engine/core.py:105). Safe to call multiple times (idempotent).

Three patch points are required because vLLM imports the function with from .embed_utils import safe_load_prompt_embeds, creating local bindings in each module that won't be updated by patching the source module alone.
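The binding behaviour behind this can be reproduced with two throwaway in-memory modules. The sketch below uses made-up module and function names (source, consumer, load) rather than vLLM's real ones; it only demonstrates the general Python rule that `from module import name` copies a reference into the importing module.

```python
import types

# "source" defines the function; "consumer" holds its own reference to it,
# which is what `from source import load` would create.
source = types.ModuleType("source")
source.load = lambda: "original"

consumer = types.ModuleType("consumer")
consumer.load = source.load


def patched() -> str:
    return "patched"


# Patching only the source module leaves the consumer's binding untouched.
source.load = patched
result_source_only = (source.load(), consumer.load())

# Patching the consumer's binding as well makes both call sites agree.
consumer.load = patched
result_both = (source.load(), consumer.load())
```

After the first assignment the consumer still calls the original function, which is why every importing module needs its own patch.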