LLM-API Deployment Guide¶
The LLM-API is a TorchServe-powered inferencing server that supports Hugging Face Transformer models.
Overview¶
There are two configurations to be aware of: the TorchServe configuration and the LLM inferencing configuration.
LLM Inferencing Configuration¶
The LLM inferencing configuration contains generation- and batching-specific settings. These can be controlled via the following environment variables.
Name | Description | Example Values |
---|---|---|
MODEL_NAME | The name of the model. | llama2-7b-chat |
MODEL_PATH | The path to the model checkpoint. | /models/llama-2-7b-chat-hf |
DEVICE_MAP | A map that specifies where each submodule should go. It doesn’t need to be refined to each parameter/buffer name, once a given module name is inside, every submodule of it will be sent to the same device. If we only pass the device (e.g., "cpu", "cuda:1", "mps", or a GPU ordinal rank like 1) on which the model will be allocated, the device map will map the entire model to this device. Passing device_map = 0 means put the whole model on GPU 0. To have Accelerate compute the most optimized device_map automatically, set device_map="auto". | auto (DEFAULT) |
MAX_INPUT_TOKENS | The maximum number of input tokens. | 2048 (DEFAULT) |
MAX_NEW_TOKENS | The maximum numbers of tokens to generate, ignoring the number of tokens in the prompt. | 2048 (DEFAULT) |
DTYPE | The torch.dtype to load the model under a specific dtype. | bfloat16 (DEFAULT) |
QUANTIZATION | Supports None, 8bit and 4bit. | None (DEFAULT) |
END_OF_STREAM_TOKEN | Token used to signal the end of stream. This is useful for batched requests to signal completion. | <|endoftext|> |
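As an illustration only (a sketch, not the actual handler code; names and defaults are taken from the table above), the settings resolve from the environment roughly like this:
import os

# Sketch of how the inferencing settings could be resolved from the environment;
# the real handler bundled with LLM-API may differ.
MODEL_NAME = os.environ.get("MODEL_NAME", "llama2-7b-chat")
MODEL_PATH = os.environ.get("MODEL_PATH", "/models/llama-2-7b-chat-hf")
DEVICE_MAP = os.environ.get("DEVICE_MAP", "auto")
MAX_INPUT_TOKENS = int(os.environ.get("MAX_INPUT_TOKENS", "2048"))
MAX_NEW_TOKENS = int(os.environ.get("MAX_NEW_TOKENS", "2048"))
DTYPE = os.environ.get("DTYPE", "bfloat16")
QUANTIZATION = os.environ.get("QUANTIZATION", "None")
END_OF_STREAM_TOKEN = os.environ.get("END_OF_STREAM_TOKEN", "<|endoftext|>")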
TorchServe Configuration¶
TorchServe is run via the following command:
torchserve --start --foreground --ncs --ts-config config.properties --model-store model-store --models $MODEL_NAME
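Once TorchServe is up, the inference endpoint can be verified with TorchServe's standard ping health check (a minimal sketch; the host and port assume the config.properties shipped in the image):
import requests

# TorchServe exposes a health check at /ping on the inference port.
resp = requests.get("http://localhost:8080/ping", timeout=5)
print(resp.json())  # {"status": "Healthy"} once workers are ready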
The image contains a predefined config.properties file that looks like the following:
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
enable_envvars_config=true
install_py_dep_per_model=false
max_request_size=655350000
max_response_size=655350000
default_response_timeout=600
enable_metrics_api=true
default_workers_per_model=1
models={\
  "llama2-7b-chat": {\
    "1.0": {\
      "defaultVersion": true,\
      "batchSize": 2,\
      "maxBatchDelay": 1000,\
      "responseTimeout": 600\
    }\
  },\
  "mistral-7b-instruct": {\
    "1.0": {\
      "defaultVersion": true,\
      "batchSize": 2,\
      "maxBatchDelay": 1000,\
      "responseTimeout": 600\
    }\
  }\
}
This file can be overwritten or tuned for your hardware and environment.
Name | Description |
---|---|
inference_address | Inference API binding address. Default: http://127.0.0.1:8080 |
management_address | Management API binding address. Default: http://127.0.0.1:8081 |
metrics_address | Metrics API binding address. Default: http://127.0.0.1:8082 |
enable_envvars_config | Enable configuring TorchServe through environment variables. When this option is set to "true", all the static configurations of TorchServe can come through environment variables as well. Default: false |
install_py_dep_per_model | Controls whether the model server installs the Python packages from the requirements file supplied with the model archive. Default: false |
max_request_size | The maximum allowable request size, in bytes, that TorchServe accepts. Default: 6553500 |
max_response_size | The maximum allowable response size, in bytes, that TorchServe sends. Default: 6553500 |
default_response_timeout | Timeout, in seconds, used for all models' backend workers before they are deemed unresponsive and rebooted. Default: 120 seconds. |
enable_metrics_api | Enable or disable the metrics API; can be either true or false. Default: true |
Check out TorchServe's supported properties.
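With the metrics API enabled and bound as above, a quick way to confirm it is reachable is to scrape the metrics endpoint (a minimal sketch, assuming the default port 8082):
import requests

# TorchServe serves Prometheus-format metrics on the metrics port.
metrics = requests.get("http://localhost:8082/metrics", timeout=5)
print(metrics.text.splitlines()[:10])  # e.g. request counts and latency metrics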
The model-specific configuration parameters are shown below. The format is expected to be JSON.
{
  "modelName": {
    "version": {
      "parameterName1": parameterValue1,
      "parameterName2": parameterValue2,
      "parameterNameN": parameterValueN
    }
  }
}
Name | Description |
---|---|
defaultVersion | The default version of a model. |
batchSize | This is the maximum batch size that a model is expected to handle. |
maxBatchDelay | This is the maximum batch delay time in ms that TorchServe waits to receive batch_size requests. If TorchServe doesn't receive batch_size requests before this timer times out, it sends whatever requests were received to the model handler. |
minWorkers | The minimum number of workers of a model. |
maxWorkers | The maximum number of workers of a model. |
responseTimeout | The timeout in sec of a specific model's response. This setting takes priority over default_response_timeout which is a default timeout over all models. |
default_workers_per_model | Number of workers to create for each model that is loaded at startup time. Default: the number of available GPUs in the system, or the number of logical processors available to the JVM. |
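Worker settings can also be inspected and adjusted at runtime via TorchServe's management API on port 8081; a minimal sketch (the model name is illustrative):
import requests

MANAGEMENT = "http://localhost:8081"

# Describe the registered model: reports batchSize, maxBatchDelay and current workers.
print(requests.get(f"{MANAGEMENT}/models/llama2-7b-chat").json())

# Scale the worker pool; synchronous=true waits until the workers are up.
requests.put(
    f"{MANAGEMENT}/models/llama2-7b-chat",
    params={"min_worker": 1, "max_worker": 2, "synchronous": "true"},
)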
Torch Model Archiver¶
The Docker image comes bundled with Torch Model Archiver. torch-model-archiver takes model checkpoints, or a model definition file with a state_dict, and packages them into a .mar file. This file can then be redistributed and served by anyone using TorchServe. It takes in the following model artifacts: a model checkpoint file in the case of TorchScript, or a model definition file and a state_dict file in the case of eager mode, plus any other optional assets that may be required to serve the model. The CLI creates a .mar file that TorchServe's server CLI uses to serve the models.

The following torch-model-archiver command is run:
torch-model-archiver --model-name $MODEL_NAME --version 1.0 --handler llm_api/custom_handler.py --archive-format no-archive --export-path model-store
API Reference¶
The OpenAPI spec for LLM-API is provided in API Documentation.
For example calls see client_examples.ipynb.
Invoking LLM-API¶
LLM-API can be invoked either via HTTP on port 8080 or via gRPC on port 7070.
Output format¶
Responses generated by LLM-API are terminated with <|endoftext|> to denote the end of the output. This identifier is critical for streaming responses, as detailed below.
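For non-streaming calls the terminator can simply be stripped from the returned text once the response is complete; a minimal helper sketch (the token value assumes the default END_OF_STREAM_TOKEN):
END_OF_STREAM_TOKEN = "<|endoftext|>"

def strip_terminator(text: str) -> str:
    # Remove the end-of-stream marker appended by LLM-API.
    return text.replace(END_OF_STREAM_TOKEN, "").rstrip()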
HTTP¶
To perform inferencing via HTTP, you need to make a POST call to the endpoint predictions/{model_name} with a JSON payload. The payload should have a single key specifying the input type (text or inputs_embeds), with the value of that key being the input itself (inputs_embeds being the word embeddings produced by the tokenizer of the model served on torch's backend). As an example using requests:
import requests

# Plain-text prompt input.
response = requests.post(
    "http://localhost:8080/predictions/$MODEL_NAME",
    data={"text": instruction, **optional_generation_parameters},
)

# Pre-computed word embeddings as input.
response2 = requests.post(
    "http://localhost:8080/predictions/$MODEL_NAME",
    data={"inputs_embeds": embeddings, **optional_generation_parameters},
)
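Assuming the non-streaming body mirrors the streamed output (plain text ending in the end-of-stream token), it can be read with the strip_terminator helper sketched above:
completion = strip_terminator(response.text)
print(completion)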
HTTP can also be invoked asynchronously/streaming via a library supporting async, such as aiohttp or httpx. The basic format is the same, but the response is returned in iterative chunks until <|endoftext|> is reached. It is up to the user to iterate through the streamed chunks, and to terminate the iteration when <|endoftext|> is returned:
import aiohttp
import asyncio

async def streaming_post(model_endpoint: str, data: dict):
    async with (
        aiohttp.ClientSession() as session,
        session.post(
            model_endpoint,
            data=data,
        ) as resp,
    ):
        # Read the streamed response in chunks until the end-of-stream token appears.
        async for chunk in resp.content.iter_chunked(32):
            text = chunk.decode()
            ...
            if "<|endoftext|>" in text:
                text = text.replace("<|endoftext|>", "")
                break

asyncio.run(
    streaming_post(
        "http://localhost:8080/predictions/$MODEL_NAME",
        data={"text": instruction, **optional_generation_parameters},
    )
)
asyncio.run(
    streaming_post(
        "http://localhost:8080/predictions/$MODEL_NAME",
        data={"inputs_embeds": embeddings, **optional_generation_parameters},
    )
)
gRPC¶
Danger
Only streaming is available via gRPC. Do not attempt to make a non-streaming request via gRPC to LLM-API.
Warning
TorchServe forces batch_size to 1 when using gRPC streaming.
To perform inferencing via gRPC, import the supplied gRPC client from LLM-API-client and invoke a request via the gRPC client with an input payload similar in structure to the REST call format: a dictionary with one key specifying text or inputs_embeds as the input, with the value of that key being the input itself. The output will be in a streaming format, again terminated by <|endoftext|> once the stream is complete. As an example using grpc.aio:
import grpc

from llm_api.proto import inference_pb2, inference_pb2_grpc

...

async def grpc_post(idx, data):
    async with grpc.aio.insecure_channel("localhost:7070") as channel:
        stub = inference_pb2_grpc.InferenceAPIsServiceStub(channel)
        response = stub.StreamPredictions(
            inference_pb2.PredictionsRequest(model_name=MODEL_NAME, input=data)
        )
        # Iterate over the streamed chunks until the end-of-stream token appears.
        async for chunk in response:
            text = chunk.prediction.decode()
            ...
            if "<|endoftext|>" in text:
                text = text.replace("<|endoftext|>", "")
                break
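As with the HTTP streaming example, the coroutine is driven with asyncio; in this sketch idx is treated as a caller-side request identifier, and MODEL_NAME and instruction are assumed to be defined as in the earlier examples:
import asyncio

asyncio.run(grpc_post(0, {"text": instruction}))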