LLM-API Deployment Guide

The LLM-API is a TorchServe-powered inferencing server that serves Hugging Face Transformers models.

Overview

There are two configurations to be aware of: the TorchServe configuration and the LLM inferencing configuration.

LLM Inferencing Configuration

The LLM inferencing configuration contains generation- and batching-specific settings, which are controlled via environment variables.

| Name | Description | Example Values |
| --- | --- | --- |
| MODEL_NAME | The name of the model. | llama2-7b-chat |
| MODEL_PATH | The path to the model checkpoint. | /models/llama-2-7b-chat-hf |
| DEVICE_MAP | A map that specifies where each submodule should go. It doesn't need to be refined down to every parameter/buffer name; once a given module name is included, every submodule of it is sent to the same device. If only a device is passed (e.g. "cpu", "cuda:1", "mps", or a GPU ordinal rank like 1), the entire model is allocated to that device; passing device_map=0 puts the whole model on GPU 0. Set device_map="auto" to have Accelerate compute the most optimized device map automatically. | auto (DEFAULT) |
| MAX_INPUT_TOKENS | The maximum number of input tokens. | 2048 (DEFAULT) |
| MAX_NEW_TOKENS | The maximum number of tokens to generate, ignoring the number of tokens in the prompt. | 2048 (DEFAULT) |
| DTYPE | The torch.dtype under which to load the model. | bfloat16 (DEFAULT) |
| QUANTIZATION | Quantization mode. Supports None, 8bit, and 4bit. | None (DEFAULT) |
| END_OF_STREAM_TOKEN | Token used to signal the end of the stream. This is useful for batched requests to signal completion. | <\|endoftext\|> |
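
As an illustration of how these variables relate to model loading, the sketch below shows one plausible mapping onto a Hugging Face Transformers from_pretrained call. The actual loading code inside LLM-API's handler may differ, and the bitsandbytes-style quantization flags are an assumption:

import os

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Read the inferencing configuration from the environment (names as per the table above).
model_path = os.environ["MODEL_PATH"]                      # e.g. /models/llama-2-7b-chat-hf
device_map = os.environ.get("DEVICE_MAP", "auto")          # "auto", "cpu", "cuda:1", ...
dtype = getattr(torch, os.environ.get("DTYPE", "bfloat16"))
quantization = os.environ.get("QUANTIZATION", "None")      # None, 8bit or 4bit

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map=device_map,
    torch_dtype=dtype,
    load_in_8bit=(quantization == "8bit"),   # assumption: bitsandbytes-style quantization
    load_in_4bit=(quantization == "4bit"),
)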

TorchServe Configuration

TorchServe is run via the following command:

torchserve --start --foreground --ncs --ts-config config.properties --model-store model-store --models $MODEL_NAME

The image contains a predefined config.properties file that looks like the following:

inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
enable_envvars_config=true
install_py_dep_per_model=false
max_request_size=655350000
max_response_size=655350000
default_response_timeout=600
enable_metrics_api=true
default_workers_per_model=1

models={\
  "llama2-7b-chat": {\
    "1.0": {\
        "defaultVersion": true,\
        "batchSize": 2,\
        "maxBatchDelay": 1000,\
        "responseTimeout": 600\
    }\
  },\
  "mistral-7b-instruct": {\
    "1.0": {\
        "defaultVersion": true,\
        "batchSize": 2,\
        "maxBatchDelay": 1000,\
        "responseTimeout": 600\
    }\
  }\
}

This file can be overwritten or tuned for your hardware and environment.

| Name | Description |
| --- | --- |
| inference_address | Inference API binding address. Default: http://127.0.0.1:8080 |
| management_address | Management API binding address. Default: http://127.0.0.1:8081 |
| metrics_address | Metrics API binding address. Default: http://127.0.0.1:8082 |
| enable_envvars_config | Enable configuring TorchServe through environment variables. When set to "true", all of TorchServe's static configuration can also be supplied through environment variables. Default: false |
| install_py_dep_per_model | Controls whether the model server installs the Python packages from the requirements file supplied with the model archive. Default: false |
| max_request_size | The maximum request size TorchServe accepts, in bytes. Default: 6553500 |
| max_response_size | The maximum response size TorchServe sends, in bytes. Default: 6553500 |
| default_response_timeout | Timeout, in seconds, applied to all models' backend workers before they are deemed unresponsive and rebooted. Default: 120 |
| enable_metrics_api | Enable or disable the metrics API; can be either true or false. Default: true |
| default_workers_per_model | Number of workers to create for each model loaded at startup. Default: the number of GPUs available in the system, or the number of logical processors available to the JVM. |

Check out TorchServe's supported properties.

The model-specific configuration parameters are shown below. The format is expected to be JSON.

{
    "modelName": {
        "version": {
            "parameterName1": parameterValue1,
            "parameterName2": parameterValue2,
            "parameterNameN": parameterValueN
        }
    }
}
| Name | Description |
| --- | --- |
| defaultVersion | The default version of a model. |
| batchSize | The maximum batch size that a model is expected to handle. |
| maxBatchDelay | The maximum time, in ms, that TorchServe waits to receive batchSize requests. If batchSize requests are not received before this timer times out, whatever requests were received are sent to the model handler. |
| minWorkers | The minimum number of workers for a model. |
| maxWorkers | The maximum number of workers for a model. |
| responseTimeout | The timeout, in seconds, for a specific model's response. This setting takes priority over default_response_timeout, which is the default timeout across all models. |
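
Worker counts can also be adjusted at runtime through TorchServe's management API (bound to port 8081 in the config above) instead of editing config.properties. A minimal sketch, assuming the server is reachable on localhost and the model is named llama2-7b-chat:

import requests

# Scale the workers of an already-registered model via the management API.
resp = requests.put(
    "http://localhost:8081/models/llama2-7b-chat",
    params={"min_worker": 1, "max_worker": 2, "synchronous": "true"},
)
resp.raise_for_status()
print(resp.json())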

Torch Model Archiver

The Docker image comes bundled with Torch Model Archiver. torch-model-archiver takes model checkpoints, or a model definition file with a state_dict, and packages them into a .mar file, which can then be redistributed and served by anyone using TorchServe. It takes in the following model artifacts: a model checkpoint file in the case of TorchScript, or a model definition file and a state_dict file in the case of eager mode, plus any other optional assets that may be required to serve the model. The CLI creates a .mar file that TorchServe's server CLI uses to serve the models.

The following torch-model-archiver command is run:

torch-model-archiver --model-name $MODEL_NAME --version 1.0 --handler llm_api/custom_handler.py --archive-format no-archive --export-path model-store

API Reference

The OpenAPI spec for LLM-API is provided in API Documentation.

For example calls see client_examples.ipynb.

Invoking LLM-API

LLM-API can be invoked either via HTTP on port 8080 or via gRPC on port 7070.

Output format

Responses generated by LLM-API are terminated with <|endoftext|> (the configured END_OF_STREAM_TOKEN) to denote the end of the output. This identifier is critical for streaming responses, as detailed below.

HTTP

To perform inferencing via HTTP, make a POST call to the endpoint predictions/{model_name} with a JSON payload. The payload should have a single key specifying the input type (text or inputs_embeds), with the value of that key being the input itself (inputs_embeds being the word embeddings produced by the tokenizer of the model served on the torch backend).

As an example using requests:

import requests

response = requests.post("http://localhost:8080/predictions/$MODEL_NAME", data={"text": instruction, **optional_generation_parameters})
response2 = requests.post("http://localhost:8080/predictions/$MODEL_NAME", data={"inputs_embeds": embeddings, **optional_generation_parameters})
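
The exact set of optional generation parameters accepted per request depends on the custom handler; the dictionary below uses common Hugging Face-style generation keys purely as an illustration:

import requests

# Illustrative only: keys assumed to follow Hugging Face generate() parameters.
optional_generation_parameters = {
    "max_new_tokens": 256,
    "temperature": 0.7,
    "top_p": 0.9,
}

response = requests.post(
    "http://localhost:8080/predictions/llama2-7b-chat",
    data={"text": "Summarize the deployment steps.", **optional_generation_parameters},
)
print(response.text)  # terminated with <|endoftext|>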

HTTP can also be invoked asynchronously for streaming via a library that supports async, such as aiohttp or httpx. The basic format is the same, but the response is returned in iterative chunks until <|endoftext|> is reached. It is up to the user to iterate through the streamed chunks and to terminate the iteration when <|endoftext|> is returned:

import aiohttp
import asyncio

async def streaming_post(model_endpoint: str, data: dict):
    async with (
        aiohttp.ClientSession() as session,
        session.post(
            model_endpoint,
            data=data,
        ) as resp,
    ):
        async for chunk in resp.content.iter_chunked(32):
            text = chunk.decode()

            ...

            if "<|endoftext|>" in text:
                text = text.replace("<|endoftext|>", "")
                break

asyncio.run(streaming_post("http://localhost:8080/predictions/$MODEL_NAME",
    data={"text": instruction, **optional_generation_parameters})
)
asyncio.run(streaming_post("http://localhost:8080/predictions/$MODEL_NAME",
    data={"inputs_embeds": embeddings, **optional_generation_parameters})
)
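
An equivalent streaming call with httpx (the other async library mentioned above) might look like the following sketch; the endpoint and payload format are unchanged:

import httpx

async def streaming_post_httpx(model_endpoint: str, data: dict) -> str:
    text = ""
    async with httpx.AsyncClient(timeout=None) as client:
        # Stream the response body and stop once the end-of-stream token appears.
        async with client.stream("POST", model_endpoint, data=data) as resp:
            async for chunk in resp.aiter_text():
                if "<|endoftext|>" in chunk:
                    text += chunk.replace("<|endoftext|>", "")
                    break
                text += chunk
    return text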

gRPC

Danger

Only streaming is available via gRPC. Do not attempt to make a non-streaming request via gRPC to LLM-API.

Warning

TorchServe forces batch_size to 1 when using gRPC streaming.

To perform inferencing via gRPC, import the supplied gRPC client from LLM-API-client and invoke a request with an input payload similar in structure to the REST call format: a dictionary with one key, text or inputs_embeds, whose value is the input itself. The output is streamed and is again terminated by <|endoftext|> once the stream is complete.

As an example using grpc.aio:

import grpc

from llm_api.proto import inference_pb2, inference_pb2_grpc

...

async def grpc_post(idx, data):
    async with grpc.aio.insecure_channel("localhost:7070") as channel:
        stub = inference_pb2_grpc.InferenceAPIsServiceStub(channel)
        response = stub.StreamPredictions(
            inference_pb2.PredictionsRequest(model_name=MODEL_NAME, input=data)
        )
        async for chunk in response:
            text = chunk.prediction.decode()

            ...

            if "<|endoftext|>" in text:
                text = text.replace("<|endoftext|>", "")
                break
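
A possible way to drive the coroutine above. TorchServe's PredictionsRequest.input maps string keys to bytes, so the prompt is encoded here; the exact payload format expected by the custom handler is an assumption:

import asyncio

# Hypothetical invocation; payload keys mirror the HTTP examples.
asyncio.run(grpc_post(0, {"text": "Summarize the deployment steps.".encode("utf-8")}))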