Client Examples¶
This notebook demonstrates how to connect to the LLM API and make inference requests.
Start llm-api backend¶
Skip this step if you already have the backend running.
Load client side embedding layer¶
The LLM API can accept either text or input token embeddings. Instantiating a client-side model allows us to convert text to embeddings.
Often, one would load a Stained Glass Transform to generate transformed embeddings, which can then be passed in a request to the LLM API.
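The sketch below is purely illustrative: it stands in for a real Stained Glass Transform with an identity module, only to show where a transform would slot into the embedding pipeline used in this notebook.
import torch

# Illustrative placeholder only: a real Stained Glass Transform would be
# loaded from a trained artifact; torch.nn.Identity() stands in for it here
# so the sketch runs end to end.
stained_glass_transform = torch.nn.Identity()


def to_transformed_embeddings(embeddings: torch.Tensor) -> torch.Tensor:
    """Apply the (placeholder) transform to input embeddings.

    The resulting tensor would be serialized and sent to the LLM API in
    place of the plain input embeddings prepared below.
    """
    with torch.no_grad():
        return stained_glass_transform(embeddings)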
%env CUDA_VISIBLE_DEVICES=0
env: CUDA_VISIBLE_DEVICES=0
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import aiohttp, asyncio
device = "cpu"
model_path = "/models/huggingface/mistralai/Mistral-7B-Instruct-v0.2"
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map=device, torch_dtype=torch.bfloat16
).eval()
tokenizer = AutoTokenizer.from_pretrained(model_path)
embed = model.get_input_embeddings()
tokenizer.add_special_tokens({"pad_token": tokenizer.eos_token})
max_length = 2048
Loading checkpoint shards: 100%|██████████| 3/3 [00:00<00:00, 12.09it/s]
Prepare inputs¶
We specify text and then generate corresponding input embeddings. We will use both text and embedding inputs in the following examples.
import io
instruction = """
[INST]
Summarize this case in three bullet points.
### INPUT:
A 52-year-old man was riding his bicycle along the street when he was hit by a right-turning van. As a result of the crash, he sustained injuries to his leg, lower back, and neck. Represented by attorneys at Block O’Toole & Murphy, he recovered $1.3 million in compensation.
Age and Occupation of Plaintiff
Plaintiff was 52 years old and unemployed at the time of the accident.
Description of the Case
This case stems from an accident, which occurred when Plaintiff was riding his bicycle and was struck by a right turning van owned by Defendants. Plaintiff was riding his bicycle along the left side of the street, in the direction facing oncoming traffic, which was the improper side of the street for Plaintiff to be biking on.
Plaintiff brought his bike to a stop just before the driveway exit of a beach club located at the intersection of Beech Street and Richard Street. The Defendant van exited the beach club, turned right and struck the Plaintiff, knocking him to the ground. The Defendant van driver claimed that he never saw the Plaintiff prior to the impact, and that his van never left the roadway prior to the accident, meaning that Plaintiff had been stopped in the roadway and on the wrong side of the roadway.
Injuries/Damages
As a result of the impact, Plaintiff claimed injuries to his neck, lower back and left leg. Specifically, Plaintiff claimed fractures to the left fibular head and shaft, as well as herniated discs at the C6/7, L1/2 and L5/S1 levels of his spine. Ultimately, Plaintiff underwent a neck surgery consisting of anterior cervical discectomy with fusion at the C6/7 level. Plaintiff also underwent a lower back surgery, which involved discectomies and fusions at the L1/2 and L5/S1 levels.
Defendants vigorously contested damages, mainly based on Plaintiff’s prior history of neck and back problems. In this regard, Plaintiff was involved in a prior motor vehicle accident that occurred less than 3 months before the subject accident, for which he injured his neck and back and commenced a lawsuit. Plaintiff was taken to the Emergency Room following that prior accident, where he was diagnosed with rib fractures. Plaintiff then started a course of physical therapy treatments for his neck and lower back that continued right up to the subject accident.
Less than 1 month before the subject accident, Plaintiff had MRI testing that revealed herniated discs at the C6/7 and L5/S1 levels of his spine (2 of the same levels that he claimed were injured in the subject accident). Plaintiff also had a prior lower EMG that revealed nerve damage in the form of a left L5/S1 radiculopathy, and he was documented making continuing complaints of neck and back pain radiating to his left side with numbness and tingling in his hands all the way up until the accident. Based on this extensive prior neck and back injury history, plus the confirmed prior damage to his spine as revealed by prior MRI and EMG testing, the defense argued that all of Plaintiff’s claimed spine injuries pre-existed the accident and were unrelated to the accident.
Plaintiff had no lost earnings claim because he was unemployed at the time of this accident. Plaintiff did pursue claims for medical expenses, pain and suffering and loss of enjoyment of life. The case settled following a private mediation the day before jury selection was scheduled to commence in Kings County Supreme Court.
Settlement Amount
$1,300,000
Attorneys
This case was handled by firm Partners David L. Scher and Stephen J. Murphy.
[/INST]
"""
input_ids = tokenizer.encode(instruction, return_tensors="pt").to(model.device)
with torch.no_grad():
    inputs_embeds = embed(input_ids)
buff = io.BytesIO()
torch.save(inputs_embeds.detach().cpu(), buff)
payload = buff.getvalue()
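As a sanity check (purely illustrative), the serialized payload can be loaded back into a tensor locally; conceptually, this mirrors what the server does when it receives inputs_embeds.
# Round-trip check: the payload produced by torch.save can be restored with
# torch.load and matches the original embeddings exactly.
restored = torch.load(io.BytesIO(payload))
assert torch.equal(restored, inputs_embeds.detach().cpu())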
optional_generation_parameters = {
    "max_new_tokens": 200,
    "seed": 100,
}
optional_generation_parameters_bytes = {
    k: str(v).encode() for k, v in optional_generation_parameters.items()
}
Run inference over REST API¶
The LLM API provides a REST interface for making inference requests.
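Before the asynchronous examples, here is a minimal sketch of a single blocking request, assuming the requests library is available and the backend from the steps above is listening on localhost:8080.
import requests

# Minimal single request: the field names mirror the payloads built above.
response = requests.post(
    "http://localhost:8080/predictions/mistral-7b-instruct",
    data={"text": instruction, **optional_generation_parameters},
)
print(response.text)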
from typing import Any
async def post(idx: int, data: dict[str, Any], stream: bool = True) -> None:
    """Post data to the server and print the response.

    Args:
        idx: Index of the request.
        data: Data to send.
        stream: Whether to stream the response.
    """
    async with aiohttp.ClientSession() as session:
        async with session.post(
            "http://localhost:8080/predictions/mistral-7b-instruct",
            data=data,
        ) as resp:
            if stream:
                async for chunk in resp.content.iter_chunked(32):
                    text = chunk.decode()
                    print(f"{idx}: {text}")
                    if "<|endoftext|>" in text:
                        break
            else:
                generated = await resp.text()
                print(f"{idx}:\n{generated}")
Inference on texts and embeddings¶
Both text and corresponding input embeddings can be passed to the API. The API will return the same output for both.
await asyncio.gather(
    *[
        post(idx, data, False)
        for idx, data in enumerate(
            (
                {"text": instruction, **optional_generation_parameters},
                {"inputs_embeds": payload, **optional_generation_parameters_bytes},
            )
        )
    ]
)
0: * A 52-year-old unemployed man sustained injuries to his leg, lower back, and neck after being struck by a right-turning van while riding his bicycle on the wrong side of the road. * The man claimed fractures to his left fibular head and shaft, as well as herniated discs at the C6/7, L1/2, and L5/S1 levels of his spine, requiring neck and lower back surgeries. * Despite the man's extensive prior neck and back injury history and conflicting medical evidence, attorneys David L. Scher and Stephen J. Murphy secured a $1.3 million settlement for their client following a private mediation.<|endoftext|> 1: * A 52-year-old unemployed man sustained injuries to his leg, lower back, and neck after being struck by a right-turning van while riding his bicycle on the wrong side of the road. * The man claimed fractures to his left fibular head and shaft, as well as herniated discs at the C6/7, L1/2, and L5/S1 levels of his spine, requiring neck and lower back surgeries. * Despite the man's extensive prior neck and back injury history and conflicting medical evidence, attorneys David L. Scher and Stephen J. Murphy secured a $1.3 million settlement for their client following a private mediation.<|endoftext|>
[None, None]
Inference on texts and embeddings with streaming¶
As before, both text and the corresponding input embeddings can be passed to the API, and the API returns the same output for both.
This time, because the responses are streamed asynchronously, we can see that the text and embedding requests are processed in parallel.
await asyncio.gather(
    *[
        post(idx, data)
        for idx, data in enumerate(
            (
                {"text": instruction, **optional_generation_parameters},
                {"inputs_embeds": payload, **optional_generation_parameters_bytes},
            )
        )
    ]
)
1: * 0: * 1: A 0: A 0: 52-year-old 1: 52-year-old 1: unemployed 0: unemployed 1: man 0: man 1: sustained 0: sustained 1: injuries 0: injuries 0: to 1: to 0: his 1: his 1: leg, 0: leg, 1: lower 0: lower 1: back, 0: back, 1: and 0: and 1: neck 0: neck 1: after 0: after 0: being 1: being 1: struck 0: struck 1: by 0: by 0: a 1: a 1: right-turning 0: right-turning 1: van 0: van 1: while 0: while 0: riding 1: riding 1: his 0: his 1: bicycle 0: bicycle 0: on 1: on 0: the 1: the 1: wrong 0: wrong 0: side 1: side 1: of 0: of 1: the 0: the 0: road. 1: road. 1: * 0: * 0: The 1: The 0: man 1: man 1: claimed 0: claimed 0: fractures 1: fractures 1: to 0: to 0: his 1: his 1: left 0: left 0: fibular 1: fibular 0: head 1: head 0: and 1: and 0: shaft, 1: shaft, 0: as 1: as 0: well 1: well 0: as 1: as 0: herniated 1: herniated 0: discs 1: discs 0: at 1: at 0: the 1: the 0: C6/7, 1: C6/7, 0: L1/2, 1: L1/2, 0: and 1: and 1: L5/S1 0: L5/S1 0: levels 1: levels 0: of 1: of 1: his 0: his 0: spine, 1: spine, 0: requiring 1: requiring 1: neck 0: neck 0: and 1: and 0: lower 1: lower 1: back 0: back 1: surgeries. 0: surgeries. 0: * 1: * 0: Despite 1: Despite 0: the 1: the 0: man's 1: man's 0: extensive 1: extensive 0: prior 1: prior 0: neck 1: neck 0: and 1: and 0: back 1: back 0: injury 1: injury 0: history 1: history 0: and 1: and 0: conflicting 1: conflicting 0: medical 1: medical 0: evidence, 1: evidence, 0: attorneys 1: attorneys 0: David 1: David 0: L. 1: L. 0: Scher 1: Scher 0: and 1: and 0: Stephen 1: Stephen 0: J. 1: J. 0: Murphy 1: Murphy 0: secured 1: secured 0: a 1: a 1: $1.3 0: $1.3 0: million 1: million 0: settlement 1: settlement 1: for 0: for 1: their 0: their 1: client 0: client 0: following 1: following 0: a 1: a 1: private 0: private 1: mediation.<|endoftext|> 0: mediation.<|endoftext|>
[None, None]
Run inference over gRPC API¶
The LLM API also exposes a gRPC interface. First, generate the Python client stubs from the project's protobuf definitions.
%%bash
make grpc/generate-client
if [ -d pb2 ]; then rm -r pb2; fi
mkdir pb2
python -m grpc_tools.protoc --python_out=pb2 --grpc_python_out=pb2 --proto_path=llm_api/proto llm_api/proto/inference.proto llm_api/proto/management.proto llm_api/proto/status.proto
import sys
sys.path.append("pb2")
import grpc
import inference_pb2, inference_pb2_grpc
async def grpc_post(idx: int, data: dict[str, Any]) -> None:
    """Post data to the server using gRPC and print the response.

    Args:
        idx: Index of the request.
        data: Data to send.
    """
    async with grpc.aio.insecure_channel("localhost:7070") as channel:
        stub = inference_pb2_grpc.InferenceAPIsServiceStub(channel)
        response = stub.StreamPredictions(
            inference_pb2.PredictionsRequest(
                model_name="mistral-7b-instruct", input=data
            )
        )
        async for chunk in response:
            text = chunk.prediction.decode()
            print(f"{idx}: {text}")
            if "<|endoftext|>" in text:
                break
Inference on texts and embeddings with streaming¶
gRPC can be used to stream text and embeddings to the API. The API will return the same output for both.
Danger
Only streaming is available via gRPC. Do not attempt to make a non-streaming request to the LLM API via gRPC.
Warning
TorchServe forces batch_size to 1 when using gRPC streaming.
await asyncio.gather(
    *[
        grpc_post(idx, {**data, **optional_generation_parameters_bytes})
        for idx, data in enumerate(
            ({"text": instruction.encode()}, {"inputs_embeds": payload})
        )
    ]
)
0: 0: * 0: A 0: 0: 0: 0: 0: 0: 0: 52-year-old 0: 0: unemployed 0: man 0: sustained 0: injuries 0: to 0: his 0: 0: leg, 0: lower 0: 0: back, 0: and 0: neck 0: after 0: being 0: struck 0: by 0: a 0: 0: 0: 0: right-turning 0: van 0: while 0: riding 0: his 0: 0: bicycle 0: on 0: the 0: wrong 0: side 0: of 0: the 0: 0: road. 0: 0: * 0: The 0: man 0: claimed 0: 0: fractures 0: to 0: his 0: left 0: 0: fibular 0: head 0: and 0: 0: shaft, 0: as 0: well 0: as 0: 0: 0: herniated 0: 0: discs 0: at 0: the 0: 0: 0: 0: 0: C6/7, 0: 0: 0: 0: 0: L1/2, 0: and 0: 0: 0: 0: 0: L5/S1 0: levels 0: of 0: his 0: 0: spine, 0: leading 0: to 0: neck 0: and 0: lower 0: back 0: 0: 0: 0: surgeries. 0: 0: * 0: Despite 0: the 0: 0: 0: man's 0: extensive 0: prior 0: neck 0: and 0: back 0: injury 0: history 0: and 0: 0: conflicting 0: medical 0: 0: evidence, 0: 0: attorneys 0: David 0: 0: L. 0: 0: Scher 0: and 0: Stephen 0: 0: J. 0: Murphy 0: secured 0: a 0: 0: 0: 0: $1.3 0: million 0: settlement 0: for 0: their 0: client 0: following 0: a 0: private 0: 0: 0: mediation.<|endoftext|> 1: 1: * 1: A 1: 1: 1: 1: 1: 1: 1: 52-year-old 1: 1: unemployed 1: man 1: sustained 1: injuries 1: to 1: his 1: 1: leg, 1: lower 1: 1: back, 1: and 1: neck 1: after 1: being 1: struck 1: by 1: a 1: 1: 1: 1: right-turning 1: van 1: while 1: riding 1: his 1: 1: bicycle 1: on 1: the 1: wrong 1: side 1: of 1: the 1: 1: road. 1: 1: * 1: The 1: man 1: claimed 1: 1: fractures 1: to 1: his 1: left 1: 1: fibular 1: head 1: and 1: 1: shaft, 1: as 1: well 1: as 1: 1: 1: herniated 1: 1: discs 1: at 1: the 1: 1: 1: 1: 1: C6/7, 1: 1: 1: 1: 1: L1/2, 1: and 1: 1: 1: 1: 1: L5/S1 1: levels 1: of 1: his 1: 1: spine, 1: leading 1: to 1: neck 1: and 1: lower 1: back 1: 1: 1: 1: surgeries. 1: 1: * 1: Despite 1: the 1: 1: 1: man's 1: extensive 1: prior 1: neck 1: and 1: back 1: injury 1: history 1: and 1: 1: conflicting 1: medical 1: 1: evidence, 1: 1: attorneys 1: David 1: 1: L. 1: 1: Scher 1: and 1: Stephen 1: 1: J. 1: Murphy 1: secured 1: a 1: 1: 1: 1: $1.3 1: million 1: settlement 1: for 1: their 1: client 1: following 1: a 1: private 1: 1: 1: mediation.<|endoftext|>
[None, None]