# Basic usage
## Thread-Safe Local Inference
For convenience, assume that the TensorRT inference functionality is encapsulated as a "computational backend" named `TensorrtTensor`. Since the computation runs on the GPU, we wrap it with `SyncTensor` to represent stream synchronization on the GPU.
| Configuration | Value | Description |
|---|---|---|
| backend | "SyncTensor[TensorrtTensor]" | The computational backend; like TensorRT inference itself, it is not thread-safe. |
| max | 4 | The maximum batch size supported by the model, used during model conversion (ONNX -> TensorRT). |
By default, TorchPipe wraps this computational backend in an extensible single-node scheduling backend, which provides three basic capabilities:
- Thread safety of the forward interface
- Multi-instance parallelism

  | Configuration | Default | Description |
  |---|---|---|
  | instance_num | 1 | Run inference tasks in parallel on multiple model instances. |

- Batching

  | Configuration | Default | Description |
  |---|---|---|
  | batching_timeout | 0 | The batching timeout in milliseconds: how long to wait for additional requests before dispatching an incomplete batch. |
Summarizing the steps above, we arrive at the parameters needed to run ResNet18 inference with TorchPipe:
```python
import torchpipe as tp
import torch

config = {
    # Single-node scheduler parameters:
    "instance_num": 2,
    "batching_timeout": 5,
    # Computational backend:
    "backend": "SyncTensor[TensorrtTensor]",
    # Computational backend parameters:
    "model": "resnet18_-1x3x224x224.onnx",
    "max": 4,
}

# Initialization
models = tp.pipe(config)
data = torch.ones(1, 3, 224, 224).cuda()

# Forward
input = {"data": data}
models(input)  # <== can be called from multiple threads
result: torch.Tensor = torch.from_dlpack(input["result"])  # "result" does not exist if the inference failed
```
Assuming we want to support up to 10 concurrent client requests, `instance_num` is usually set to 2: the service can then process at most `instance_num * max = 8` requests simultaneously, while the remaining requests wait briefly in the scheduling queue.
## Sequential
`Sequential` can link multiple backends together. For example, `Sequential[DecodeTensor,ResizeTensor,CvtColorTensor,SyncTensor]` and `Sequential[DecodeMat,ResizeMat]` are themselves valid backends.
During the forward execution of `Sequential[DecodeMat,ResizeMat]`, the data (a `dict`) goes through the following process in sequence (a code sketch follows the list):

- Execute `DecodeMat`: `DecodeMat` reads `data` and assigns the result to `result` and `color`.
- Conditional control flow: attempt to assign the value of `result` in the data to `data`, and delete `result`.
- Execute `ResizeMat`: `ResizeMat` reads `data` and assigns the result to the `result` key.
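The hand-off in the middle step can be pictured with a short sketch. This is not TorchPipe's actual implementation: it assumes the `dict` type introduced under "Basic Types" below (with `std::any` standing in for `omniback::any` so the snippet compiles on its own) and a hypothetical minimal `Backend` interface, and it only illustrates the key renaming between consecutive stages.

```cpp
#include <any>
#include <cstddef>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

// std::any stands in for omniback::any so the sketch is self-contained.
using dict = std::shared_ptr<std::unordered_map<std::string, std::any>>;

// Hypothetical minimal stage interface, for this sketch only.
struct Backend {
  virtual void forward(dict data) = 0;
  virtual ~Backend() = default;
};

// Roughly what Sequential does when chaining stages:
void run_sequential(const std::vector<Backend*>& stages, dict data) {
  for (std::size_t i = 0; i < stages.size(); ++i) {
    stages[i]->forward(data);            // the stage writes its output to "result"
    if (i + 1 == stages.size()) break;   // keep the last stage's "result" for the caller
    auto it = data->find("result");
    if (it != data->end()) {             // conditional control flow:
      std::any value = std::move(it->second);
      data->erase(it);                   // ...delete "result"...
      (*data)["data"] = std::move(value);  // ...and hand it to the next stage as "data"
    }
  }
}
```

Note that the last stage's `result` is left in place, which is why the caller can read `input["result"]` after a successful forward.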
`Sequential` can be abbreviated as `S`.
## Custom backends
A common problem in real-world services is that the preset backends (computational backends, scheduling backends, RPC backends, cross-process backends, etc.) cannot cover every requirement. TorchPipe therefore treats the backend itself as a user-facing API.
### Basic Types
#### any
Similar to `std::any` in C++17, we have defined a type-erased container, `omniback::any`, with an almost identical interface.
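Since the interface is described as almost identical to `std::any`, the following `std::any` usage is representative; swapping in `omniback::any` should look much the same. The exact omniback header and cast-helper names are not shown here, so `std::any` is used to keep the example compilable:

```cpp
#include <any>
#include <iostream>
#include <string>

int main() {
  std::any a = 42;                // store an int
  a = std::string("resnet18");    // rebind the same container to another type
  // Typed retrieval; throws std::bad_any_cast on a type mismatch.
  std::cout << std::any_cast<std::string>(a) << "\n";
  return 0;
}
```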
#### dict
As a data carrier, similar to Python's `dict`, we have also defined the following `dict` in C++:
```cpp
#ifndef CUSTOM_DICT
using dict = std::shared_ptr<std::unordered_map<std::string, omniback::any>>;
#else
// With CUSTOM_DICT defined, an alternative dict definition is supplied elsewhere.
#endif
```
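Given that definition, constructing and filling a `dict` is plain standard-library code. In this sketch `std::any` again stands in for `omniback::any` so the snippet compiles on its own:

```cpp
#include <any>
#include <memory>
#include <string>
#include <unordered_map>

// Mirrors the dict definition above, with std::any in place of omniback::any.
using dict = std::shared_ptr<std::unordered_map<std::string, std::any>>;

int main() {
  dict input = std::make_shared<std::unordered_map<std::string, std::any>>();
  (*input)["data"] = std::string("decoded image bytes would go here");
  // After a successful forward, a backend would typically insert "result":
  (*input)["result"] = 3;
  int r = std::any_cast<int>((*input)["result"]);  // typed read-back
  return r == 3 ? 0 : 1;
}
```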
### Backend
TorchPipe limits the basic elements of a backend to the following (sketched in code after the list):

- Initialization: parameter configuration
- Forward: the input/output interface
- max/min: the batch-size range of the input data
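A hypothetical C++ sketch of that contract is shown below. The class and method names are illustrative only (TorchPipe's actual base class is not reproduced here, and `std::any` again stands in for `omniback::any`); the three elements mirror the list above:

```cpp
#include <any>
#include <cstdint>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

using dict = std::shared_ptr<std::unordered_map<std::string, std::any>>;

// Illustrative backend shape; the names here are hypothetical.
class MyBackend {
 public:
  // Initialization: receive the parameter configuration (e.g. "model", "max").
  bool init(const std::unordered_map<std::string, std::string>& config) {
    auto it = config.find("max");
    if (it != config.end()) max_ = static_cast<uint32_t>(std::stoul(it->second));
    return true;
  }

  // Forward: consume a batch of dicts, writing each output to "result".
  void forward(const std::vector<dict>& batch) {
    for (const dict& d : batch) {
      auto it = d->find("data");
      if (it != d->end()) (*d)["result"] = it->second;  // identity op, for illustration
    }
  }

  // max/min: the batch-size range this backend accepts per forward call.
  uint32_t max() const { return max_; }
  uint32_t min() const { return 1; }

 private:
  uint32_t max_ = 1;
};
```

The scheduling layer can then aggregate incoming requests into batches whose sizes fall within `[min(), max()]` before each `forward` call.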