# Torch-related Backends

## Overview
Name | Main Initialization Parameters | Input[Type] | Output[Type] | Note
---|---|---|---|---
DecodeTensor | color (default=rgb); data_format (default=nchw) | data[str/bytes] | result[at::Tensor], color[str] | color: rgb or bgr; data_format: nchw or hwc (v0.3.2rc3) |
cvtColorTensor | color | data[at::Tensor], color[str] | result[at::Tensor] | |
ResizeTensor | resize_h, resize_w | data[at::Tensor] | result[at::Tensor] | |
PillowResizeTensor | resize_h, resize_w | data[at::Tensor] | result[at::Tensor] | CV_8UC3 |
ResizePadTensor | max_h, max_w, pad_value | data[at::Tensor] | result[at::Tensor], inverse_trans[std::function<std::pair<float, float>(float x, float y)>] | |
TensorrtTensor | model, instance_num, min, max, precision, mean, std, model::cache | data[at::Tensor / std::vector<at::Tensor>] | result[at::Tensor / std::vector<at::Tensor>] | TensorRT inference engine |
Tensor2Mat | | data[at::Tensor] | result[cv::Mat] | |
Tensor2Vector | | data[at::Tensor] | result[std::vector] | |
SyncTensor | SyncTensor::backend | data[at::Tensor] | result[at::Tensor] | CUDA stream synchronization facility |
Torch | device_id, Torch::backend | data[at::Tensor] | result[at::Tensor] | |
SaveTensor | save_dir | data[at::Tensor] | result[at::Tensor] | |
LoadTensor | tensor_name | | result[at::Tensor] | |
C10Exception | | | | Throws a c10::Error exception; used to simulate internal Torch exceptions |
## DecodeTensor

- Calls `nvjpeg` for GPU decoding, with a size limit of `h*w < 5000*5000`. The output data shape is 13hw (that is, 1×3×h×w).
- If the decoded image is empty, no `result` key will be present in the output; see the sketch below for handling this.
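Since a decode failure is signaled by the absence of the `result` key rather than by an exception, callers should check for the key before use. A minimal sketch, assuming torchpipe's `dict` type (a shared pointer to a string-to-any map, as used by PostProcessor below) and an `any_cast` helper; both of these names are assumptions here:

```cpp
#include <ATen/ATen.h>

// Hedged sketch: DecodeTensor reports failure by omitting TASK_RESULT_KEY.
void handle_decode_output(dict input) {
  auto it = input->find(TASK_RESULT_KEY);
  if (it == input->end()) {
    // Decoding failed (e.g. corrupt JPEG): no `result` key was written.
    return;
  }
  at::Tensor img = any_cast<at::Tensor>(it->second);  // shape 13hw (1x3xhxw)
}
```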
## cvtColorTensor

- The `color` parameter set at initialization is the target color space, while the `color` key read from the input gives the color space of the data. If the two differ, a color space conversion is performed; otherwise the input is returned unchanged (a minimal sketch of this decision follows the list). `color` currently supports "rgb" and "bgr".
- The input must have shape 13hw.
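For the supported rgb/bgr pair, the conversion amounts to reversing the channel dimension. A minimal sketch of the decision described above (illustrative, not the backend's actual code):

```cpp
#include <ATen/ATen.h>
#include <string>

// Convert only when the input's color space differs from the configured
// target; rgb <-> bgr on a 13hw tensor is a flip along the channel dim.
at::Tensor maybe_convert(const at::Tensor& data, const std::string& src_color,
                         const std::string& target_color) {
  if (src_color == target_color) return data;  // nothing to do
  return data.flip({1});  // reverse the 3 channels of the 1x3xhxw tensor
}
```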
## ResizeTensor

- Calls `at::upsample_bilinear2d` for resizing (an equivalent call is sketched below). `resize_h` and `resize_w` must be integers in the valid range [1, 1024 * 1024].
- The input must have shape 13hw.
- The output has shape 13hw with float data type.
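A minimal sketch of the equivalent ATen call; the `align_corners` choice here is an assumption, not confirmed by this document:

```cpp
#include <ATen/ATen.h>

// Bilinearly resize a 13hw tensor to (resize_h, resize_w); the output is
// float, matching the documented behavior.
at::Tensor resize_tensor(const at::Tensor& data, int64_t resize_h, int64_t resize_w) {
  return at::upsample_bilinear2d(data.to(at::kFloat), {resize_h, resize_w},
                                 /*align_corners=*/false);
}
```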
## PillowResizeTensor

- The input `at::Tensor` must be of type `at::kByte` with shape 13hw. `resize_h` and `resize_w` must be integers in the valid range [1, 1024 * 1024].
- Strictly matches Pillow's bilinear interpolation results (verified on a large amount of data).
## ResizePadTensor

- Resizes while maintaining the aspect ratio, aligns the result to the top-left corner, and pads the rest with the constant `pad_value`.
- The output `at::Tensor` is float with shape 13hw. `max_h` and `max_w` must be integers in the valid range [1, 1024 * 1024]. `pad_value` supports integers, floating-point numbers, and multiple values separated by commas.
- `inverse_trans`: maps coordinates in the resized image back to coordinates in the original image, as shown in the sketch below.
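A minimal sketch of the mapping `inverse_trans` performs under the documented behavior (aspect-ratio-preserving resize aligned to the top-left corner); the factory function is illustrative, not the backend's actual code:

```cpp
#include <algorithm>
#include <functional>
#include <utility>

// Build the mapping from coordinates in the resized/padded image back to
// coordinates in the original image. With top-left alignment there is no
// offset to undo, only the uniform scale factor.
std::function<std::pair<float, float>(float, float)> make_inverse_trans(
    float src_h, float src_w, float max_h, float max_w) {
  const float ratio = std::min(max_h / src_h, max_w / src_w);
  return [ratio](float x, float y) {
    return std::make_pair(x / ratio, y / ratio);
  };
}
```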
## Tensor2Mat

- Converts `at::Tensor` to `cv::Mat`, keeping the data type unchanged.
- The input shape must be hw3 or 13hw.
- Similar to SyncTensor, it synchronizes the current stream.

:::caution
Please insert stream management operations, e.g. `Sequential[Tensor2Mat,SyncTensor]` or `SyncTensor[Tensor2Mat]`; otherwise `Tensor2Mat` will run on the default CUDA stream.
:::
## Tensor2Vector

- Converts `at::Tensor` to `std::vector`, keeping the data type unchanged (>=v0.4.1; currently only supports float).
- The input shape must be hw3 or 13hw.
- Similar to SyncTensor, it synchronizes the current stream.

:::caution
Please insert stream management operations, e.g. `Sequential[Tensor2Vector,SyncTensor]` or `SyncTensor[Tensor2Vector]`; otherwise `Tensor2Vector` will run on the default CUDA stream.
:::
## SyncTensor

- `SyncTensor::backend`: default=Identity
- Usage: `SyncTensor[BackendTensor]` or `Sequential[ATensor,BTensor,SyncTensor]`.
- In nested mode, stream synchronization is executed only once.
- When used directly in the PyTorch environment (without going through the scheduling backend), this backend does not take effect, in order to stay compatible with PyTorch CUDA semantics. When going through the default scheduling backend, initialization and forward can be assumed to run in the same independent thread.
- Aliases: TensorSync, Torch (effective from version 0.3.1b2)
### Implementation details

- The scheduling system guarantees that a backend instance's initialization and forward run in the same independent thread. SyncTensor activates its own functionality only after detecting this independent-thread mode.
- During initialization, SyncTensor checks whether the current thread is bound to the default stream. If so, it activates: it switches the thread to an independent stream at initialization time and performs stream synchronization during forward. (A minimal sketch of this logic follows the list.)
- The Sequential container guarantees that its sub-backends initialize in the reverse of their forward order. For example, `Sequential[SyncTensor[A],SyncTensor[B]]` initializes in reverse order and forwards in order: during initialization, SyncTensor[A] is not on the default stream, so it neither sets a new stream nor synchronizes during forward; SyncTensor[B], which did set a new stream, is the one responsible for stream synchronization.
- The Mat2Tensor and Tensor2Mat backends synchronize the current stream on their own. However, they cannot change the stream the thread is bound to, and still need to be switched to an independent stream via `S[Tensor2Mat,...,SyncTensor]`; otherwise performance will suffer.
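A minimal sketch of the activation logic above, using the c10 CUDA stream API; the class and member names are illustrative, not torchpipe's actual implementation:

```cpp
#include <c10/cuda/CUDAStream.h>

// Switch to an independent stream at init time if (and only if) the thread
// is still bound to the default stream; the instance that performed the
// switch is the one responsible for synchronization during forward.
class SyncTensorSketch {
 public:
  void init() {
    if (c10::cuda::getCurrentCUDAStream() == c10::cuda::getDefaultCUDAStream()) {
      owns_stream_ = true;
      c10::cuda::setCurrentCUDAStream(c10::cuda::getStreamFromPool());
    }
  }
  void forward() {
    // ... run the wrapped backend on the current (independent) stream ...
    if (owns_stream_) c10::cuda::getCurrentCUDAStream().synchronize();
  }

 private:
  bool owns_stream_ = false;
};
```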
## Torch

Similar to SyncTensor, but with additional cross-GPU functionality (sketched below). Effective from version 0.3.2b1.

- `Torch::backend`: required, e.g. `Torch[TensorrtTensor]`.
- `device_id`: defaults to -1. Sets the current device to this ID; during initialization this is equivalent to calling `c10::cuda::set_device(device_id)` or `torch.cuda.set_device(device_id)`. During forward, the input data type must be `at::Tensor` or `vector<at::Tensor>`; this backend moves the data to the specified GPU.
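A minimal sketch of the device handling described above; illustrative only:

```cpp
#include <ATen/ATen.h>
#include <c10/cuda/CUDAFunctions.h>

// At initialization the backend makes `device_id` the current device;
// during forward it moves incoming tensors onto that device before
// delegating to the inner backend (e.g. TensorrtTensor).
at::Tensor move_to_device(const at::Tensor& data, int device_id) {
  c10::cuda::set_device(static_cast<c10::DeviceIndex>(device_id));
  return data.to(at::Device(at::kCUDA, static_cast<c10::DeviceIndex>(device_id)));
}
```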
## SaveTensor

- `save_dir`: directory for saving files; it must be created in advance.
- Files are saved with the `.pt` suffix, and each file name is unique.
## TensorrtTensor

TensorRT inference engine.

### Initialization

The initialization parameters are:
Parameter | Description | Note
---|---|---
model | Model path | Supports: onnx files ending with .onnx; TensorRT engine files ending with .trt; encrypted files ending with .onnx.encrypted and .trt.encrypted |
instance_num | Number of instances | If the number of profiles in the TensorRT engine is not enough to build the requested number of instances, multiple engines will be deserialized. |
postprocessor | Custom post-processing | Custom C++ batch post-processing of the network output; the default behavior is to split along the batch dimension. Must be implemented as a subclass of PostProcessor and registered. |
For onnx models, the following additional parameters are available:

Parameter | Description | Note
---|---|---
min/max | Minimum/maximum input shapes of the model | Formats: `1`, `1x3x224x224`, or `1,1` (for multi-input networks). When `instance_num > 1`, multiple configurations can be given, separated by `;` (see the sketch after this table). |
precision | Model precision | One of [fp32, fp16, int8, best]. The default for versions <=0.3.1b1 is fp16; for versions >0.3.1b1 it is fp16 when SM>6.1 and fp32 when SM<=6.1. If the requested precision is not supported, it automatically falls back to a supported one. |
precision::fp32 | Set the precision of certain layers to fp32 (overrides the precision setting) | Layer names (partial names are allowed), separated by commas. |
precision::fp16 | Set the precision of certain layers to fp16 (overrides the precision setting) | Layer names (partial names are allowed), separated by commas. |
precision::output::fp32 | Set the output precision of certain layers to fp32 (overrides the precision setting) | Layer names (partial names are allowed), separated by commas. (>=0.3.1b2) |
precision::output::fp16 | Set the output precision of certain layers to fp16 (overrides the precision setting) | Layer names (partial names are allowed), separated by commas. (>=0.3.1b2) |
mean/std | Mean subtraction and variance division parameters for image preprocessing | This operation is inserted into the TensorRT network. The values need to be greater than 1+1e-5. (>=0.3.1b2) |
model::cache | Path for automatically caching the model | Supports file names with .trt and .trt.encrypted suffixes. If the file does not exist, it will be saved automatically; otherwise, this file will be loaded directly. |
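To illustrate the `min`/`max` format, here is a hypothetical parameter map for a single-input model with `instance_num = 2` and two profiles; all values are examples, not defaults:

```cpp
#include <string>
#include <unordered_map>

// Two profiles are separated by ';' (roughly one per instance); a
// multi-input network would additionally separate its per-input shapes
// with ',' (e.g. "1x3x224x224,1x1").
std::unordered_map<std::string, std::string> config = {
    {"model", "resnet18.onnx"},
    {"instance_num", "2"},
    {"min", "1x3x224x224;1x3x224x224"},
    {"max", "4x3x224x224;8x3x224x224"},
    {"precision", "fp16"},
    {"model::cache", "resnet18.trt"},
};
```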
For quantizing onnx models with TensorRT, the following parameters are available:

Parameter | Description | Note | Starting Version
---|---|---|---
calibrate_input | Calibration input directory | Tensors saved with torch.save or the SaveTensor backend, with shape 1chw. | |
calibrate_cache | Optional calibration cache, e.g. "resnet18.cache". If it exists, calibrate_input will be skipped. | Calibration can be expensive, so its result can be cached to a file. If the network structure or the input dataset changes, the network should be recalibrated. | >=0.3.0b4 |

See example.
### Forward Computation

Key | Description | Note
---|---|---
TASK_DATA_KEY (input) | When the network has a single input and a single output, the type is at::Tensor/torch.Tensor. When the network has multiple inputs or outputs, the type is vector/List. | Sorted in lexicographic order for trt<=9 |
TASK_RESULT_KEY (output) | The output type is the same as the input type; the postprocessor can customize the output. | Sorted in lexicographic order for trt<=9 |

#### min()/max()

The input range will be read from the TensorRT model.
### Postprocessing Extension

To facilitate batch postprocessing, TensorrtTensor provides a postprocessing extension point. The base class is:
```cpp
template <typename T = at::Tensor>
class PostProcessor {
 public:
  virtual bool init(const std::unordered_map<std::string, std::string>& /*config*/,
                    dict /*dict_config*/) {
    return true;
  }
  virtual void forward(std::vector<T> net_outputs, std::vector<dict> inputs,
                       const std::vector<T>& net_inputs) {
    if (inputs.size() == 1) {
      if (net_outputs.size() == 1)
        (*inputs[0])[TASK_RESULT_KEY] = net_outputs[0];
      else
        (*inputs[0])[TASK_RESULT_KEY] = net_outputs;
      return;
    }
    // Default behavior: split the batch dimension back into per-request results.
    for (std::size_t i = 0; i < inputs.size(); ++i) {
      std::vector<T> single_result;
      for (const auto& item : net_outputs) {
        single_result.push_back(item[i].unsqueeze(0));
      }
      if (single_result.size() == 1)
        (*inputs[i])[TASK_RESULT_KEY] = single_result[0];  // a single output yields a single value
      else
        (*inputs[i])[TASK_RESULT_KEY] = single_result;
    }
  }
  virtual ~PostProcessor() = default;
};
```
After inheriting from `PostProcessor<at::Tensor>` and implementing the forward interface, you can compile and use it via AOT compilation.

Built-in postprocessors:
Name | Functionality | Note
---|---|---
cpu | Copies data to the CPU | |
SoftmaxCpu | Performs softmax on a 2D tensor and copies the data to the CPU | from v0.3.2b3 |
SoftmaxMax | Performs softmax on a 2D tensor, then returns the maximum value and its corresponding index | from v0.3.2b3 |
### Reference Implementation

`SoftmaxCpu`:

```cpp
#include "prepost.hpp"

class BatchingPostProcSoftmaxCpu : public PostProcessor<at::Tensor> {
 public:
  void forward(std::vector<at::Tensor> net_outputs, std::vector<dict> input,
               const std::vector<at::Tensor>& net_inputs) override {
    for (auto& item : net_outputs) {
      if (item.dim() == 2) {
        item = item.softmax(1).cpu();  // implicit synchronization
      }
    }
    // Fall back to the default batch-splitting behavior.
    PostProcessor<at::Tensor>::forward(net_outputs, input, net_inputs);
  }
};

IPIPE_REGISTER(PostProcessor<at::Tensor>, BatchingPostProcSoftmaxCpu, "SoftmaxCpu");
```
## LoadTensor

Used to load tensors (.pt files) from disk; such files can be saved with `torch.save()`. To load an image instead, you can use `S[LoadTensor, Tensor2Mat, SyncTensor]`.