Torch-related Backends

Overview

| Name | Main Initialization Parameters | Input[Type] | Output[Type] | Note |
| --- | --- | --- | --- | --- |
| DecodeTensor | color (default=rgb); data_format (default=nchw) | data[str/bytes] | result[at::Tensor], color[str] | color: rgb or bgr; data_format: nchw or hwc (v0.3.2rc3) |
| cvtColorTensor | color | data[at::Tensor], color[str] | result[at::Tensor] | |
| ResizeTensor | resize_h, resize_w | data[at::Tensor] | result[at::Tensor] | |
| PillowResizeTensor | resize_h, resize_w | data[at::Tensor] | result[at::Tensor] | CV_8UC3 |
| ResizePadTensor | max_h, max_w, pad_value | data[at::Tensor] | result[at::Tensor], inverse_trans[std::function<std::pair<float, float>(float x, float y)>] | |
| TensorrtTensor | model, instance_num, max, precision, mean, std, model::cache | | | TensorRT inference engine |
| Tensor2Mat | | data[at::Tensor] | result[cv::Mat] | |
| Tensor2Vector | | data[at::Tensor] | result[std::vector] | |
| SyncTensor | SyncTensor::backend | data[at::Tensor] | result[at::Tensor] | CUDA stream synchronization facility |
| Torch | device_id, Torch::backend | data[at::Tensor] | result[at::Tensor] | |
| SaveTensor | save_dir | data[at::Tensor] | result[at::Tensor] | |
| LoadTensor | tensor_name | | result[at::Tensor] | |
| C10Exception | | | | Throws a c10::Error exception; used to simulate internal Torch exceptions |

DecodeTensor

  • Calls nvjpeg for GPU decoding, subject to the limit h*w < 5000*5000. The output shape is 1x3xhxw.
  • If the decoded image is empty, no result key is written to the output.
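
A minimal usage sketch, assuming the torchpipe Python front end (torchpipe.pipe); the file name is hypothetical:

```python
import torchpipe

# Decode JPEG bytes on the GPU into a 1x3xhxw tensor (rgb by default).
node = torchpipe.pipe({"backend": "DecodeTensor", "color": "rgb"})

with open("demo.jpg", "rb") as f:  # hypothetical input file
    task = {"data": f.read()}

node(task)  # runs in place: fills task["result"] on success
if "result" in task:  # a missing "result" key means decoding failed
    tensor = task["result"]  # at::Tensor, shape 1x3xhxw
```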

cvtColorTensor

  • The color parameter set at initialization is the target color space, while the color key read from the input gives the data's current color space. If they differ, a color space conversion is performed; otherwise the input is returned unchanged.
  • color currently supports "rgb" and "bgr".
  • The input must have shape 1x3xhxw.
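
A minimal sketch of this conversion contract, assuming the torchpipe Python front end:

```python
import torch
import torchpipe

# Init-time "color" is the target space; the task's "color" key describes
# the input data, so rgb -> bgr triggers a channel conversion here.
node = torchpipe.pipe({"backend": "cvtColorTensor", "color": "bgr"})

task = {"data": torch.rand(1, 3, 224, 224).cuda(), "color": "rgb"}
node(task)
bgr = task["result"]  # at::Tensor; returned unchanged if already bgr
```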

ResizeTensor

  • Calls at::upsample_bilinear2d for resizing.
  • resize_h and resize_w must be integers, with a valid range of [1, 1024 * 1024].
  • The input must have shape 1x3xhxw.
  • The output has shape 1x3xhxw, with float data type.
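
A minimal sketch, assuming the torchpipe Python front end:

```python
import torch
import torchpipe

# Bilinear resize to 224x224 via at::upsample_bilinear2d.
node = torchpipe.pipe({
    "backend": "ResizeTensor",
    "resize_h": "224",
    "resize_w": "224",
})

task = {"data": torch.rand(1, 3, 512, 640).cuda()}
node(task)
resized = task["result"]  # float tensor, shape 1x3x224x224
```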

PillowResizeTensor

  • The input at::Tensor must be of type at::kByte, with shape 1x3xhxw.
  • resize_h and resize_w must be integers, with a valid range of [1, 1024 * 1024].
  • Strictly maintains consistency with the bilinear interpolation results of Pillow (verified on a large amount of data).

ResizePadTensor

  • Maintains the aspect ratio during resizing, aligns the image to the top-left corner, and pads the remainder with the constant pad_value.
  • The output at::Tensor is float, with shape 1x3xhxw.
  • max_h and max_w must be integers, with a valid range of [1, 1024 * 1024].
  • pad_value supports integers, floating-point numbers, and multiple comma-separated values.
  • inverse_trans: maps coordinates in the resized and padded image back to the original image (see the sketch below).
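
The coordinate mapping behind inverse_trans can be written out in plain Python. This is only a sketch of the documented behavior (keep aspect ratio, align top-left, constant padding), not the backend's actual implementation:

```python
def resize_pad_params(h, w, max_h, max_w):
    # Keep the aspect ratio; the image occupies the top-left corner and
    # the remaining area is filled with pad_value.
    scale = min(max_h / h, max_w / w)
    new_h, new_w = int(h * scale), int(w * scale)
    return scale, new_h, new_w

def make_inverse_trans(scale):
    # Maps a coordinate in the padded output back to the original image,
    # mirroring the std::function returned under the inverse_trans key.
    return lambda x, y: (x / scale, y / scale)
```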

Tensor2Mat

  • Converts at::Tensor to cv::Mat, while keeping the data type unchanged.
  • The input shape must be hxwx3 or 1x3xhxw.
  • Similar to SyncTensor, it synchronizes the current stream.
    caution

    Please insert stream management operations: Sequential[Tensor2Mat,SyncTensor] or SyncTensor[Tensor2Mat], otherwise Tensor2Mat will use the default CUDA stream.

Tensor2Vector

  • Converts at::Tensor to std::vector while keeping the data type unchanged (>=v0.4.1; currently only float is supported).
  • The input shape must be hxwx3 or 1x3xhxw.
  • Similar to SyncTensor, it synchronizes the current stream.
    caution

    Please insert stream management operations: Sequential[Tensor2Vector,SyncTensor] or SyncTensor[Tensor2Vector], otherwise Tensor2Vector will use the default CUDA stream.
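
Both conversions can be given an independent stream as the cautions above describe; a minimal sketch, assuming the torchpipe Python front end (S is the Sequential shorthand used elsewhere on this page):

```python
import torch
import torchpipe

# Wrapping the conversion in S[..., SyncTensor] switches the instance to
# an independent CUDA stream instead of the default stream.
node = torchpipe.pipe({"backend": "S[Tensor2Mat,SyncTensor]"})

task = {"data": torch.rand(1, 3, 224, 224).cuda()}
node(task)
mat = task["result"]  # cv::Mat on the C++ side
```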

SyncTensor

  • SyncTensor::backend: default=Identity
  • Usage:
    • SyncTensor[BackendTensor]
    • Sequential[ATensor,BTensor,SyncTensor]
    • Nested mode only executes stream synchronization once.
  • When used directly in the PyTorch environment (without going through the scheduling backend), this backend does not take effect, in order to stay compatible with PyTorch CUDA semantics. When going through the default scheduling backend, initialization and forward can be assumed to run in the same dedicated thread.
  • Aliases: TensorSync, Torch (effective from version 0.3.1b2)
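
A configuration sketch of the Sequential usage above, assuming the torchpipe Python front end; backend names and parameters are taken from this page:

```python
import torchpipe

# The trailing SyncTensor moves the whole chain onto an independent
# stream and synchronizes it once per forward.
node = torchpipe.pipe({
    "backend": "Sequential[DecodeTensor,ResizeTensor,SyncTensor]",
    "resize_h": "224",
    "resize_w": "224",
})

with open("demo.jpg", "rb") as f:  # hypothetical input file
    task = {"data": f.read()}
node(task)
out = task["result"]  # float tensor, 1x3x224x224
```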
Implementation details
  • The scheduling system ensures that the initialization and forward of a backend instance are executed in the same dedicated thread. Torch activates its own functionality only after detecting this dedicated-thread mode.
  • During initialization, Torch checks whether the current thread is bound to the default stream. If so, it activates its functionality: it switches the thread to an independent stream at initialization and performs stream synchronization during forward.
  • The Sequential container ensures that the initialization order of its sub-backends is opposite to the forward order, such as Sequential[SyncTensor[A],SyncTensor[B]], which initializes in reverse order and forwards in order:

Because initialization runs in reverse order, SyncTensor[B] initializes first: it finds the thread bound to the default stream, switches to a new stream, and therefore takes charge of stream synchronization during forward. When SyncTensor[A] initializes afterwards, the thread is no longer on the default stream, so it neither sets a new stream nor synchronizes during forward.

  • The Mat2Tensor and Tensor2Mat backends synchronize the current stream themselves. However, they cannot change the stream the thread is bound to; an independent stream still has to be set up via S[Tensor2Mat,...,SyncTensor], otherwise performance suffers.

Torch

Similar to SyncTensor, but with additional cross-card functionality. Effective from version 0.3.2b1.

  • Torch::backend: required, like Torch[TensorrtTensor]
  • device_id: default is -1. During initialization the current device is set to this ID, which is equivalent to calling c10::cuda::set_device(device_id) (torch.cuda.set_device(device_id) in Python). During forward, the input must be of type at::Tensor or std::vector<at::Tensor>; this backend moves the data to the specified card.
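
A minimal sketch, assuming the torchpipe Python front end; the model path is hypothetical:

```python
import torchpipe

# device_id=1 pins initialization and forward to GPU 1; input tensors
# are moved to that card automatically.
node = torchpipe.pipe({
    "backend": "Torch[TensorrtTensor]",
    "device_id": "1",
    "model": "model.onnx",  # hypothetical model path
})
```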

SaveTensor

  • save_dir: Directory for file saving, which needs to be created in advance.
  • The file name suffix is .pt, and the name is unique.

TensorrtTensor

TensorRT inference engine.

Initialization

The following are initialization parameters:

| Parameter | Description | Note |
| --- | --- | --- |
| model | Model path | Supports onnx files ending with .onnx, tensorrt engine files ending with .trt, and encrypted files ending with .onnx.encrypted or .trt.encrypted |
| instance_num | Number of instances | If the number of profiles in the tensorrt engine is not enough to create the requested instances, multiple engines will be deserialized. |
| postprocessor | Custom post-processing | Custom C++ batch post-processing of the network output; the default operation splits the batch dimension. Must be implemented as a subclass of PostProcessor and registered. |

For onnx models, there are the following additional parameters:

| Parameter | Description | Note |
| --- | --- | --- |
| min/max | Minimum/maximum input shape of the model | May be written as 1, 1x3x224x224, or 1,1 (for multi-input networks). When instance_num > 1, multiple configurations separated by ; may be given. |
| precision | Model precision | One of [fp32, fp16, int8, best]. The default for versions <=0.3.1b1 is fp16; for versions >0.3.1b1 it is fp16 when SM > 6.1 and fp32 when SM <= 6.1. If a precision is not supported, it automatically falls back to a supported one. |
| precision::fp32 | Set the precision of some layers to fp32 (overrides the precision setting) | Layer names (partial names allowed), separated by commas. |
| precision::fp16 | Set the precision of some layers to fp16 (overrides the precision setting) | Layer names (partial names allowed), separated by commas. |
| precision::output::fp32 | Set the output precision of some layers to fp32 (overrides the precision setting) | Layer names (partial names allowed), separated by commas. (>=0.3.1b2) |
| precision::output::fp16 | Set the output precision of some layers to fp16 (overrides the precision setting) | Layer names (partial names allowed), separated by commas. (>=0.3.1b2) |
| mean/std | Mean subtraction and division by variance in image preprocessing | This operation is inserted into the tensorrt network. Values must be greater than 1+1e-5. (>=0.3.1b2) |
| model::cache | Path for automatically caching the model | Supports file names with .trt and .trt.encrypted suffixes. If the file does not exist, it is saved automatically; otherwise, this file is loaded directly. |
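
A fuller initialization sketch combining the parameters above, assuming the torchpipe Python front end; the file names and mean/std values are hypothetical:

```python
import torchpipe

node = torchpipe.pipe({
    "backend": "SyncTensor[TensorrtTensor]",
    "model": "resnet18.onnx",
    "instance_num": "2",
    "min": "1x3x224x224",             # dynamic-shape range for the single input
    "max": "4x3x224x224",
    "precision": "fp16",
    "mean": "123.675,116.28,103.53",  # inserted into the tensorrt network
    "std": "58.395,57.12,57.375",
    "model::cache": "resnet18.trt",   # built engine is cached here
})
```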

For quantizing onnx models using tensorrt, the following parameters are available:

| Parameter | Description | Note | Starting Version |
| --- | --- | --- | --- |
| calibrate_input | Calibration input directory | Tensors saved with torch.save or the SaveTensor backend, with shape 1xcxhxw. | |
| calibrate_cache | Optional calibration cache, e.g. "resnet18.cache". If it exists, calibrate_input is skipped. | Calibration can be expensive, so it can be cached to a file. If the network structure or the input dataset changes, the network should be recalibrated. | >=0.3.0b4 |

See example.
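
A minimal int8 sketch, assuming the torchpipe Python front end; the directory and cache names are hypothetical:

```python
import torchpipe

node = torchpipe.pipe({
    "backend": "TensorrtTensor",
    "model": "resnet18.onnx",
    "precision": "int8",
    # 1xcxhxw tensors saved via torch.save or the SaveTensor backend:
    "calibrate_input": "calibration_dir",
    # Reused on later runs; when present, calibrate_input is skipped:
    "calibrate_cache": "resnet18.cache",
})
```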

Forward Computation

| Key | Description | Note |
| --- | --- | --- |
| TASK_DATA_KEY (input) | When the network has a single input and output, the type is at::Tensor/torch.Tensor. With multiple inputs and outputs, the type is vector/List. | Sorted in lexicographic order for trt<=9 |
| TASK_RESULT_KEY (output) | The output type matches the input type; the postprocessor can customize the output. | Sorted in lexicographic order for trt<=9 |

min()/max()

The input range will be read from the TensorRT model.

Postprocessing Extension

To facilitate batch postprocessing, TensorrtTensor introduces postprocessing extensions. The base class is:

```cpp
template <typename T = at::Tensor>
class PostProcessor {
 public:
  virtual bool init(const std::unordered_map<std::string, std::string>& /*config*/,
                    dict /*dict_config*/) {
    return true;
  }

  virtual void forward(std::vector<T> net_outputs, std::vector<dict> inputs,
                       const std::vector<T>& net_inputs) {
    if (inputs.size() == 1) {
      // Single task: hand the network output(s) over directly.
      if (net_outputs.size() == 1)
        (*inputs[0])[TASK_RESULT_KEY] = net_outputs[0];
      else
        (*inputs[0])[TASK_RESULT_KEY] = net_outputs;
      return;
    }
    // Default behavior: split the batch dimension across the input tasks.
    for (std::size_t i = 0; i < inputs.size(); ++i) {
      std::vector<T> single_result;
      for (const auto& item : net_outputs) {
        single_result.push_back(item[i].unsqueeze(0));
      }
      if (single_result.size() == 1) {
        // Single network output: return the tensor itself.
        (*inputs[i])[TASK_RESULT_KEY] = single_result[0];
      } else {
        (*inputs[i])[TASK_RESULT_KEY] = single_result;
      }
    }
  }

  virtual ~PostProcessor() = default;
};
```

After inheriting from PostProcessor<at::Tensor> and implementing the forward interface, the subclass can be compiled ahead of time (AOT) and used. Built-in postprocessing:

| Functionality | Note |
| --- | --- |
| cpu | Copies data to the CPU |
| SoftmaxCpu | Performs softmax on a 2D tensor and copies the data to the CPU (from v0.3.2b3) |
| SoftmaxMax | After softmax on a 2D tensor, returns the maximum value and its corresponding index (from v0.3.2b3) |
Reference Implementation
SoftmaxCpu:
#include "prepost.hpp"
class BatchingPostProcSoftmaxCpu : public PostProcessor<at::Tensor> {
public:
void forward(std::vector<at::Tensor> net_outputs, std::vector<dict> input,
const std::vector<at::Tensor>& net_inputs) {
for (auto& item : net_outputs) {
if (item.dim() == 2) {
item = item.softmax(1).cpu(); // Implicit Synchronization
}
}
PostProcessor<at::Tensor>::forward(net_outputs, input, net_inputs);
}
};

IPIPE_REGISTER(PostProcessor<at::Tensor>, BatchingPostProcSoftmaxCpu, "SoftmaxCpu");
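
On the configuration side, the registered name selects the post-processor; a minimal sketch, assuming the torchpipe Python front end and a hypothetical model path:

```python
import torchpipe

node = torchpipe.pipe({
    "backend": "SyncTensor[TensorrtTensor]",
    "model": "resnet18.onnx",
    "postprocessor": "SoftmaxCpu",  # matches the IPIPE_REGISTER name above
})
```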

LoadTensor

Used to load tensors (.pt files) from disk; such files can be produced with torch.save(). To load an image instead, use S[LoadTensor, Tensor2Mat, SyncTensor].
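
A round-trip sketch with torch.save, assuming the torchpipe Python front end; the paths are hypothetical, and treating the input dict as otherwise empty is an assumption:

```python
import torch
import torchpipe

torch.save(torch.rand(1, 3, 224, 224), "input.pt")

node = torchpipe.pipe({"backend": "LoadTensor", "tensor_name": "input.pt"})
task = {"data": ""}  # assumption: LoadTensor ignores the input data
node(task)
tensor = task["result"]  # the saved tensor
```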