# Torch-related Backends

## Overview
Name | Main Initialization Parameters | Input[Type] | Output[Type] | Note
---|---|---|---|---
DecodeTensor | color (default=rgb); data_format (default=nchw) | data[str/bytes] | result[at::Tensor], color[str] | color: rgb or bgr; data_format: nchw or hwc (v0.3.2rc3) |
cvtColorTensor | color | data[at::Tensor], color[str] | result[at::Tensor] | |
ResizeTensor | resize_h, resize_w | data[at::Tensor] | result[at::Tensor] | |
PillowResizeTensor | resize_h, resize_w | data[at::Tensor] | result[at::Tensor] | CV_8UC3 |
ResizePadTensor | max_h, max_w, pad_value | data[at::Tensor] | result[at::Tensor], inverse_trans[std::function<std::pair<float, float>(float x, float y)>] | |
TensorrtTensor | model, instance_num, min, max, precision, mean, std, model::cache | data[at::Tensor / std::vector<at::Tensor>] | result[at::Tensor / std::vector<at::Tensor>] | TensorRT inference engine |
Tensor2Mat | | data[at::Tensor] | result[cv::Mat] | |
Tensor2Vector | | data[at::Tensor] | result[std::vector] | |
SyncTensor | SyncTensor::backend | data[at::Tensor] | result[at::Tensor] | CUDA stream synchronization facility |
Torch | device_id, Torch::backend | data[at::Tensor] | result[at::Tensor] | |
SaveTensor | save_dir | data[at::Tensor] | result[at::Tensor] | |
LoadTensor | tensor_name | | result[at::Tensor] | |
C10Exception | | | | Throws a c10::Error exception; used to simulate internal Torch exceptions |
## DecodeTensor

- Calls `nvjpeg` for GPU decoding, with a size limit of `h*w < 5000*5000`. The output data shape is 13hw (that is, 1×3×h×w).
- If the decoded image is empty, no `result` key will be present in the output; see the sketch below for handling this.
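Since a decode failure is signaled by the absence of the `result` key rather than by an exception, callers should check for the key before use. A minimal sketch, assuming torchpipe's `dict` type (a shared pointer to a string-to-any map, as used by PostProcessor below) and an `any_cast` helper; both of these names are assumptions here:

```cpp
#include <ATen/ATen.h>

// Hedged sketch: DecodeTensor reports failure by omitting TASK_RESULT_KEY.
void handle_decode_output(dict input) {
  auto it = input->find(TASK_RESULT_KEY);
  if (it == input->end()) {
    // Decoding failed (e.g. corrupt JPEG): no `result` key was written.
    return;
  }
  at::Tensor img = any_cast<at::Tensor>(it->second);  // shape 13hw (1x3xhxw)
}
```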
## cvtColorTensor

- The `color` parameter set at initialization is the target color space, while the `color` key read from the input gives the color space of the data. If the two differ, a color space conversion is performed; otherwise the input is returned unchanged (a minimal sketch of this decision follows the list). `color` currently supports "rgb" and "bgr".
- The input must have shape 13hw.
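For the supported rgb/bgr pair, the conversion amounts to reversing the channel dimension. A minimal sketch of the decision described above (illustrative, not the backend's actual code):

```cpp
#include <ATen/ATen.h>
#include <string>

// Convert only when the input's color space differs from the configured
// target; rgb <-> bgr on a 13hw tensor is a flip along the channel dim.
at::Tensor maybe_convert(const at::Tensor& data, const std::string& src_color,
                         const std::string& target_color) {
  if (src_color == target_color) return data;  // nothing to do
  return data.flip({1});  // reverse the 3 channels of the 1x3xhxw tensor
}
```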
## ResizeTensor

- Calls `at::upsample_bilinear2d` for resizing (an equivalent call is sketched below). `resize_h` and `resize_w` must be integers in the valid range [1, 1024 * 1024].
- The input must have shape 13hw.
- The output has shape 13hw with float data type.
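A minimal sketch of the equivalent ATen call; the `align_corners` choice here is an assumption, not confirmed by this document:

```cpp
#include <ATen/ATen.h>

// Bilinearly resize a 13hw tensor to (resize_h, resize_w); the output is
// float, matching the documented behavior.
at::Tensor resize_tensor(const at::Tensor& data, int64_t resize_h, int64_t resize_w) {
  return at::upsample_bilinear2d(data.to(at::kFloat), {resize_h, resize_w},
                                 /*align_corners=*/false);
}
```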
## PillowResizeTensor

- The input `at::Tensor` must be of type `at::kByte` with shape 13hw. `resize_h` and `resize_w` must be integers in the valid range [1, 1024 * 1024].
- Strictly matches Pillow's bilinear interpolation results (verified on a large amount of data).
## ResizePadTensor

- Resizes while maintaining the aspect ratio, aligns the result to the top-left corner, and pads the rest with the constant `pad_value`.
- The output `at::Tensor` is float with shape 13hw. `max_h` and `max_w` must be integers in the valid range [1, 1024 * 1024]. `pad_value` supports integers, floating-point numbers, and multiple values separated by commas.
- `inverse_trans`: maps coordinates in the resized image back to coordinates in the original image, as shown in the sketch below.
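A minimal sketch of the mapping `inverse_trans` performs under the documented behavior (aspect-ratio-preserving resize aligned to the top-left corner); the factory function is illustrative, not the backend's actual code:

```cpp
#include <algorithm>
#include <functional>
#include <utility>

// Build the mapping from coordinates in the resized/padded image back to
// coordinates in the original image. With top-left alignment there is no
// offset to undo, only the uniform scale factor.
std::function<std::pair<float, float>(float, float)> make_inverse_trans(
    float src_h, float src_w, float max_h, float max_w) {
  const float ratio = std::min(max_h / src_h, max_w / src_w);
  return [ratio](float x, float y) {
    return std::make_pair(x / ratio, y / ratio);
  };
}
```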
## Tensor2Mat

- Converts `at::Tensor` to `cv::Mat`, keeping the data type unchanged.
- The input shape must be hw3 or 13hw.
- Similar to SyncTensor, it synchronizes the current stream.

:::caution
Please insert stream management operations, e.g. `Sequential[Tensor2Mat,SyncTensor]` or `SyncTensor[Tensor2Mat]`; otherwise `Tensor2Mat` will run on the default CUDA stream.
:::
## Tensor2Vector

- Converts `at::Tensor` to `std::vector`, keeping the data type unchanged (>=v0.4.1; currently only supports float).
- The input shape must be hw3 or 13hw.
- Similar to SyncTensor, it synchronizes the current stream.

:::caution
Please insert stream management operations, e.g. `Sequential[Tensor2Vector,SyncTensor]` or `SyncTensor[Tensor2Vector]`; otherwise `Tensor2Vector` will run on the default CUDA stream.
:::
## SyncTensor

- `SyncTensor::backend`: default=Identity
- Usage: `SyncTensor[BackendTensor]` or `Sequential[ATensor,BTensor,SyncTensor]`.
- In nested mode, stream synchronization is executed only once.
- When used directly in the PyTorch environment (without going through the scheduling backend), this backend does not take effect, in order to stay compatible with PyTorch CUDA semantics. When going through the default scheduling backend, initialization and forward can be assumed to run in the same independent thread.
- Aliases: TensorSync, Torch (effective from version 0.3.1b2)
### Implementation details

- The scheduling system guarantees that a backend instance's initialization and forward run in the same independent thread. SyncTensor activates its own functionality only after detecting this independent-thread mode.
- During initialization, SyncTensor checks whether the current thread is bound to the default stream. If so, it activates: it switches the thread to an independent stream at initialization time and performs stream synchronization during forward. (A minimal sketch of this logic follows the list.)
- The Sequential container guarantees that its sub-backends initialize in the reverse of their forward order. For example, `Sequential[SyncTensor[A],SyncTensor[B]]` initializes in reverse order and forwards in order: during initialization, SyncTensor[A] is not on the default stream, so it neither sets a new stream nor synchronizes during forward; SyncTensor[B], which did set a new stream, is the one responsible for stream synchronization.
- The Mat2Tensor and Tensor2Mat backends synchronize the current stream on their own. However, they cannot change the stream the thread is bound to, and still need to be switched to an independent stream via `S[Tensor2Mat,...,SyncTensor]`; otherwise performance will suffer.
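A minimal sketch of the activation logic above, using the c10 CUDA stream API; the class and member names are illustrative, not torchpipe's actual implementation:

```cpp
#include <c10/cuda/CUDAStream.h>

// Switch to an independent stream at init time if (and only if) the thread
// is still bound to the default stream; the instance that performed the
// switch is the one responsible for synchronization during forward.
class SyncTensorSketch {
 public:
  void init() {
    if (c10::cuda::getCurrentCUDAStream() == c10::cuda::getDefaultCUDAStream()) {
      owns_stream_ = true;
      c10::cuda::setCurrentCUDAStream(c10::cuda::getStreamFromPool());
    }
  }
  void forward() {
    // ... run the wrapped backend on the current (independent) stream ...
    if (owns_stream_) c10::cuda::getCurrentCUDAStream().synchronize();
  }

 private:
  bool owns_stream_ = false;
};
```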
## Torch

Similar to SyncTensor, but with additional cross-GPU functionality (sketched below). Effective from version 0.3.2b1.

- `Torch::backend`: required, e.g. `Torch[TensorrtTensor]`.
- `device_id`: defaults to -1. Sets the current device to this ID; during initialization this is equivalent to calling `c10::cuda::set_device(device_id)` or `torch.cuda.set_device(device_id)`. During forward, the input data type must be `at::Tensor` or `vector<at::Tensor>`; this backend moves the data to the specified GPU.
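A minimal sketch of the device handling described above; illustrative only:

```cpp
#include <ATen/ATen.h>
#include <c10/cuda/CUDAFunctions.h>

// At initialization the backend makes `device_id` the current device;
// during forward it moves incoming tensors onto that device before
// delegating to the inner backend (e.g. TensorrtTensor).
at::Tensor move_to_device(const at::Tensor& data, int device_id) {
  c10::cuda::set_device(static_cast<c10::DeviceIndex>(device_id));
  return data.to(at::Device(at::kCUDA, static_cast<c10::DeviceIndex>(device_id)));
}
```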
## SaveTensor

- `save_dir`: directory for saving files; it must be created in advance.
- Files are saved with the `.pt` suffix, and each file name is unique.
## TensorrtTensor

TensorRT inference engine.

### Initialization

The initialization parameters are:
Parameter | Description | Note
---|---|---
model | Model path | Supports: onnx files ending with .onnx; TensorRT engine files ending with .trt; encrypted files ending with .onnx.encrypted and .trt.encrypted |
instance_num | Number of instances | If the number of profiles in the TensorRT engine is not enough to build the requested number of instances, multiple engines will be deserialized. |
postprocessor | Custom post-processing | Custom C++ batch post-processing of the network output; the default behavior is to split along the batch dimension. Must be implemented as a subclass of PostProcessor and registered. |
For onnx models, the following additional parameters are available:

Parameter | Description | Note
---|---|---
min/max | Minimum/maximum input shapes of the model | Formats: `1`, `1x3x224x224`, or `1,1` (for multi-input networks). When `instance_num > 1`, multiple configurations can be given, separated by `;` (see the sketch after this table). |
precision | Model precision | One of [fp32, fp16, int8, best]. The default for versions <=0.3.1b1 is fp16; for versions >0.3.1b1 it is fp16 when SM>6.1 and fp32 when SM<=6.1. If the requested precision is not supported, it automatically falls back to a supported one. |
precision::fp32 | Set the precision of certain layers to fp32 (overrides the precision setting) | Layer names (partial names are allowed), separated by commas. |
precision::fp16 | Set the precision of certain layers to fp16 (overrides the precision setting) | Layer names (partial names are allowed), separated by commas. |
precision::output::fp32 | Set the output precision of certain layers to fp32 (overrides the precision setting) | Layer names (partial names are allowed), separated by commas. (>=0.3.1b2) |
precision::output::fp16 | Set the output precision of certain layers to fp16 (overrides the precision setting) | Layer names (partial names are allowed), separated by commas. (>=0.3.1b2) |
mean/std | Mean subtraction and variance division parameters for image preprocessing | This operation is inserted into the TensorRT network. The values need to be greater than 1+1e-5. (>=0.3.1b2) |
model::cache | Path for automatically caching the model | Supports file names with .trt and .trt.encrypted suffixes. If the file does not exist, it will be saved automatically; otherwise, this file will be loaded directly. |
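To illustrate the `min`/`max` format, here is a hypothetical parameter map for a single-input model with `instance_num = 2` and two profiles; all values are examples, not defaults:

```cpp
#include <string>
#include <unordered_map>

// Two profiles are separated by ';' (roughly one per instance); a
// multi-input network would additionally separate its per-input shapes
// with ',' (e.g. "1x3x224x224,1x1").
std::unordered_map<std::string, std::string> config = {
    {"model", "resnet18.onnx"},
    {"instance_num", "2"},
    {"min", "1x3x224x224;1x3x224x224"},
    {"max", "4x3x224x224;8x3x224x224"},
    {"precision", "fp16"},
    {"model::cache", "resnet18.trt"},
};
```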
For quantizing onnx models with TensorRT, the following parameters are available:

Parameter | Description | Note | Starting Version
---|---|---|---
calibrate_input | Calibration input directory | Tensors saved with torch.save or the SaveTensor backend, with shape 1chw. | |
calibrate_cache | Optional calibration cache, e.g. "resnet18.cache". If it exists, calibrate_input will be skipped. | Calibration can be expensive, so its result can be cached to a file. If the network structure or the input dataset changes, the network should be recalibrated. | >=0.3.0b4 |

See example.
### Forward Computation

Key | Description | Note
---|---|---
TASK_DATA_KEY (input) | When the network has a single input and a single output, the type is at::Tensor/torch.Tensor. When the network has multiple inputs or outputs, the type is vector/List. | Sorted in lexicographic order for trt<=9 |
TASK_RESULT_KEY (output) | The output type is the same as the input type; the postprocessor can customize the output. | Sorted in lexicographic order for trt<=9 |

#### min()/max()

The input range will be read from the TensorRT model.
### Postprocessing Extension

To facilitate batch postprocessing, TensorrtTensor provides a postprocessing extension point. The base class is:
```cpp
template <typename T = at::Tensor>
class PostProcessor {
 public:
  virtual bool init(const std::unordered_map<std::string, std::string>& /*config*/,
                    dict /*dict_config*/) {
    return true;
  }
  virtual void forward(std::vector<T> net_outputs, std::vector<dict> inputs,
                       const std::vector<T>& net_inputs) {
    if (inputs.size() == 1) {
      if (net_outputs.size() == 1)
        (*inputs[0])[TASK_RESULT_KEY] = net_outputs[0];
      else
        (*inputs[0])[TASK_RESULT_KEY] = net_outputs;
      return;
    }
    // Default behavior: split the batch dimension back into per-request results.
    for (std::size_t i = 0; i < inputs.size(); ++i) {
      std::vector<T> single_result;
      for (const auto& item : net_outputs) {
        single_result.push_back(item[i].unsqueeze(0));
      }
      if (single_result.size() == 1)
        (*inputs[i])[TASK_RESULT_KEY] = single_result[0];  // a single output yields a single value
      else
        (*inputs[i])[TASK_RESULT_KEY] = single_result;
    }
  }
  virtual ~PostProcessor() = default;
};
```
After inheriting from `PostProcessor<at::Tensor>` and implementing the forward interface, you can compile and use it via AOT compilation.

Built-in postprocessors:
Name | Functionality | Note
---|---|---
cpu | Copies data to the CPU | |
SoftmaxCpu | Performs softmax on a 2D tensor and copies the data to the CPU | from v0.3.2b3 |
SoftmaxMax | Performs softmax on a 2D tensor, then returns the maximum value and its corresponding index | from v0.3.2b3 |
### Reference Implementation

`SoftmaxCpu`:

```cpp
#include "prepost.hpp"

class BatchingPostProcSoftmaxCpu : public PostProcessor<at::Tensor> {
 public:
  void forward(std::vector<at::Tensor> net_outputs, std::vector<dict> input,
               const std::vector<at::Tensor>& net_inputs) override {
    for (auto& item : net_outputs) {
      if (item.dim() == 2) {
        item = item.softmax(1).cpu();  // implicit synchronization
      }
    }
    // Fall back to the default batch-splitting behavior.
    PostProcessor<at::Tensor>::forward(net_outputs, input, net_inputs);
  }
};

IPIPE_REGISTER(PostProcessor<at::Tensor>, BatchingPostProcSoftmaxCpu, "SoftmaxCpu");
```
## LoadTensor

Used to load tensors (.pt files) from disk; such files can be saved with `torch.save()`. To load an image instead, you can use `S[LoadTensor, Tensor2Mat, SyncTensor]`.