# Parallel Inference of ResNet18

## Thread-Safe Local Inference

For convenience, let's assume that the TensorRT inference functionality is encapsulated as a "computational backend" named `TensorrtTensor`. Since the computation occurs on the GPU device, we add `SyncTensor` to represent the stream synchronization operation on the GPU.
| Configuration | Value | Description |
|---|---|---|
| backend | "SyncTensor[TensorrtTensor]" | The computational backend. TensorRT inference itself is not thread-safe. |
| max | 4 | The maximum batch size supported by the model, used for model conversion (ONNX -> TensorRT). |
A function is thread-safe if it can be called from multiple threads simultaneously. Providing a thread-safe interface greatly simplifies usage.

By default, TorchPipe wraps an extensible single-node scheduling backend around this "computational backend", which provides the following three basic capabilities:
**Thread safety of the forward interface**

**Multi-instance parallelism**

| Configuration | Default | Description |
|---|---|---|
| instance_num | 1 | Perform inference tasks in parallel with multiple model instances. |

**Batching**

For ResNet18, the model itself takes `-1x3x224x224` as input, where the leading -1 denotes a dynamic batch dimension. The larger the batch size, the more tasks can be completed per unit of hardware resources. The batch size is read from the "computational backend" (`TensorrtTensor`).

| Configuration | Default | Description |
|---|---|---|
| batching_timeout | 0 | The timeout in milliseconds. If a full batch of requests has not been received within this time, waiting is abandoned and the requests gathered so far are processed. |
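The ONNX file used in the examples below needs a dynamic batch dimension for batching to work. A minimal export sketch, assuming the stock ResNet18 from a recent torchvision; the input/output names, the file name `resnet18_-1x3x224x224.onnx` (which mirrors the input shape), and the opset version are illustrative choices, not fixed by TorchPipe:

```python
import torch
import torchvision

# Export ResNet18 to ONNX with a dynamic batch dimension (the "-1" in -1x3x224x224).
model = torchvision.models.resnet18(weights=None).eval()
dummy = torch.randn(1, 3, 224, 224)
torch.onnx.export(
    model,
    dummy,
    "resnet18_-1x3x224x224.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},  # batch dim is dynamic
    opset_version=13,
)
```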
Summarizing the above steps, we obtain the necessary parameters for inference of ResNet18 under TorchPipe:
**Python:**

```python
import torchpipe as tp
import torch
config = {
# Single-node scheduler parameters:
"instance_num": 2,
"batching_timeout": 5,
# Computational backend:
"backend": "SyncTensor[TensorrtTensor]",
# Computational backend parameters:
"model": "resnet18_-1x3x224x224.onnx",
"max": 4
}
# Initialization
models = tp.pipe(config)
data = torch.ones(1, 3, 224, 224).cuda()
# Forward
input = {"data": data}
models(input)  # <== can be called from multiple threads
result: torch.Tensor = input["result"]  # "result" is absent if the inference failed
```
#include "Interpreter.hpp"
#include "ATen/ATen.h"
int main() {
  // Configuration for the single-node scheduler and the computational backend
  std::unordered_map<std::string, std::string> config = {
      {"instance_num", "2"},
      {"batching_timeout", "5"},
      {"backend", "SyncTensor[TensorrtTensor]"},
      {"model", "resnet18_-1x3x224x224.onnx"},
      {"max", "4"}};

  // Initialize the interpreter model
  ipipe::Interpreter model(config);

  // Prepare input data on the current CUDA device (device index -1 = current device)
  auto input = std::make_shared<std::unordered_map<std::string, ipipe::any>>();
  (*input)["data"] = at::empty({1, 3, 224, 224},
                               at::TensorOptions().device(at::kCUDA, -1).dtype(at::kFloat));

  // Perform inference (can be called from multiple threads)
  model(input);

  // Get the output tensor; TASK_RESULT_KEY is absent if inference failed
  at::Tensor result = ipipe::any_cast<at::Tensor>(input->at(ipipe::TASK_RESULT_KEY));
  return 0;
}
```
With instance_num = 2 and max = 4 as above, the system can process at most instance_num * max = 8 requests simultaneously. If we want to support, say, a maximum of 10 concurrent client requests, instance_num should be raised to 3 (3 * 4 = 12 >= 10).
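To see the thread-safe forward in action, here is a minimal sketch that issues 10 concurrent requests from a thread pool; it assumes the `models` pipe created by `tp.pipe(config)` in the Python example above, and the `infer` helper is purely illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

import torch

def infer(_):
    # Each thread builds its own input dict; the forward call itself is thread-safe.
    inp = {"data": torch.ones(1, 3, 224, 224).cuda()}
    models(inp)  # `models` is the pipe created by tp.pipe(config) above
    return inp.get("result")  # None if the inference failed

with ThreadPoolExecutor(max_workers=10) as pool:
    outputs = list(pool.map(infer, range(10)))
```

Concurrent callers also help throughput: requests arriving within batching_timeout can be gathered into a single batch, improving GPU utilization.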
## Performance Trade-offs

Please note that our acceleration assumes the following:

- Copying data from CPU to CPU and from GPU to GPU within a single card is fast, consumes few resources, and can be ignored overall.

This assumption holds relative to data copying between heterogeneous devices and other computations. However, we will see later that in some special scenarios this assumption may not hold, and corresponding avoidance measures are required.
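As a rough illustration of why same-device copies can usually be ignored, here is a minimal micro-benchmark sketch comparing a GPU-to-GPU copy with a CPU-to-GPU transfer. It requires a CUDA device, and the absolute numbers vary by hardware:

```python
import time

import torch

x_cpu = torch.randn(4, 3, 224, 224)
x_gpu = x_cpu.cuda()

# GPU-to-GPU copy within a single card
torch.cuda.synchronize()
t0 = time.perf_counter()
for _ in range(100):
    _ = x_gpu.clone()
torch.cuda.synchronize()
print(f"GPU->GPU: {time.perf_counter() - t0:.4f}s")

# CPU-to-GPU copy across the PCIe bus
t0 = time.perf_counter()
for _ in range(100):
    _ = x_cpu.cuda()
torch.cuda.synchronize()
print(f"CPU->GPU: {time.perf_counter() - t0:.4f}s")
```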
## Summary
At the interface level, we have achieved all of our predetermined goals:
- Using frameworks such as TensorRT for model-specific acceleration
- Avoiding frequent memory allocation: achieved through PyTorch's memory pool
- Multiple instances and batching: achieved through a single-node scheduling backend
- Optimizing data transmission: achieved by using `torch.Tensor` as the input and output carrier