
Parallel Inference of ResNet18

Thread-Safe Local Inference

For convenience, let's assume that the TensorRT inference functionality is encapsulated as a "computational backend" named TensorrtTensor. Since the computation occurs on the GPU device, we add SyncTensor to represent the stream synchronization operation on the GPU.

| Configuration | Parameter | Description |
| --- | --- | --- |
| backend | "SyncTensor[TensorrtTensor]" | The computational backend. Like TensorRT inference itself, it is not thread-safe. |
| max | 4 | The maximum batch size supported by the model, used for model conversion (ONNX -> TensorRT). |
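The ONNX file referenced in the configuration below ("resnet18_-1x3x224x224.onnx") has a dynamic batch dimension. A minimal sketch of how such a file might be exported is shown here; the use of torchvision, the axis names, and the opset version are illustrative assumptions rather than TorchPipe requirements:

```python
import torch
import torchvision

# Export ResNet18 with a dynamic batch dimension (-1x3x224x224).
# weights=None skips downloading pretrained weights (use pretrained=False on older torchvision).
model = torchvision.models.resnet18(weights=None).eval()
dummy = torch.randn(1, 3, 224, 224)

torch.onnx.export(
    model,
    dummy,
    "resnet18_-1x3x224x224.onnx",
    input_names=["input"],
    output_names=["output"],
    # Mark the first axis as dynamic so the engine can accept batches up to "max".
    dynamic_axes={"input": {0: "batch_size"}, "output": {0: "batch_size"}},
    opset_version=13,
)
```

With `max` set to 4, the subsequent ONNX -> TensorRT conversion builds an engine that accepts batches of up to 4 inputs.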

A function is thread-safe if it can be called from multiple threads simultaneously without additional synchronization. Providing a thread-safe interface greatly simplifies usage for callers.

By default, TorchPipe wraps an extensible single-node scheduling backend on top of this "computational backend," which provides the following three basic capabilities:

  • Thread safety of the forward interface

  • Multi-instance parallelism

    | Configuration | Default | Description |
    | --- | --- | --- |
    | instance_num | 1 | Perform inference tasks in parallel with multiple model instances. |
  • Batching

    For ResNet18, the model itself takes an input of shape -1x3x224x224, where -1 is a dynamic batch dimension. The larger the batch size, the more work can be completed per unit of hardware resources. The batch size is read from the "computational backend" (TensorrtTensor). See the sketch after this list for how these parameters bound concurrency and latency.

    | Configuration | Default | Description |
    | --- | --- | --- |
    | batching_timeout | 0 | Timeout in milliseconds. If enough requests to fill a batch do not arrive within this time, waiting is abandoned and the requests collected so far are processed. |
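As a worked illustration of how these scheduler parameters interact (the numbers are taken from the example configuration in the next section), the achievable concurrency and the added latency can be bounded roughly as follows:

```python
# Illustrative arithmetic only; values come from the example configuration below.
max_batch = 4          # "max": largest batch one TensorrtTensor instance accepts
instance_num = 2       # number of parallel model instances
batching_timeout = 5   # ms the scheduler may wait while assembling a batch

# At most instance_num * max_batch requests are computed in batches at the same time.
peak_batched_requests = instance_num * max_batch   # 2 * 4 = 8

# Under light load, a single request may wait up to batching_timeout before being
# forwarded on its own, so batching adds at most ~5 ms to its latency.
worst_case_added_latency_ms = batching_timeout     # 5
```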


Summarizing the above steps, we obtain the necessary parameters for inference of ResNet18 under TorchPipe:

```python
import torchpipe as tp
import torch

config = {
    # Single-node scheduler parameters:
    "instance_num": 2,
    "batching_timeout": 5,
    # Computational backend:
    "backend": "SyncTensor[TensorrtTensor]",
    # Computational backend parameters:
    "model": "resnet18_-1x3x224x224.onnx",
    "max": 4,
}

# Initialization
models = tp.pipe(config)
data = torch.ones(1, 3, 224, 224).cuda()

# Forward
input = {"data": data}
models(input)  # <== can be called from multiple threads
result: torch.Tensor = input["result"]  # "result" does not exist if the inference failed
```

Assuming we want to support up to 10 clients/concurrent requests, instance_num is typically set to 2, so that at most instance_num * max = 8 requests are computed in batches at the same time (any remaining requests queue until an instance becomes free).
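A minimal sketch of driving the thread-safe forward interface from several client threads is shown below; the use of ThreadPoolExecutor and the choice of 10 clients are illustrative, not TorchPipe requirements. It assumes the models object from the snippet above:

```python
from concurrent.futures import ThreadPoolExecutor

def infer(x: torch.Tensor) -> torch.Tensor:
    io = {"data": x}
    models(io)           # thread-safe; the scheduler batches concurrent requests
    return io["result"]  # raises KeyError if inference failed

# Simulate 10 concurrent clients sharing the same pipeline.
inputs = [torch.ones(1, 3, 224, 224).cuda() for _ in range(10)]
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(infer, inputs))
```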

Performance Trade-offs

Please note that our acceleration assumes the following:

Copying data from CPU to CPU, and from GPU to GPU within a single card, is fast, consumes few resources, and can on the whole be ignored.

Tip

This assumption holds relative to the cost of copying data between heterogeneous devices and to the other computations involved. However, as we will see later, in some special scenarios this assumption may not hold, and corresponding workarounds are required.
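To check this assumption on a given machine, one can compare copy costs directly. The following micro-benchmark sketch (function name and tensor sizes are illustrative) uses CUDA events to time a host-to-device copy against a device-to-device copy on the same card:

```python
import torch

def time_op(fn, iters=100):
    """Average time of a CUDA operation in milliseconds, measured with CUDA events."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    fn()  # warm-up
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

cpu_tensor = torch.ones(4, 3, 224, 224).pin_memory()
gpu_tensor = cpu_tensor.cuda()

h2d = time_op(lambda: cpu_tensor.cuda(non_blocking=True))  # host -> device copy
d2d = time_op(lambda: gpu_tensor.clone())                  # device -> device copy on the same card
print(f"H2D: {h2d:.3f} ms, D2D: {d2d:.3f} ms")
```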

Summary

At the interface level, we have achieved all of our predetermined goals:

  • Using frameworks such as TensorRT for model-specific acceleration
  • Avoiding frequent memory allocation: achieved through PyTorch's memory pool
  • Multiple instances and batching: achieved through a single-node scheduling backend
  • Optimizing data transmission: achieved by using torch.Tensor as the input and output carrier