# Parallel Inference of ResNet18

## Thread-Safe Local Inference

For convenience, let's assume that the TensorRT inference functionality is encapsulated as a "computational backend" named `TensorrtTensor`. Since the computation occurs on the GPU device, we add `SyncTensor` to represent the stream synchronization operation on the GPU.
| Configuration | Value | Description |
|---|---|---|
| backend | "SyncTensor[TensorrtTensor]" | The computational backend. TensorRT inference itself is not thread-safe. |
| max | 4 | The maximum batch size supported by the model, used for model conversion (ONNX -> TensorRT). |
A function is thread-safe if it can be called from multiple threads simultaneously. Providing a thread-safe interface greatly simplifies usage.

By default, TorchPipe wraps an extensible single-node scheduling backend around this "computational backend", which provides the following three basic capabilities:
**Thread safety of the forward interface**

**Multi-instance parallelism**

| Configuration | Default | Description |
|---|---|---|
| instance_num | 1 | Perform inference tasks in parallel with multiple model instances. |

**Batching**

For ResNet18, the model itself takes `-1x3x224x224` as input, where the leading -1 denotes a dynamic batch dimension. The larger the batch size, the more tasks can be completed per unit of hardware resources. The batch size is read from the "computational backend" (`TensorrtTensor`).

| Configuration | Default | Description |
|---|---|---|
| batching_timeout | 0 | The timeout in milliseconds. If a full batch of requests has not been received within this time, waiting is abandoned and the requests gathered so far are processed. |
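The ONNX file used in the examples below needs a dynamic batch dimension for batching to work. A minimal export sketch, assuming the stock ResNet18 from a recent torchvision; the input/output names, the file name `resnet18_-1x3x224x224.onnx` (which mirrors the input shape), and the opset version are illustrative choices, not fixed by TorchPipe:

```python
import torch
import torchvision

# Export ResNet18 to ONNX with a dynamic batch dimension (the "-1" in -1x3x224x224).
model = torchvision.models.resnet18(weights=None).eval()
dummy = torch.randn(1, 3, 224, 224)
torch.onnx.export(
    model,
    dummy,
    "resnet18_-1x3x224x224.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},  # batch dim is dynamic
    opset_version=13,
)
```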
Summarizing the above steps, we obtain the necessary parameters for inference of ResNet18 under TorchPipe:
**Python:**

```python
import torchpipe as tp
import torch
config = {
# Single-node scheduler parameters:
"instance_num": 2,
"batching_timeout": 5,
# Computational backend:
"backend": "SyncTensor[TensorrtTensor]",
# Computational backend parameters:
"model": "resnet18_-1x3x224x224.onnx",
"max": 4
}
# Initialization
models = tp.pipe(config)
data = torch.ones(1, 3, 224, 224).cuda()
# Forward
input = {"data": data}
models(input)  # <== can be called from multiple threads
result: torch.Tensor = input["result"]  # "result" is absent if the inference failed
```
#include "Interpreter.hpp"
#include "ATen/ATen.h"
int main() {
  // Configuration for the single-node scheduler and the computational backend
  std::unordered_map<std::string, std::string> config = {
      {"instance_num", "2"},
      {"batching_timeout", "5"},
      {"backend", "SyncTensor[TensorrtTensor]"},
      {"model", "resnet18_-1x3x224x224.onnx"},
      {"max", "4"}};

  // Initialize the interpreter model
  ipipe::Interpreter model(config);

  // Prepare input data on the current CUDA device (device index -1 = current device)
  auto input = std::make_shared<std::unordered_map<std::string, ipipe::any>>();
  (*input)["data"] = at::empty({1, 3, 224, 224},
                               at::TensorOptions().device(at::kCUDA, -1).dtype(at::kFloat));

  // Perform inference (can be called from multiple threads)
  model(input);

  // Get the output tensor; TASK_RESULT_KEY is absent if inference failed
  at::Tensor result = ipipe::any_cast<at::Tensor>(input->at(ipipe::TASK_RESULT_KEY));
  return 0;
}
```
With instance_num = 2 and max = 4 as above, the system can process at most instance_num * max = 8 requests simultaneously. If we want to support, say, a maximum of 10 concurrent client requests, instance_num should be raised to 3 (3 * 4 = 12 >= 10).
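To see the thread-safe forward in action, here is a minimal sketch that issues 10 concurrent requests from a thread pool; it assumes the `models` pipe created by `tp.pipe(config)` in the Python example above, and the `infer` helper is purely illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

import torch

def infer(_):
    # Each thread builds its own input dict; the forward call itself is thread-safe.
    inp = {"data": torch.ones(1, 3, 224, 224).cuda()}
    models(inp)  # `models` is the pipe created by tp.pipe(config) above
    return inp.get("result")  # None if the inference failed

with ThreadPoolExecutor(max_workers=10) as pool:
    outputs = list(pool.map(infer, range(10)))
```

Concurrent callers also help throughput: requests arriving within batching_timeout can be gathered into a single batch, improving GPU utilization.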
## Performance Trade-offs

Please note that our acceleration assumes the following:

- Copying data from CPU to CPU and from GPU to GPU within a single card is fast, consumes few resources, and can be ignored overall.

This assumption holds relative to data copying between heterogeneous devices and other computations. However, we will see later that in some special scenarios this assumption may not hold, and corresponding avoidance measures are required.
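As a rough illustration of why same-device copies can usually be ignored, here is a minimal micro-benchmark sketch comparing a GPU-to-GPU copy with a CPU-to-GPU transfer. It requires a CUDA device, and the absolute numbers vary by hardware:

```python
import time

import torch

x_cpu = torch.randn(4, 3, 224, 224)
x_gpu = x_cpu.cuda()

# GPU-to-GPU copy within a single card
torch.cuda.synchronize()
t0 = time.perf_counter()
for _ in range(100):
    _ = x_gpu.clone()
torch.cuda.synchronize()
print(f"GPU->GPU: {time.perf_counter() - t0:.4f}s")

# CPU-to-GPU copy across the PCIe bus
t0 = time.perf_counter()
for _ in range(100):
    _ = x_cpu.cuda()
torch.cuda.synchronize()
print(f"CPU->GPU: {time.perf_counter() - t0:.4f}s")
```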
## Summary
At the interface level, we have achieved all of our predetermined goals:
- Using frameworks such as TensorRT for model-specific acceleration
- Avoiding frequent memory allocation: achieved through PyTorch's memory pool
- Multiple instances and batching: achieved through a single-node scheduling backend
- Optimizing data transmission: achieved by using `torch.Tensor` as the input and output carrier