Performance Report
This article mainly presents the basic performance test experiments for torchpipe. Triton Inference Server is our main reference, but our goal is not to compete with it on performance. Performance differences between frameworks are sometimes difficult to compare directly, so in some experiments we deliberately adopt reasonably "unfair" comparisons to help users understand the characteristics of the two frameworks.
The chapter on Key Performance Evaluation Metrics for Services is required reading before this one.
Local Function Calls vs. RPC Calls
torchpipe mainly relies on thread-safe local calls and binds to the pytorch front-end interface, which is one of its major features. This makes it hard to compare its performance directly with frameworks such as triton that are strongly coupled* with RPC. We therefore adopt reasonably unfair comparisons. Nevertheless, we have made the test process as fair as reasonably possible, for example by completing all data preparation during initialization and reusing it.
In our experiments we found a certain performance gap between triton and thrift for RPC calls that carry large amounts of data and include serialization/deserialization. This makes it difficult to compare the scheduling and computation parts directly, so we take a comparative approach and first measure the performance of a pure RPC service as a baseline.
When the throughput of the pure RPC service is an order of magnitude higher than that of the pure compute-backend service, the impact of RPC on the overall service can be considered insignificant; otherwise, the RPC part may have a noticeable impact.
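As a rough sanity check, assume the RPC hop and the compute step simply add their per-request latencies, so throughputs combine as 1 / (1/QPS_rpc + 1/QPS_compute). With a pure-RPC baseline of 10,000 QPS and a compute backend of 700 QPS, the combined upper bound is about 1 / (1/10000 + 1/700) ≈ 654 QPS, a drop of under 7%; if the RPC baseline were only as fast as the backend, throughput would roughly halve. These numbers are illustrative, not measurements from the tables below.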
*Strong coupling: triton does provide a C-language local interface, but it is not the recommended usage for users.
Hello, World!
We send a HelloWorld message and expect the same string to be returned.
In testing we found that triton's handling of TYPE_STRING is not friendly: sending the same data as TYPE_STRING is noticeably slower than sending it as TYPE_UINT8. Therefore, in the following experiments, when we send 'HelloWorld' we use the UINT8 type recommended by triton, converting the string to its corresponding ASCII codes as UINT8.
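For illustration, a minimal sketch of this conversion on the client side; the tensor name "INPUT0", output name "OUTPUT0", and model name "helloworld" are placeholders rather than the names used in our test setup:

```python
import numpy as np
import tritonclient.http as httpclient

# "HelloWorld" as its ASCII codes, sent as UINT8 instead of TYPE_STRING.
payload = np.frombuffer(b"HelloWorld", dtype=np.uint8)

client = httpclient.InferenceServerClient(url="localhost:8000")
inp = httpclient.InferInput("INPUT0", [payload.size], "UINT8")  # placeholder tensor name
inp.set_data_from_numpy(payload)

result = client.infer(model_name="helloworld", inputs=[inp])    # placeholder model name
echoed = result.as_numpy("OUTPUT0").tobytes().decode("ascii")   # placeholder output name
print(echoed)  # "HelloWorld"
```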
Idle Performance
This section only serves as a background reference for raw speed; the performance differences here do not have a significant impact on actual business workloads (compute-intensive backends).
thrift uses multi-threaded mode, with python as the client/server language.
Client 1/Compute Backend Instance 1/Timeout 0/Max Batch 1
|  | python backend | c++ backend |
| --- | --- | --- |
| torchpipe | QPS: 65227 | QPS: 70671 |
| torchpipe+thrift | QPS: 10020 | QPS: 10444 |
| thrift | QPS: 14890 | - |
| triton-cli | QPS: 7342 | - |
Compared with the tens-of-thousands-level concurrency that RPC frameworks themselves aim for, the QPS of torchpipe and triton themselves is not high. Vision-domain services are compute-intensive and usually cannot be considered high-concurrency services; the actual single-card QPS is generally between 100 and 2000, so the idle performance ceiling measured above meets the requirement.
Multi-Client Pressure Test
This section only serves as a background reference for raw speed; the performance differences here do not have a significant impact on actual business workloads (compute-intensive backends). The gap narrows under multi-client pressure.
Client 10/Compute Backend Instance 1/Timeout 0/Max Batch 1
|  | python backend | c++ backend |
| --- | --- | --- |
| torchpipe | QPS: 46655 | QPS: 55949 |
| torchpipe-thrift | QPS: 13221 | QPS: 14977 |
| thrift | QPS: 17262 | - |
| triton-cli | QPS: 15039 | - |
Adaptivity to the Number of Clients during Batch Aggregation
When traffic is insufficient to fill max_batch, the system pays a timeout cost while aggregating batches. This section tests how each framework adapts to the number of clients under this condition.
This section uses the Python backends of torchpipe and triton.
Compute Backend Instance 1/Timeout 10ms/Max Batch 4
| Number of Clients | torchpipe (ms) | triton (ms) |
| --- | --- | --- |
| 1 | TP50: 0.25 | TP50: 10.5 |
| 2 | TP50: 10.25 | TP50: 10.57 |
| 4 | TP50: 0.34 | TP50: 0.5 |
| 10 | TP50: 0.82 | TP50: 0.97 |
The experiment verifies that the conservative adaptive strategy adopted by torchpipe can reduce the waiting time for batch aggregation in certain situations.
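To illustrate the idea (this is a simplified sketch, not torchpipe's actual scheduler), a conservative aggregation loop can stop waiting as soon as no further requests are likely to arrive, instead of always paying the full timeout:

```python
import queue
import time

def collect_batch(req_queue, max_batch, timeout_s, expected_pending):
    """Gather up to max_batch requests from req_queue.

    Conservative strategy: keep waiting only while expected_pending()
    (an assumed callback estimating how many requests may still arrive,
    e.g. active clients without an in-flight response) suggests the
    batch can still grow; otherwise dispatch the partial batch at once.
    """
    batch = [req_queue.get()]  # block for the first request
    deadline = time.monotonic() + timeout_s
    while len(batch) < max_batch:
        if expected_pending() < 1:  # nobody else is likely to send soon
            break                   # dispatch immediately, skip the timeout
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(req_queue.get(timeout=remaining))
        except queue.Empty:
            break
    return batch
```

With a single client there is never a second pending request, so such a loop returns right after the first item, which matches the sub-millisecond TP50 measured above.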
Efficiency of Batch Aggregation when max_batch=1
This tests system efficiency when the maximum batch size is 1; in theory, the batch aggregation step should add no overhead in this case.
This section uses the Python backends of torchpipe and triton.
Compute Backend Instance 1/Timeout 10ms/Max Batch 1
| Number of Clients | torchpipe (ms) | triton (ms) |
| --- | --- | --- |
| 1 | TP50: 0.09 | TP50: 0.23 |
| 10 | TP50: 0.64 | TP50: 1.26 |
Concurrency Performance of Python Backend (CPU)
torchpipe natively supports multi-threading, but for Python compute backends this mode has significant drawbacks due to the GIL.
| Type | Mode | Remarks |
| --- | --- | --- |
| GPU backend | Multi-threading | Running multiple processes instead requires virtualization and has issues with shared-memory transfer. |
| CPU backend without the GIL | Multi-threading available | - |
| CPU backend constrained by the GIL | Multi-process/RPC | Use the multiprocessing/joblib libraries in the Python backend, or bridge a third-party RPC framework. |
The following table verifies the effectiveness of multi-process mode in working around the GIL (the backend performs a cumulative sum from 1 to 100000 in pure Python):
Client 10/Compute Backend Instance 10/Timeout 0/Max Batch 1
| torchpipe (Python backend) | triton | torchpipe-multiprocess |
| --- | --- | --- |
| QPS: 233 | QPS: 814 | QPS: 872 |
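For reference, a minimal sketch of this workload and the multi-process workaround; the pool size and task count are arbitrary, and this only approximates the test rather than reproducing the exact backend wiring:

```python
import time
from multiprocessing import Pool

def accumulate(_):
    # The pure-Python workload from the test: sum the integers 1..100000.
    total = 0
    for i in range(1, 100001):
        total += i
    return total

if __name__ == "__main__":
    n_tasks = 1000
    # Threads running this loop are serialized by the GIL; a process pool
    # runs one interpreter per worker and therefore scales with CPU cores.
    with Pool(processes=10) as pool:
        start = time.perf_counter()
        pool.map(accumulate, range(n_tasks))
        elapsed = time.perf_counter() - start
    print(f"QPS ~ {n_tasks / elapsed:.0f}")
```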
Comparison of Pure CPU Backend Efficiency
This section tests the performance of JPEG decoding. The input is the original, undecoded JPEG image, and the output is a uint8 tensor of size 640x444x3.
In this experiment we explicitly exclude the performance differences introduced by the RPC framework itself: the client sends us the data Hello, world!, the server prepares the JPEG data itself and calls

- Python backend: cv2.imdecode(raw_jpg, cv2.IMREAD_COLOR)
- C++ backend: cv::Mat cv::imdecode(cv::InputArray buf, int flags)

to decode it, and then returns Hello, world! to the client.
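For reference, the Python-side decode step amounts to the following; the file path is a placeholder, and any baseline JPEG that decodes to 640x444x3 serves the purpose:

```python
import cv2
import numpy as np

# Placeholder path; the benchmark prepares one JPEG at initialization and reuses it.
raw_jpg = np.fromfile("test.jpg", dtype=np.uint8)

img = cv2.imdecode(raw_jpg, cv2.IMREAD_COLOR)  # decode to an HxWx3 uint8 BGR array
print(img.shape, img.dtype)                    # expected: (640, 444, 3) uint8
```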
Client 1/Compute Backend Instance 1/Timeout 0/Max Batch 1
| torchpipe-thrift (C++ backend) | torchpipe-thrift (Python backend) | torchpipe-thrift-multiprocess | triton-cli |
| --- | --- | --- | --- |
| QPS: 426 | QPS: 406 | QPS: - | QPS: 339 |
Client 10/Compute Backend Instance 10/Timeout 0/Max Batch 1
| torchpipe-thrift (C++ backend) | torchpipe-thrift (Python backend) | torchpipe-thrift-multiprocess | triton-cli |
| --- | --- | --- | --- |
| QPS: 1599 | QPS: 1334 | QPS: - | QPS: 1440 |
The experimental results match expectations: efficiency is similar across the different approaches. With ten clients, the ordering is torchpipe-thrift (C++ backend) > triton-cli > torchpipe-thrift (Python backend).
ResNet50
This section tests the efficiency of serving a ResNet50 model with TensorRT under normal conditions. The client input is 224x224x3 image data, and the output is 1x1000. As a control, we also provide a pure-RPC comparison: in that case the server prepares the data in advance and performs no computation.
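For context, the QPS and TP50 figures in this report can be reproduced with a client-side timing loop of roughly the following shape; send is a placeholder for a single request (a thrift RPC, a triton client call, or a local torchpipe call), not an actual API:

```python
import time

def pressure_test(send, n_requests=10000):
    """Measure QPS and TP50 (median latency, ms) for one client.

    `send` is a placeholder callable that performs a single request.
    """
    latencies = []
    start = time.perf_counter()
    for _ in range(n_requests):
        t0 = time.perf_counter()
        send()
        latencies.append((time.perf_counter() - t0) * 1000.0)
    total = time.perf_counter() - start
    latencies.sort()
    return n_requests / total, latencies[len(latencies) // 2]
```

Multi-client figures are typically obtained by running several such loops concurrently.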
Basic Test
Client 1/Compute Backend Instance 1/Timeout 0/Max Batch 1
| torchpipe-thrift-rpc | triton-rpc | torchpipe-thrift | triton-cli | Remarks |
| --- | --- | --- | --- | --- |
| QPS: 7519 | QPS: 1819 | QPS: 381 | QPS: 358 | - |
Client 10/Compute Backend Instance 10/Timeout 0/Max Batch 1
| torchpipe-thrift-rpc | triton-rpc | torchpipe-thrift | triton-cli | Remarks |
| --- | --- | --- | --- | --- |
| QPS: 11938 | QPS: 3355 | QPS: 720 | QPS: 722 | - |
Batch Aggregation
Client 10/Compute Backend Instance 2/Timeout 5ms/Max Batch 4
| torchpipe-thrift | triton-cli | Remarks |
| --- | --- | --- |
| QPS: 907 | QPS: 898 | - |
Real-world Testing
In real-world scenarios, Triton typically needs to chain multiple nodes/models through RPC calls to the server, whereas TorchPipe is used through local calls. Let's look at the performance comparison between the two approaches.
Basic Test
In the following test, both Triton and TorchPipe contain only one ResNet18 model, accelerated with TensorRT in FP32. The input data has shape 3x224x224.
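For context, a rough sketch of what the local-call style looks like on the torchpipe side; the configuration keys ("backend", "instance_num", "max") and the ONNX file name below are assumptions based on torchpipe's documented conventions, so treat this as illustrative rather than the exact benchmark code:

```python
import torch
import torchpipe

# Assumed configuration; key names follow torchpipe's documented conventions
# and may not match the exact benchmark setup.
model = torchpipe.pipe({
    "model": "resnet18_fp32.onnx",  # hypothetical ONNX export of ResNet18
    "backend": "Sequential[TensorrtTensor,SyncTensor]",
    "instance_num": 1,
    "max": 1,
})

data = torch.rand(1, 3, 224, 224)  # 3x224x224 input with a batch dimension
inp = {"data": data}
model(inp)                          # thread-safe local call; result is written in place
print(inp["result"].shape)          # expected: 1x1000
```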
Client 1/Compute Backend Instance 1/Timeout 0/Max Batch 1
| torchpipe | triton-rpc | Remarks |
| --- | --- | --- |
| QPS: 559 | QPS: 462 | - |
Client 10/Compute Backend Instance 1/Timeout 0/Max Batch 1
| torchpipe | triton-rpc | Remarks |
| --- | --- | --- |
| QPS: 1303 | QPS: 907 | - |
As expected, TorchPipe's local calling method has certain advantages in single-card scenarios.