
Performance Report

This article presents the basic performance benchmarks of torchpipe. Triton Inference Server is our main reference, but our goal is not to compete with it on performance. Performance differences between frameworks are sometimes hard to compare directly, and in some experiments we deliberately adopt reasonable "unfair" comparisons to help users understand the characteristics of the two frameworks.

note

The chapter on Key Performance Evaluation Metrics for Services is required reading before this article.

Local Function vs. RPC Call

torchpipe relies mainly on thread-safe local calls and binds to the pytorch front-end interface; this is one of its major features. Comparing its performance with frameworks such as triton, which are strongly coupled with RPC, is not straightforward, so we resort to reasonable "unfair" comparisons. Nevertheless, we have made the test procedure as fair as reasonably possible, for example by completing all data preparation during initialization and reusing it.
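
For readers unfamiliar with the local-call style, the following is a minimal sketch of what a thread-safe in-process call looks like through torchpipe's dict-based pytorch front end; the backend name ("Identity") and parameter values are illustrative assumptions and should be checked against the torchpipe documentation.

```python
import torch
import torchpipe

# Build an in-process pipeline. "Identity" is used here only as a placeholder
# backend that echoes its input; consult the torchpipe docs for real backends.
pipeline = torchpipe.pipe({"backend": "Identity", "instance_num": "1"})

# The call is a plain, thread-safe function call: the result is written back
# into the same dict, so no serialization or network hop is involved.
request = {"data": torch.ones(3, 224, 224)}
pipeline(request)
print(request["result"].shape)
```

Because the call stays in-process, there is no serialization or network hop, which is exactly the difference the "unfair" comparisons below try to account for.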

In our experiments we found a noticeable performance gap between triton and thrift for RPC calls that carry large payloads and therefore involve heavy serialization/deserialization. This makes it difficult to compare the scheduling and compute parts directly, so we take a comparative approach and first measure the performance of pure RPC services as a baseline.

info

When the throughput of the pure RPC service is an order of magnitude higher than that of the pure compute-backend service, the impact of RPC on the overall service can be considered insignificant; otherwise, the RPC layer may affect the results noticeably.

  • Strong coupling: triton does provide a C-language in-process interface, but it is not the recommended way for users to integrate.

Hello, World!

We send a "HelloWorld" message and expect the server to return the same string.

In testing, we found that triton does not handle TYPE_STRING efficiently: sending the same data as TYPE_STRING is markedly slower than sending it as TYPE_UINT8. Therefore, in the following experiments, when we send "HelloWorld" we use the UINT8 type recommended by triton, converting the string into the corresponding ASCII byte values.
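
As an illustration, the snippet below shows how the "HelloWorld" string can be converted into a UINT8 array on the client side; the model name, input name, and output name are placeholders that must match the actual triton model configuration.

```python
import numpy as np
import tritonclient.http as httpclient

# Encode the string as raw ASCII bytes so it can be sent as TYPE_UINT8.
payload = np.frombuffer(b"HelloWorld", dtype=np.uint8)  # shape (10,)

client = httpclient.InferenceServerClient(url="localhost:8000")

# "identity_model", "INPUT0" and "OUTPUT0" are placeholders; they must match
# the names declared in the model's config.pbtxt.
infer_input = httpclient.InferInput("INPUT0", [payload.size], "UINT8")
infer_input.set_data_from_numpy(payload)

response = client.infer(model_name="identity_model", inputs=[infer_input])
print(response.as_numpy("OUTPUT0").tobytes().decode("ascii"))
```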

Idle Performance

This section only provides a speed baseline for reference. The performance differences shown here have little impact on real workloads, which use compute-intensive backends.

thrift runs in multi-threaded mode, with Python as both the client and server language.

Client 1/Compute Backend Instance 1/Timeout 0/Max Batch 1

|                  | python backend | c++ backend |
| ---------------- | -------------- | ----------- |
| torchpipe        | QPS: 65227     | QPS: 70671  |
| torchpipe+thrift | QPS: 10020     | QPS: 10444  |
| thrift           | QPS: 14890     | -           |
| triton-cli       | QPS: 7342      | -           |

Compared with the tens of thousands of QPS expected of an RPC framework itself, the standalone QPS of torchpipe and triton is not high. However, computer-vision services are compute-intensive and usually cannot be considered high-concurrency services: actual single-card QPS is generally between 100 and 2000, so the idle performance ceilings measured above are sufficient.

Multi-Client Pressure Test

This section only provides a speed baseline for reference. The performance differences shown here have little impact on real workloads, which use compute-intensive backends.

The gap narrows under client-side pressure.

Client 10/Compute Backend Instance 1/Timeout 0/Max Batch 1

|                  | python backend | c++ backend |
| ---------------- | -------------- | ----------- |
| torchpipe        | QPS: 46655     | QPS: 55949  |
| torchpipe-thrift | QPS: 13221     | QPS: 14977  |
| thrift           | QPS: 17262     | -           |
| triton-cli       | QPS: 15039     | -           |

Adaptive Batch Aggregation under Different Numbers of Clients

When traffic is insufficient to fill max_batch, the system pays a timeout cost while aggregating batches. This section tests how adaptively each framework behaves under different numbers of clients in this regime.

This section uses the Python backend of torchpipe and triton.

Compute Backend Instance 1/Timeout 10ms/Max Batch 4

| Number of Clients | torchpipe   | triton      |
| ----------------- | ----------- | ----------- |
| 1                 | TP50: 0.25  | TP50: 10.5  |
| 2                 | TP50: 10.25 | TP50: 10.57 |
| 4                 | TP50: 0.34  | TP50: 0.5   |
| 10                | TP50: 0.82  | TP50: 0.97  |

The experiment confirms that the conservative adaptive strategy adopted by torchpipe can reduce the waiting time for batch aggregation in certain situations.
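
The sketch below is not torchpipe's actual scheduler; it is only a simplified illustration of why a conservative strategy helps: if the requests currently in flight cannot possibly fill max_batch, waiting out the timeout buys nothing, so the batch is dispatched immediately.

```python
import queue
import time

def aggregate_batch(request_queue, in_flight, max_batch=4, timeout_s=0.010):
    """Simplified, illustrative batching loop -- not torchpipe's real code.

    request_queue -- a queue.Queue of incoming requests
    in_flight     -- callable returning how many further requests are active
    """
    batch = [request_queue.get()]               # block until the first request
    deadline = time.monotonic() + timeout_s
    while len(batch) < max_batch:
        # Conservative rule: if the requests still in flight cannot fill the
        # batch anyway, waiting out the timeout buys nothing -- dispatch now.
        if in_flight() < max_batch - len(batch):
            break
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(request_queue.get(timeout=remaining))
        except queue.Empty:
            break
    return batch
```

Under this kind of rule, a single client with max_batch 4 never waits for the 10ms timeout, which is consistent with the TP50 values in the table above.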

Efficiency of Batch Aggregation when max_batch=1

This tests system efficiency when the maximum batch size is 1; in theory, the batch-aggregation step should be skipped entirely in this case.

This section uses the Python backend of torchpipe and triton.

Compute Backend Instance 1/Timeout 10ms/Max Batch 1

| Number of Clients | torchpipe  | triton     |
| ----------------- | ---------- | ---------- |
| 1                 | TP50: 0.09 | TP50: 0.23 |
| 10                | TP50: 0.64 | TP50: 1.26 |

Concurrency Performance of Python Backend (CPU)

torchpipe natively supports multi-threading, but for Python compute backends this mode has significant drawbacks because of the GIL.

| Type                         | Mode              | Remarks |
| ---------------------------- | ----------------- | ------- |
| GPU backend                  | Multi-threading   | Virtualization is required under multiple processes, and there are problems with shared-memory transfer. |
| CPU backend without GIL lock | Multi-threading   | Available |
| CPU backend with GIL lock    | Multi-process/RPC | Use the multiprocessing/joblib library in the Python backend, or bridge third-party RPC frameworks. |

The following table verifies the effectiveness of multi-process mode in working around the GIL (the backend computes the cumulative sum of 1 to 100000 in Python); a sketch of the multi-process approach follows the table.

Client 10/Compute Backend Instance 10/Timeout 0/Max Batch 1

| torchpipe (Python backend) | triton   | torchpipe-multiprocess |
| -------------------------- | -------- | ---------------------- |
| QPS: 233                   | QPS: 814 | QPS: 872               |
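
For reference, the CPU-bound workload used above (summing 1 to 100000) can be offloaded to a process pool roughly as sketched below; the exact integration with the torchpipe backend is omitted here.

```python
from multiprocessing import Pool

def cumulative_sum(_request_id: int) -> int:
    # The CPU-bound work from the benchmark: sum the integers 1..100000.
    total = 0
    for i in range(1, 100001):
        total += i
    return total

if __name__ == "__main__":
    # Each worker process has its own interpreter, so the GIL of the main
    # process no longer serializes the computation.
    with Pool(processes=10) as pool:
        results = pool.map(cumulative_sum, range(100))
    print(results[0])  # 5000050000
```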

Comparison of Pure CPU Backend Efficiency

This section tests the performance of JPEG decoding. The input data is the original, undecoded JPEG image, and the output is a uint8 tensor of size 640x444x3.

In this section of the experiment, we explicitly exclude the performance difference brought by the RPC framework itself: the client sends us the data Hello, world!, we prepare the JPEG data ourselves on the server, and call

cv2.imdecode(raw_jpg, cv2.IMREAD_COLOR)

to decode, and then return Hello, world! to the client.
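
A minimal, self-contained version of this server-side decoding workload looks roughly like the following; the image path and the number of iterations are placeholders.

```python
import time
import cv2
import numpy as np

# Placeholder path: any JPEG that decodes to roughly 640x444x3 will do.
with open("sample_640x444.jpg", "rb") as f:
    raw_jpg = np.frombuffer(f.read(), dtype=np.uint8)

iterations = 1000
start = time.perf_counter()
for _ in range(iterations):
    img = cv2.imdecode(raw_jpg, cv2.IMREAD_COLOR)  # returns an HxWx3 uint8 array
elapsed = time.perf_counter() - start
print(f"single-thread decode QPS: {iterations / elapsed:.1f}")
```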

Client 1/Compute Backend Instance 1/Timeout 0/Max Batch 1

| torchpipe-thrift (C++ backend) | torchpipe-thrift (Python backend) | torchpipe-thrift-multiprocess | triton-cli |
| ------------------------------ | --------------------------------- | ----------------------------- | ---------- |
| QPS: 426                       | QPS: 406                          | QPS: -                        | QPS: 339   |

Client 10/Compute Backend Instance 10/Timeout 0/Max Batch 1

| torchpipe-thrift (C++ backend) | torchpipe-thrift (Python backend) | torchpipe-thrift-multiprocess | triton-cli |
| ------------------------------ | --------------------------------- | ----------------------------- | ---------- |
| QPS: 1599                      | QPS: 1334                         | QPS: -                        | QPS: 1440  |

The experimental results match expectations: efficiency is similar across the different methods. With ten clients, the ordering is torchpipe-thrift (C++ backend) > triton-cli > torchpipe-thrift (Python backend).

ResNet50

This section tests the efficiency of serving the ResNet50 model with TensorRT under typical conditions. The client input is a 224x224x3 image and the output is a 1x1000 tensor. As a control, we also provide a pure-RPC comparison: in that case the server prepares the data in advance and performs no computation.

Basic Test

Client 1/Compute Backend Instance 1/Timeout 0/Max Batch 1

| torchpipe-thrift-rpc | triton-rpc | torchpipe-thrift | triton-cli | Remarks |
| -------------------- | ---------- | ---------------- | ---------- | ------- |
| QPS: 7519            | QPS: 1819  | QPS: 381         | QPS: 358   | -       |

Client 10/Compute Backend Instance 10/Timeout 0/Max Batch 1

| torchpipe-thrift-rpc | triton-rpc | torchpipe-thrift | triton-cli | Remarks |
| -------------------- | ---------- | ---------------- | ---------- | ------- |
| QPS: 11938           | QPS: 3355  | QPS: 720         | QPS: 722   | -       |

Batch Aggregation

Client 10/Compute Backend Instance 2/Timeout 5ms/Max Batch 4

| torchpipe-thrift | triton-cli | Remarks |
| ---------------- | ---------- | ------- |
| QPS: 907         | QPS: 898   | -       |
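
For orientation, a configuration corresponding to this setup might look roughly like the sketch below when expressed as a Python dict passed to torchpipe.pipe; the backend name and the exact key spellings (for example instance_num, batching_timeout, max) are taken from typical torchpipe examples and should be verified against the current documentation.

```python
import torchpipe

# Hypothetical sketch of a TensorRT node with 2 instances, a 5 ms batching
# timeout and a maximum aggregated batch of 4 (matching the table above).
config = {
    "resnet50": {
        "backend": "SyncTensor[TensorrtTensor]",  # placeholder backend chain
        "model": "./resnet50.onnx",               # placeholder model path
        "instance_num": "2",
        "batching_timeout": "5",
        "max": "4",
    }
}
pipeline = torchpipe.pipe(config)
```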

Real-world Testing

In real-world scenarios, triton often needs to chain multiple nodes/models through RPC calls to the server, whereas torchpipe is called locally. Below we compare the performance of the two approaches.

Basic Test

In the following test, both triton and torchpipe serve a single ResNet18 model accelerated with FP32 TensorRT. The input data has shape 3x224x224.

Client 1/Compute Backend Instance 1/Timeout 0/Max Batch 1

| torchpipe | triton-rpc | Remarks |
| --------- | ---------- | ------- |
| QPS: 559  | QPS: 462   | -       |

Client 10/Compute Backend Instance 1/Timeout 0/Max Batch 1

| torchpipe | triton-rpc | Remarks |
| --------- | ---------- | ------- |
| QPS: 1303 | QPS: 907   | -       |

As expected, torchpipe's local calling approach has a clear advantage in single-card scenarios.