
Performance Report

This article presents the basic performance benchmarks of torchpipe. Triton Inference Server is our main reference, but our goal is not to compete with it on performance. Performance differences between frameworks are sometimes hard to compare directly, and in some experiments we deliberately adopt reasonable "unfair" comparisons to help users understand the characteristics of the two frameworks.

note

The chapter on Key Performance Evaluation Metrics for Services is required reading before this article.

Local Function vs. RPC Call

torchpipe relies mainly on thread-safe local calls and binds to the pytorch front-end interface; this is one of its major features. Comparing its performance with frameworks such as triton, which are strongly coupled with RPC, is not straightforward, so we resort to reasonable "unfair" comparisons. Nevertheless, we have made the test procedure as fair as reasonably possible, for example by completing all data preparation during initialization and reusing it.
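
For readers unfamiliar with the local-call style, the following is a minimal sketch of what a thread-safe in-process call looks like through torchpipe's dict-based pytorch front end; the backend name ("Identity") and parameter values are illustrative assumptions and should be checked against the torchpipe documentation.

```python
import torch
import torchpipe

# Build an in-process pipeline. "Identity" is used here only as a placeholder
# backend that echoes its input; consult the torchpipe docs for real backends.
pipeline = torchpipe.pipe({"backend": "Identity", "instance_num": "1"})

# The call is a plain, thread-safe function call: the result is written back
# into the same dict, so no serialization or network hop is involved.
request = {"data": torch.ones(3, 224, 224)}
pipeline(request)
print(request["result"].shape)
```

Because the call stays in-process, there is no serialization or network hop, which is exactly the difference the "unfair" comparisons below try to account for.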

In our experiments we found a noticeable performance gap between triton and thrift for RPC calls that carry large payloads and therefore involve heavy serialization/deserialization. This makes it difficult to compare the scheduling and compute parts directly, so we take a comparative approach and first measure the performance of pure RPC services as a baseline.

info

When the throughput of the pure RPC service is an order of magnitude higher than that of the pure compute-backend service, the impact of RPC on the overall service can be considered insignificant; otherwise, the RPC layer may affect the results noticeably.

  • Strong coupling: triton does provide a C-language in-process interface, but it is not the recommended way for users to integrate.

Hello, World!

We send a "HelloWorld" message and expect the server to return the same string.

In testing, we found that triton does not handle TYPE_STRING efficiently: sending the same data as TYPE_STRING is markedly slower than sending it as TYPE_UINT8. Therefore, in the following experiments, when we send "HelloWorld" we use the UINT8 type recommended by triton, converting the string into the corresponding ASCII byte values.
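
As an illustration, the snippet below shows how the "HelloWorld" string can be converted into a UINT8 array on the client side; the model name, input name, and output name are placeholders that must match the actual triton model configuration.

```python
import numpy as np
import tritonclient.http as httpclient

# Encode the string as raw ASCII bytes so it can be sent as TYPE_UINT8.
payload = np.frombuffer(b"HelloWorld", dtype=np.uint8)  # shape (10,)

client = httpclient.InferenceServerClient(url="localhost:8000")

# "identity_model", "INPUT0" and "OUTPUT0" are placeholders; they must match
# the names declared in the model's config.pbtxt.
infer_input = httpclient.InferInput("INPUT0", [payload.size], "UINT8")
infer_input.set_data_from_numpy(payload)

response = client.infer(model_name="identity_model", inputs=[infer_input])
print(response.as_numpy("OUTPUT0").tobytes().decode("ascii"))
```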

Idle Performance

This section only provides a speed baseline for reference. The performance differences shown here have little impact on real workloads, which use compute-intensive backends.

thrift runs in multi-threaded mode, with Python as both the client and server language.

Client 1/Compute Backend Instance 1/Timeout 0/Max Batch 1

|                  | python backend | c++ backend |
| ---------------- | -------------- | ----------- |
| torchpipe        | QPS: 65227     | QPS: 70671  |
| torchpipe+thrift | QPS: 10020     | QPS: 10444  |
| thrift           | QPS: 14890     | -           |
| triton-cli       | QPS: 7342      | -           |

Compared with the tens of thousands of QPS expected of an RPC framework itself, the standalone QPS of torchpipe and triton is not high. However, computer-vision services are compute-intensive and usually cannot be considered high-concurrency services: actual single-card QPS is generally between 100 and 2000, so the idle performance ceilings measured above are sufficient.

Multi-Client Pressure Test

This section only provides a speed baseline for reference. The performance differences shown here have little impact on real workloads, which use compute-intensive backends.

The gap narrows under client-side pressure.

Client 10/Compute Backend Instance 1/Timeout 0/Max Batch 1

|                  | python backend | c++ backend |
| ---------------- | -------------- | ----------- |
| torchpipe        | QPS: 46655     | QPS: 55949  |
| torchpipe-thrift | QPS: 13221     | QPS: 14977  |
| thrift           | QPS: 17262     | -           |
| triton-cli       | QPS: 15039     | -           |

Adaptive Batch Aggregation under Different Numbers of Clients

When traffic is insufficient to fill max_batch, the system pays a timeout cost while aggregating batches. This section tests how adaptively each framework behaves under different numbers of clients in this regime.

This section uses the Python backend of torchpipe and triton.

Compute Backend Instance 1/Timeout 10ms/Max Batch 4

| Number of Clients | torchpipe   | triton      |
| ----------------- | ----------- | ----------- |
| 1                 | TP50: 0.25  | TP50: 10.5  |
| 2                 | TP50: 10.25 | TP50: 10.57 |
| 4                 | TP50: 0.34  | TP50: 0.5   |
| 10                | TP50: 0.82  | TP50: 0.97  |

The experiment confirms that the conservative adaptive strategy adopted by torchpipe can reduce the waiting time for batch aggregation in certain situations.
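
The sketch below is not torchpipe's actual scheduler; it is only a simplified illustration of why a conservative strategy helps: if the requests currently in flight cannot possibly fill max_batch, waiting out the timeout buys nothing, so the batch is dispatched immediately.

```python
import queue
import time

def aggregate_batch(request_queue, in_flight, max_batch=4, timeout_s=0.010):
    """Simplified, illustrative batching loop -- not torchpipe's real code.

    request_queue -- a queue.Queue of incoming requests
    in_flight     -- callable returning how many further requests are active
    """
    batch = [request_queue.get()]               # block until the first request
    deadline = time.monotonic() + timeout_s
    while len(batch) < max_batch:
        # Conservative rule: if the requests still in flight cannot fill the
        # batch anyway, waiting out the timeout buys nothing -- dispatch now.
        if in_flight() < max_batch - len(batch):
            break
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(request_queue.get(timeout=remaining))
        except queue.Empty:
            break
    return batch
```

Under this kind of rule, a single client with max_batch 4 never waits for the 10ms timeout, which is consistent with the TP50 values in the table above.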

Efficiency of Batch Aggregation when max_batch=1

This tests system efficiency when the maximum batch size is 1; in theory, the batch-aggregation step should be skipped entirely in this case.

This section uses the Python backend of torchpipe and triton.

Compute Backend Instance 1/Timeout 10ms/Max Batch 1

| Number of Clients | torchpipe  | triton     |
| ----------------- | ---------- | ---------- |
| 1                 | TP50: 0.09 | TP50: 0.23 |
| 10                | TP50: 0.64 | TP50: 1.26 |

Concurrency Performance of Python Backend (CPU)

torchpipe natively supports multi-threading, but for Python compute backends this mode has significant drawbacks because of the GIL.

| Type                         | Mode              | Remarks |
| ---------------------------- | ----------------- | ------- |
| GPU backend                  | Multi-threading   | Virtualization is required under multiple processes, and there are problems with shared-memory transfer. |
| CPU backend without GIL lock | Multi-threading   | Available |
| CPU backend with GIL lock    | Multi-process/RPC | Use the multiprocessing/joblib library in the Python backend, or bridge third-party RPC frameworks. |

The following table verifies the effectiveness of multi-process mode in working around the GIL (the backend computes the cumulative sum of 1 to 100000 in Python); a sketch of the multi-process approach follows the table.

Client 10/Compute Backend Instance 10/Timeout 0/Max Batch 1

| torchpipe (Python backend) | triton   | torchpipe-multiprocess |
| -------------------------- | -------- | ---------------------- |
| QPS: 233                   | QPS: 814 | QPS: 872               |
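
For reference, the CPU-bound workload used above (summing 1 to 100000) can be offloaded to a process pool roughly as sketched below; the exact integration with the torchpipe backend is omitted here.

```python
from multiprocessing import Pool

def cumulative_sum(_request_id: int) -> int:
    # The CPU-bound work from the benchmark: sum the integers 1..100000.
    total = 0
    for i in range(1, 100001):
        total += i
    return total

if __name__ == "__main__":
    # Each worker process has its own interpreter, so the GIL of the main
    # process no longer serializes the computation.
    with Pool(processes=10) as pool:
        results = pool.map(cumulative_sum, range(100))
    print(results[0])  # 5000050000
```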

Comparison of Pure CPU Backend Efficiency

This section tests the performance of JPEG decoding. The input data is the original, undecoded JPEG image, and the output is a uint8 tensor of size 640x444x3.

In this section of the experiment, we explicitly exclude the performance difference brought by the RPC framework itself: the client sends us the data Hello, world!, we prepare the JPEG data ourselves on the server, and call

cv2.imdecode(raw_jpg, cv2.IMREAD_COLOR)

to decode, and then return Hello, world! to the client.
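
A minimal, self-contained version of this server-side decoding workload looks roughly like the following; the image path and the number of iterations are placeholders.

```python
import time
import cv2
import numpy as np

# Placeholder path: any JPEG that decodes to roughly 640x444x3 will do.
with open("sample_640x444.jpg", "rb") as f:
    raw_jpg = np.frombuffer(f.read(), dtype=np.uint8)

iterations = 1000
start = time.perf_counter()
for _ in range(iterations):
    img = cv2.imdecode(raw_jpg, cv2.IMREAD_COLOR)  # returns an HxWx3 uint8 array
elapsed = time.perf_counter() - start
print(f"single-thread decode QPS: {iterations / elapsed:.1f}")
```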

Client 1/Compute Backend Instance 1/Timeout 0/Max Batch 1

| torchpipe-thrift (C++ backend) | torchpipe-thrift (Python backend) | torchpipe-thrift-multiprocess | triton-cli |
| ------------------------------ | --------------------------------- | ----------------------------- | ---------- |
| QPS: 426                       | QPS: 406                          | QPS: -                        | QPS: 339   |

Client 10/Compute Backend Instance 10/Timeout 0/Max Batch 1

| torchpipe-thrift (C++ backend) | torchpipe-thrift (Python backend) | torchpipe-thrift-multiprocess | triton-cli |
| ------------------------------ | --------------------------------- | ----------------------------- | ---------- |
| QPS: 1599                      | QPS: 1334                         | QPS: -                        | QPS: 1440  |

The experimental results match expectations: efficiency is similar across the different methods. With ten clients, the ordering is torchpipe-thrift (C++ backend) > triton-cli > torchpipe-thrift (Python backend).

ResNet50

This section tests the efficiency of serving the ResNet50 model with TensorRT under typical conditions. The client input is a 224x224x3 image and the output is a 1x1000 tensor. As a control, we also provide a pure-RPC comparison: in that case the server prepares the data in advance and performs no computation.

Basic Test

Client 1/Compute Backend Instance 1/Timeout 0/Max Batch 1

| torchpipe-thrift-rpc | triton-rpc | torchpipe-thrift | triton-cli | Remarks |
| -------------------- | ---------- | ---------------- | ---------- | ------- |
| QPS: 7519            | QPS: 1819  | QPS: 381         | QPS: 358   | -       |

Client 10/Compute Backend Instance 10/Timeout 0/Max Batch 1

| torchpipe-thrift-rpc | triton-rpc | torchpipe-thrift | triton-cli | Remarks |
| -------------------- | ---------- | ---------------- | ---------- | ------- |
| QPS: 11938           | QPS: 3355  | QPS: 720         | QPS: 722   | -       |

Batch Aggregation

Client 10/Compute Backend Instance 2/Timeout 5ms/Max Batch 4

| torchpipe-thrift | triton-cli | Remarks |
| ---------------- | ---------- | ------- |
| QPS: 907         | QPS: 898   | -       |
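
For orientation, a configuration corresponding to this setup might look roughly like the sketch below when expressed as a Python dict passed to torchpipe.pipe; the backend name and the exact key spellings (for example instance_num, batching_timeout, max) are taken from typical torchpipe examples and should be verified against the current documentation.

```python
import torchpipe

# Hypothetical sketch of a TensorRT node with 2 instances, a 5 ms batching
# timeout and a maximum aggregated batch of 4 (matching the table above).
config = {
    "resnet50": {
        "backend": "SyncTensor[TensorrtTensor]",  # placeholder backend chain
        "model": "./resnet50.onnx",               # placeholder model path
        "instance_num": "2",
        "batching_timeout": "5",
        "max": "4",
    }
}
pipeline = torchpipe.pipe(config)
```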

Real-world Testing

In real-world scenarios, triton often needs to chain multiple nodes/models through RPC calls to the server, whereas torchpipe is called locally. Below we compare the performance of the two approaches.

Basic Test

In the following test, both triton and torchpipe serve a single ResNet18 model accelerated with FP32 TensorRT. The input data has shape 3x224x224.

Client 1/Compute Backend Instance 1/Timeout 0/Max Batch 1

| torchpipe | triton-rpc | Remarks |
| --------- | ---------- | ------- |
| QPS: 559  | QPS: 462   | -       |

Client 10/Compute Backend Instance 1/Timeout 0/Max Batch 1

| torchpipe | triton-rpc | Remarks |
| --------- | ---------- | ------- |
| QPS: 1303 | QPS: 907   | -       |

As expected, torchpipe's local calling approach has a clear advantage in single-card scenarios.