CUDA: Streams and Concurrency

CUDA provides a consistent set of abstractions for controlling concurrent access, allowing users to fully utilize the resources of a single GPU device.

To effectively utilize GPU devices, we aim to:

  • Increase the number of business requests that can be handled by a single hardware resource
  • Maximize GPU utilization (provided other resource usage stays low and latency targets are still met)

To achieve these goals, let's briefly analyze the CUDA programming model.

Hardware

CUDA cores are the main compute units of the graphics card. Cards with the Pascal architecture or newer have thousands of CUDA cores, organized into multiple SMs (Streaming Multiprocessors), which can be devoted to a single task or shared among different compute tasks.

Starting from the Volta architecture, NVIDIA introduced Tensor Cores specifically designed for deep learning.

The Turing-architecture Tesla T4 has 40 SMs in total, which share 6 MB of L2 cache.

[Figure: Tesla T4 GPU block diagram]

Each SM consists of 64 FP32 arithmetic units and 8 Tensor Cores.

[Figure: Turing SM block diagram]
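
These per-device numbers can be queried at runtime. The sketch below (assuming device 0 is the card of interest) reads the SM count and L2 cache size from cudaDeviceProp:

```cpp
// Minimal sketch: query SM count and L2 cache size of the installed GPU.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // assumes device 0 is the GPU of interest
    std::printf("%s: %d SMs, %d KB L2 cache\n",
                prop.name, prop.multiProcessorCount, prop.l2CacheSize / 1024);
    return 0;
}
```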

Operator-level optimization of a model requires attention to these lower-level details. For business use cases, however, both operator-level optimization (selecting an appropriate compute backend) and end-to-end analysis are required.

Software

CUDA Streams

A CUDA stream represents a queue of GPU operations; every task submitted to the GPU is assigned to an execution stream. There is a default stream, stream 0, which serves as the default queue. Submitting a task can itself be asynchronous; synchronizing a stream blocks the CPU thread until all tasks submitted to that queue have completed.

Tasks in different streams can execute in parallel, either on different hardware units or via time-sliced concurrency.
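
As a minimal sketch (the scale() kernel here is a made-up example for illustration), the following issues independent copy/compute/copy pipelines to two streams, then blocks the CPU thread until each queue drains:

```cpp
#include <cuda_runtime.h>

// Trivial illustrative kernel: multiply each element by a factor.
__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float *h_a, *h_b, *d_a, *d_b;
    cudaMallocHost((void**)&h_a, bytes);   // pinned host memory, needed for async copies
    cudaMallocHost((void**)&h_b, bytes);
    cudaMalloc((void**)&d_a, bytes);
    cudaMalloc((void**)&d_b, bytes);

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Each stream gets its own copy -> kernel -> copy pipeline; the two
    // pipelines are independent and may overlap on the GPU.
    cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, s1);
    scale<<<(n + 255) / 256, 256, 0, s1>>>(d_a, n, 2.0f);
    cudaMemcpyAsync(h_a, d_a, bytes, cudaMemcpyDeviceToHost, s1);

    cudaMemcpyAsync(d_b, h_b, bytes, cudaMemcpyHostToDevice, s2);
    scale<<<(n + 255) / 256, 256, 0, s2>>>(d_b, n, 0.5f);
    cudaMemcpyAsync(h_b, d_b, bytes, cudaMemcpyDeviceToHost, s2);

    // Synchronizing a stream blocks the CPU thread until its queue is empty.
    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(d_a); cudaFree(d_b);
    cudaFreeHost(h_a); cudaFreeHost(h_b);
    return 0;
}
```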

CUDA Context

CUDA streams and CUDA contexts can be compared to threads and processes: multiple CPU threads that allocate and use GPU resources belong to the same CUDA context, each context has its own isolated address space, and resources cannot be shared across contexts.

By default, the first call to any CUDA runtime API in a process automatically initializes that process's single (primary) CUDA context.

The GPU can execute work from only one context at a time, and by default a process has exactly one context, so multiple processes using the same GPU cannot utilize the hardware simultaneously.
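
As a small illustration of this lazy initialization, the sketch below uses the common cudaFree(0) idiom to force creation of the process's primary context up front; every later runtime call in any thread of the process then runs against that same context:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Before any runtime call, no context exists for this process yet.
    // This first runtime call creates the primary context on the current device.
    cudaError_t err = cudaFree(0);
    if (err != cudaSuccess) {
        std::printf("context init failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Subsequent runtime calls (from any host thread in this process)
    // share that same primary context and its address space.
    void* p = nullptr;
    cudaMalloc(&p, 1 << 20);   // allocation lives in the primary context
    cudaFree(p);
    return 0;
}
```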

Virtualization

Because the GPU cannot execute tasks from different CUDA contexts simultaneously, hardware utilization may be low. In this case, virtualization methods can be used. A typical example is NVIDIA's official MPS (Multi-Process Service), which starts an independent server process that forwards all tasks to the GPU. The drawback of this approach is that isolation is weakened: if this server process fails, all associated tasks are affected.

Summary

To fully utilize the performance of the GPU, some measures can be taken:

  • Run multiple parallel instances to perform calculations.
  • Confine a single graphics card's tasks to a single process, avoiding the underutilization caused by the time-sharing behavior of CUDA contexts.