Accelerate Vision AI with Batch Mode VC-6 and NVIDIA Nsight

In Vision AI systems, model throughput continues to improve, so every data processing stage, including decoding, preprocessing, and GPU scheduling, must keep pace. A previous article described this performance mismatch between pipeline stages as the data-to-tensor gap. The SMPTE VC-6 (ST 2117-1) codec addresses the gap through a hierarchical, tile-based architecture: images are encoded as progressively refinable Levels of Quality (LoQs), each adding incremental detail. This enables selective retrieval and decoding of only the necessary resolution, region of interest, or color plane, with random access to independently decodable frames. Pipelines can therefore retrieve and decode only what the model requires.
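To make the selective-decode idea concrete, the sketch below models an LoQ pyramid where level 0 is full resolution and each deeper level roughly halves each dimension, then picks the coarsest level that still covers the model's input size. The types, names, and halving assumption are illustrative, not the actual VC-6 SDK API.

```cpp
#include <cassert>
#include <vector>

// Hypothetical model of a VC-6 LoQ pyramid: level 0 is full resolution,
// deeper levels are progressively coarser (assumed ~2x downscale per level).
struct LoQ { int width; int height; };

std::vector<LoQ> buildPyramid(int w, int h, int levels) {
    std::vector<LoQ> p;
    for (int i = 0; i < levels; ++i) {
        p.push_back({w, h});
        w = (w + 1) / 2;
        h = (h + 1) / 2;
    }
    return p;
}

// Pick the coarsest LoQ that still covers the model's input resolution,
// so the pipeline retrieves and decodes only the bytes it needs.
int selectLoQ(const std::vector<LoQ>& pyramid, int targetW, int targetH) {
    int best = 0;
    for (int i = 0; i < (int)pyramid.size(); ++i)
        if (pyramid[i].width >= targetW && pyramid[i].height >= targetH)
            best = i;  // deeper levels are coarser; keep the last that fits
    return best;
}
```

For a 3840x2160 source feeding a 224x224 classifier, this selects a deep, low-resolution level; a 1024x1024 detector would instead pull a shallower one.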

However, efficient execution on a single image does not automatically translate into efficient scaling. As batch sizes increase, the bottleneck shifts from single-image kernel efficiency to workload orchestration, launch cadence, and GPU occupancy. This article focuses on the architectural changes required to scale VC-6 decoding for batched inference and training workloads. NVIDIA Nsight Systems and NVIDIA Nsight Compute were used to identify system- and kernel-level constraints, and those findings guided the redesign of the VC-6 CUDA implementation for batch throughput.
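A common first step when profiling pipeline stages in Nsight Systems is to annotate them with NVTX ranges so CPU work, transfers, and kernel launches are labeled on the timeline. The sketch below shows the RAII pattern; the real calls are `nvtxRangePushA`/`nvtxRangePop` from the NVTX headers, stubbed here (with a recording vector for illustration) so the sketch compiles without CUDA. The stage names are hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-ins for nvtxRangePushA/nvtxRangePop (from the NVTX headers),
// stubbed so this sketch builds anywhere; g_ranges records pushes
// purely for illustration.
static std::vector<std::string> g_ranges;
static void rangePush(const char* name) { g_ranges.push_back(name); }
static void rangePop() {}

// RAII helper: each annotated scope shows up as a named range
// on the Nsight Systems timeline.
struct ScopedRange {
    explicit ScopedRange(const char* name) { rangePush(name); }
    ~ScopedRange() { rangePop(); }
};

void decodeOneImage() {
    { ScopedRange r("cpu_parse_bitstream"); /* host-side parsing */ }
    { ScopedRange r("h2d_copy");            /* PCIe transfer */ }
    { ScopedRange r("gpu_decode_kernels");  /* kernel launches */ }
}
```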

The result is up to 85% lower per-image decode time compared to the previous implementation, with sub-millisecond decode of LoQ-0 (~4K) in batch mode and ~0.2 ms for lower LoQs, at identical output quality. This significantly improves pipeline efficiency for production Vision AI workloads. The new implementation is built around several architectural changes, including batch mode and kernel-level optimizations. Batch mode allows a single decoder to decode multiple images simultaneously. Improved parallelization adds a new work dimension (images) alongside the existing dimensions (tiles, planes) and shifts the initial VC-6 tile-hierarchy work to the GPU.
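At the API level, the shift is from one decoder instance per image to a single decoder that accepts a whole batch. The interface below is a hypothetical sketch of that shape, not the actual VC-6 SDK; the point is that one call covers all N images, letting the implementation fuse work across the batch.

```cpp
#include <cassert>
#include <vector>

// Hypothetical batch-mode decoder interface; names are illustrative.
struct EncodedImage { std::vector<unsigned char> bytes; };
struct DecodedImage { int width = 0, height = 0; };

class BatchDecoder {
public:
    // One decoder instance, one call, N images: internally the decoder can
    // cover all (image, plane, tile) work with a few large kernel launches
    // instead of many small per-image ones.
    std::vector<DecodedImage> decodeBatch(const std::vector<EncodedImage>& batch,
                                          int targetLoQ) {
        (void)targetLoQ;
        std::vector<DecodedImage> out(batch.size());
        // ... submit fused launches per decode stage covering all images ...
        return out;
    }
};
```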

Nsight Compute-driven optimizations delivered a ~20% kernel speedup. The following sections detail these changes to the VC-6 decoder. As with any CUDA optimization effort, the plan was to start with a system-level profiler, Nsight Systems, to identify and fix pipeline-level bottlenecks, then use Nsight Compute to refine individual kernels. Moving from N decoders for N images to a single decoder that processes a batch of N images at once redistributes the same fixed amount of work into fewer kernels, each with more work per launch. This change leads to full GPU utilization and eliminates the inefficiency of launching numerous small kernels.
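The redistribution can be pictured as index math: rather than one grid per image, a single flat grid spans every (image, plane, tile) work item, and each thread recovers its coordinates from its flat index. The decomposition below is an illustrative sketch of that mapping, not the decoder's actual kernel code.

```cpp
#include <cassert>

// Illustrative index math for a batched launch: one flat grid covers all
// (image, plane, tile) work items, so the same total work arrives in far
// fewer, larger kernel launches.
struct WorkItem { int image; int plane; int tile; };

// Map a flat global work index back to (image, plane, tile) coordinates.
WorkItem unflatten(int flatIdx, int planesPerImage, int tilesPerPlane) {
    WorkItem w;
    w.tile  = flatIdx % tilesPerPlane;
    w.plane = (flatIdx / tilesPerPlane) % planesPerImage;
    w.image = flatIdx / (tilesPerPlane * planesPerImage);
    return w;
}
```

In a CUDA kernel the flat index would come from `blockIdx.x * blockDim.x + threadIdx.x`, with the grid sized to `images * planes * tiles` items.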

In the initial implementation, decoding the root and narrow levels of the VC-6 tile hierarchies was performed on the CPU. For single-image decoding, the amount of work in these narrower stages was too small to justify GPU execution; in the batched design, however, aggregating multiple images provides enough parallelism for efficient GPU utilization. Additionally, the algorithm was modified to eliminate host-side logic for handling variable image dimensions, which reduced both synchronization points and submission latency while increasing pipeline fluidity. The new decoder design splits each batch into minibatches that flow through a pipeline of CPU processing, PCIe transfer, and GPU decoding stages. All images in a minibatch occupy the same pipeline stage at once, and the stages operate concurrently so that each hides the others' costs.
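A back-of-the-envelope model shows why the overlapping stages pay off: once the pipeline fills, each additional minibatch costs only the slowest stage rather than the sum of all three. The stage times below are made-up placeholders for illustration.

```cpp
#include <algorithm>
#include <cassert>

// Simple makespan model for the minibatch pipeline with three stages:
// CPU processing, PCIe transfer, GPU decode (times in arbitrary units).

// Without overlap, every minibatch pays the full sum of stage costs.
double serialTime(int minibatches, double cpu, double pcie, double gpu) {
    return minibatches * (cpu + pcie + gpu);
}

// With overlap, after a fill/drain phase the bottleneck stage paces
// the pipeline: each extra minibatch costs only max(cpu, pcie, gpu).
double pipelinedTime(int minibatches, double cpu, double pcie, double gpu) {
    double bottleneck = std::max({cpu, pcie, gpu});
    return (cpu + pcie + gpu) + (minibatches - 1) * bottleneck;
}
```

For example, with 8 minibatches and stage costs of 1, 1, and 2 units, the serial schedule takes 32 units while the pipelined one takes 18, and the advantage grows with batch size.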
