WorkLog: Optimizing Convolution

15 Jan, 2026

Convolution is a fundamental mathematical operation used extensively in signal processing, image processing and deep learning(particularly in Convolutional Neural Networks).

GPU Used: RTX 3060
Code is available at: Convolution in CUDA

Mathematical Definition

Continuous-time Convolution

For two functions f(t) and g(t):

(f * g) (t) = \int_{- \infty}^{\infty} f (τ) g (t - τ) d τ

where,

f(t) : input signal
g(t): impulse response/ kernel/ filter
(f * g)(t): convolution of f and g, i.e. the output signal

Discrete-time Convolution

y [n] = \sum_{k = - \infty}^{\infty} x [k] h [n - k]

where,

x[n]: input discrete signal
h[n]: impulse response / kernel
y[n]: output signal
h[n - k]:
- index reversal
- shift by n

In this post, I benchmark CUDA implementations of convolution and analyze their performance across 1D, 2D, and 3D convolution workloads.

1D Convolution

Constraints

1 ≤ input_size ≤ 1,500,000
1 ≤ kernel_size ≤ 2047
kernel_size ≤ input_size

Benchmarking

cuDNN

Setup/ Description Phase

/* cuDNN setup */
cudnnHandle_t cudnn; // Declare cuDNN execution context required for all cuDNN operations
CUDNN_CHECK(cudnnCreate(&cudnn));
cudnnTensorDescriptor_t xDesc, yDesc; // Declares tensor descriptors in cuDNN. Do not store data - only describe how it is laid out in memory
cudnnFilterDescriptor_t filterDesc;
cudnnConvolutionDescriptor_t convDesc;
CUDNN_CHECK(cudnnCreateTensorDescriptor(&xDesc));
CUDNN_CHECK(cudnnCreateTensorDescriptor(&yDesc));
CUDNN_CHECK(cudnnCreateFilterDescriptor(&filterDesc));
CUDNN_CHECK(cudnnCreateConvolutionDescriptor(&convDesc));

At this stage:
- No computation happens
- No memory is touched
Only telling cuDNN:
- What tensors, filters & convolution look like.
- Which GPU/stream to use.

Configuration Phase

// Set tensor and filter descriptors
CUDNN_CHECK(cudnnSetTensor4dDescriptor(xDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, 1, 1, N, 1));
CUDNN_CHECK(cudnnSetTensor4dDescriptor(yDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, 1, 1, outSize, 1));
CUDNN_CHECK(cudnnSetFilter4dDescriptor(filterDesc, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW, 1, 1, K, 1));
// Set convolution descriptor
CUDNN_CHECK(cudnnSetConvolution2dDescriptor(convDesc,
                                            0, 0, // pad height, width
                                            1, 1, // vertical, horizontal stride
                                            1, 1, // dilation height, width
                                            CUDNN_CROSS_CORRELATION,
                                            CUDNN_DATA_FLOAT));

Mode: CUDNN_CROSS_CORRELATION

Standard convolution operation as used in DL
No kernel flip

Compute Type: CUDNN_DATA_FLOAT

Precision: 32-bit floating point(FP32)

Execution Phase

cudnnConvolutionFwdAlgoPerf_t perf;
int returnedAlgoCount = 0;

CUDNN_CHECK(cudnnGetConvolutionForwardAlgorithm_v7(
            cudnn,
            xDesc,
            filterDesc,
            convDesc,
            yDesc,
            1, // max number of algorithms to return
            &returnedAlgoCount,
            &perf
));
cudnnConvolutionFwdAlgo_t algo = perf.algo;

// Workspace 
size_t workspaceSize = perf.memory;
void* d_workspace = nullptr;
if(workspaceSize > 0){
    CUDA_CHECK(cudaMalloc(&d_workspace, workspaceSize));
}

cudnnConvolutionFwdAlgoPerf_t
- struct contains everything cuDNN learned about one algorithm:

Field	Meaning
algo	Which algorithm was chosen
time	Estimated execution time (ms)
memory	Workspace size required
status	Whether this algo is valid
mathType	Tensor Core / default math

cudnnConvolutionFwdAlgo_t algo = perf.algo; -> locking in cuDNN’s best guess. For our case, cuDNN uses CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PPRECOMP_GEMM

Actual Execution point

CUDNN_CHECK(cudnnConvolutionForward(
                    cudnn,
                    &alpha,
                    xDesc,
                    d_input,
                    filterDesc,
                    d_filter,
                    convDesc,
                    algo,
                    d_workspace,
                    workspaceSize,
                    &beta,
                    yDesc,
                    d_output
                ));

Custom CUDA Kernels

Naive method

__global__ void convolution_1d_kernel(const float* input, const float* kernel, float* output,
                                      int input_size, int kernel_size) {

    int tid = threadIdx.x + blockDim.x * blockIdx.x;
    if(tid + kernel_size < input_size + 1){
        float temp = 0.0f;
        for(int i = 0;i < kernel_size;i++){
            temp += input[tid + i] * kernel[i];
        }
        output[tid] = temp;
    }    
}

Using shared memory

__global__ void convolution_1d_kernel_shared_mem(const float* input, const float* kernel, float* output,
                                      int input_size, int kernel_size) {
    
    extern __shared__ float sI[];   
    int tid = threadIdx.x;
    int base = blockDim.x * blockIdx.x;
    
    for(int i = tid; i < blockDim.x + kernel_size - 1; i+=blockDim.x){
        sI[i] = (base + i < input_size) ? input[base + i] : 0.0f;
        
    }
    __syncthreads();    

    int gid = base + tid;
    if(gid < input_size - kernel_size + 1){
        float temp = 0.0f;
        #pragma unroll
        for(int i = 0;i < kernel_size;i++){
            temp += kernel[i] * sI[tid + i];
        }
        output[gid] = temp;
    }

}

The same benchmarking methodology is extended to 2D and 3D convolutions, enabling a consistent comparison across dimensions.

Constraints for 2D Convolution

1 ≤ input_rows, input_cols ≤ 3072
1 ≤ kernel_rows, kernel_cols ≤ 31
kernel_rows ≤ input_rows
kernel_cols ≤ input_cols

Constraints for 3D Convolution

1 ≤ input_depth, input_rows, input_cols ≤ 256
1 ≤ kernel_depth, kernel_rows, kernel_cols ≤ 5
kernel_depth ≤ input_depth
kernel_rows ≤ input_rows
kernel_cols ≤ input_cols

Results

Figure 1: Latency Comparison of 1D convolution for different kernel sizes

Figure 2: Latency Comparison of 2D convolution for different kernel sizes

Figure 3: Latency Comparison of 3D convolution for different kernel sizes

Comments on the above results

Why cuDNN results are not what I was expecting?
- cuDNN is optimized for:
  - Large batch sizes
  - Multiple channels(C > 1)
  - End-to-end throughput, not single-call latency
  - In out benchmark, out setup is
    - Batch = 1
    - Channels = 1
    - Single convolution per measurement
    - Kernel size sweep(algorithm re-selection every time)
- Worst-case scenario for cuDNN

Future Exploration

Future work will involve profiling with NVIDIA Nsight Systems and NVIDIA Nsight Compute to break down execution time into kernel launches, memory stalls, and compute utilization, providing deeper insight into why cuDNN underperforms compared to custom kernels in low-channel convolution settings.

References

#1D #2D #3D #CUDA #convolution