Choose a CUDA kernel launch configuration that maximizes occupancy.
Currently, in Nebo, we launch each CUDA kernel with a 16x16 grid and derive the number of threads per block from the extents of the field. This may not give the best occupancy, and hence may not give the best performance.
Use cudaOccupancyMaxPotentialBlockSize, provided by the CUDA runtime, to derive a block size that achieves the maximum potential occupancy for each kernel.
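
A minimal sketch of how the occupancy query could drive the launch, not Nebo's actual code. The kernel scaleField, the LaunchConfig struct, the chooseLaunchConfig helper, and the fallback of 256 threads are all hypothetical names and values chosen for illustration. The sketch treats the field as a flat 1D array; the runtime returns a flat thread count, so a 2D or 3D block shape for a real field would still have to be derived from it.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for a Nebo-generated kernel: scales a flat field in place.
__global__ void scaleField(double* field, double factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) field[i] *= factor;
}

// Hypothetical helper type holding a 1D launch configuration.
struct LaunchConfig {
    int blockSize;  // threads per block
    int gridSize;   // blocks per grid
};

// Ask the runtime for the block size with the highest potential occupancy
// for scaleField, then size the grid so that every element is covered.
LaunchConfig chooseLaunchConfig(int n) {
    int minGridSize = 0;  // smallest grid that can reach full occupancy
    int blockSize = 0;    // suggested threads per block
    cudaError_t err = cudaOccupancyMaxPotentialBlockSize(
        &minGridSize, &blockSize, scaleField,
        0,   // no dynamic shared memory
        0);  // no upper limit on the block size
    if (err != cudaSuccess || blockSize <= 0) {
        blockSize = 256;  // assumed fallback for this sketch, not a tuned value
    }
    int gridSize = (n + blockSize - 1) / blockSize;  // round up
    return {blockSize, gridSize};
}

int main() {
    const int n = 1 << 20;
    double* field = nullptr;
    if (cudaMalloc(&field, n * sizeof(double)) != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(field, 0, n * sizeof(double));

    LaunchConfig cfg = chooseLaunchConfig(n);
    scaleField<<<cfg.gridSize, cfg.blockSize>>>(field, 2.0, n);
    cudaError_t err = cudaDeviceSynchronize();

    std::printf("blockSize=%d gridSize=%d status=%s\n",
                cfg.blockSize, cfg.gridSize, cudaGetErrorString(err));
    cudaFree(field);
    return err == cudaSuccess ? 0 : 1;
}
```

The grid size here is computed from the field length rather than taken from minGridSize, which is only the smallest grid that can saturate the device. Maximum occupancy does not always mean the fastest kernel, so the suggested configuration is probably best treated as a starting point and checked against timings.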