Choose a CUDA kernel launch configuration that maximizes occupancy.

Currently, Nebo launches each CUDA kernel with a 16x16 grid and derives the number of threads per block from the extents of the field. This configuration may not give the best occupancy, and hence may not give the best performance.

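Use `cudaOccupancyMaxPotentialBlockSize`, provided by the CUDA runtime, to derive the launch configuration. For a given kernel it returns the block size that achieves the maximum potential occupancy, along with the minimum grid size needed to reach it.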
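A minimal sketch of how the call could be used, not Nebo's actual code. The kernel `scaleField`, the helper `blocksFor`, and the flat 1D indexing are all assumptions made for illustration; only `cudaOccupancyMaxPotentialBlockSize` and the other `cuda*` runtime calls are real API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for a Nebo-generated kernel: scales a flat field in place.
__global__ void scaleField(double* field, double factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        field[i] *= factor;
    }
}

// Hypothetical helper: number of blocks needed to cover n elements.
inline int blocksFor(int n, int blockSize) {
    return (n + blockSize - 1) / blockSize;
}

int main() {
    const int n = 1 << 20;  // assumed field size, for illustration only
    double* field = nullptr;
    if (cudaMalloc(&field, n * sizeof(double)) != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(field, 0, n * sizeof(double));

    // Ask the runtime for the block size that maximizes potential occupancy
    // for this kernel (no dynamic shared memory, no block size limit).
    int minGridSize = 0;
    int blockSize = 0;
    cudaError_t err = cudaOccupancyMaxPotentialBlockSize(
        &minGridSize, &blockSize, scaleField, 0, 0);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "occupancy query failed: %s\n", cudaGetErrorString(err));
        cudaFree(field);
        return 1;
    }

    // Size the grid from the problem extent rather than a fixed 16x16.
    int gridSize = blocksFor(n, blockSize);
    scaleField<<<gridSize, blockSize>>>(field, 2.0, n);
    err = cudaDeviceSynchronize();

    std::printf("blockSize=%d minGridSize=%d gridSize=%d status=%s\n",
                blockSize, minGridSize, gridSize, cudaGetErrorString(err));
    cudaFree(field);
    return err == cudaSuccess ? 0 : 1;
}
```

The query returns a single thread count per block. How that count should be factored into a 2D or 3D block shape matching the field extents is left open here.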
