Tiling for improved performance?
Rather than our current memory decomposition, we could interleave thread access to memory. This may reduce contention on reads from main memory during stencil operations.
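As a rough sketch of what interleaving could look like, the snippet below compares a contiguous block decomposition with a round-robin assignment of tiles to threads. The function names, tile count, and thread count are hypothetical and not taken from our code. Whether this actually reduces contention would still need to be measured.

```python
def block_owner(tile, n_tiles, n_threads):
    """Contiguous decomposition: each thread owns one run of adjacent tiles."""
    per_thread = -(-n_tiles // n_threads)  # ceiling division
    return tile // per_thread


def interleaved_owner(tile, n_threads):
    """Interleaved decomposition: adjacent tiles go to different threads."""
    return tile % n_threads


# Illustrative sizes only (assumptions, not our real configuration).
n_tiles, n_threads = 8, 4

block = [block_owner(t, n_tiles, n_threads) for t in range(n_tiles)]
interleaved = [interleaved_owner(t, n_threads) for t in range(n_tiles)]

print(block)        # [0, 0, 1, 1, 2, 2, 3, 3]
print(interleaved)  # [0, 1, 2, 3, 0, 1, 2, 3]
```

With the interleaved mapping, threads working in lockstep touch neighboring tiles rather than regions that are far apart in memory.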
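Tiling may also improve serial performance, since a tile small enough to fit in cache lets neighboring stencil reads reuse data that is already loaded.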
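To make the loop structure concrete, here is a minimal sketch of a tiled 5-point stencil next to a plain row-by-row version. The function names, the stencil itself, and the tile size are assumptions chosen for illustration. Python is used only to show the traversal order; any cache benefit would only show up in compiled code.

```python
def average5(grid, i, j):
    """5-point average of a cell and its four neighbors."""
    return (grid[i][j] + grid[i - 1][j] + grid[i + 1][j]
            + grid[i][j - 1] + grid[i][j + 1]) / 5.0


def stencil_naive(grid):
    """Sweep the interior row by row; boundary cells are copied unchanged."""
    n, m = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            out[i][j] = average5(grid, i, j)
    return out


def stencil_tiled(grid, tile=4):
    """Same stencil, but the interior is visited one tile x tile block at a time."""
    n, m = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for i0 in range(1, n - 1, tile):          # top-left corner of each tile
        for j0 in range(1, m - 1, tile):
            for i in range(i0, min(i0 + tile, n - 1)):
                for j in range(j0, min(j0 + tile, m - 1)):
                    out[i][j] = average5(grid, i, j)
    return out


# Small hypothetical test grid; both traversals must give identical results.
grid = [[float(i * 10 + j) for j in range(10)] for i in range(10)]
assert stencil_tiled(grid, tile=4) == stencil_naive(grid)
```

Only the visiting order changes, so the tiled version can be checked against the untiled one before any performance work begins.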
We may be able to accomplish this through clever use of MemoryWindows.