Explore Kokkos backend for Nebo
Early exploration steps:
Roll out a backend with support only for basic operations such as:
- Perform a basic performance comparison between Nebo and Kokkos for these basic operations on serial, multithreaded, and GPU platforms.
- Explore the path forward for stencil integration (shouldn't be much more work than the first part above).
- Consider how we can make
- Determine why Kokkos integration fails with pow with an integer exponent on GPU (use NeboTest.cpp).
- Fix CMake so that the Kokkos project is added correctly (built when necessary, and no longer requiring two separate builds).
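One possible shape for this, assuming the Kokkos sources are vendored in a `kokkos/` subdirectory; the option name, target name `spatialops`, and macro name below are assumptions, not existing build code:

```cmake
# Build Kokkos as part of this project only when the backend is enabled,
# so a single configure/build pass suffices.
option( ENABLE_KOKKOS "Build the experimental Kokkos backend" OFF )

if( ENABLE_KOKKOS )
  # add_subdirectory builds Kokkos in-tree and exports the Kokkos::kokkos
  # target, removing the need for a separate pre-build of Kokkos.
  add_subdirectory( kokkos )
  target_link_libraries( spatialops PUBLIC Kokkos::kokkos )
  target_compile_definitions( spatialops PUBLIC SPATIALOPS_ENABLE_KOKKOS )
endif()
```

Linking against the exported `Kokkos::kokkos` target also propagates Kokkos' compile flags (e.g. `--expt-extended-lambda` for CUDA) automatically.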
- Replace the header-guard check used in KokkosIntegration.cpp to detect whether Kokkos is included with something more robust.
- Kokkos does its own threading, and our code's threading library may interfere. We can probably remove the Boost threads dependency. The thread pool and related code is currently commented out and will need to be removed properly (commented-out thread-pool code exists in ThreadPool.h, ThreadPool.cpp, and SpatialOpsTools.h at least).
- Standing issue: Kokkos requires an explicit call to Kokkos::initialize(), and Nebo has no such explicit initialization function. Auto-initialization works on CPU, but with CUDA it appears to clear GPU memory. This means we cannot easily auto-initialize when using CUDA, since we do not know whether the user of Nebo has already placed important data in GPU memory. We probably need to add an explicit initialization function to Nebo that user code must call. I did some work to get auto-initialization to compile with CUDA enabled; it can be found in the attached file AttemptNeboAutoInitializeCUDA.patch. I do not suggest going down that route, though, as I spent a lot of time on it and found no solution.
- Merge in master and update the Nebo core with the code that adds compile-time template options. There may be inlining performance issues.
- Change Nebo so that it does not have different modes for different backends by default; only one backend should be needed, running with KOKKOS_INLINE_FUNCTION (this may be too aggressive).
- Use a single Kokkos wrapper functor that invokes Nebo code through Kokkos. The code should then work on both device and host naturally. This can't be done yet, since the CUDA and serial code paths are separate throughout Nebo and are not annotated as device/host.
- Figure out a way to allow custom device and host code when provided (I think you can dispatch host and device to different functions with the same name).
- Integrate Kokkos views into the memory backend in SpatialField, and allow external code to pass in a Kokkos view.
- Use team and vector policies where they seem appropriate.
- Switch from flat-index conversion to the triple index provided by Kokkos.
- Deal with GPU and threaded synchronization between fields used in consecutive Nebo statements.
- Look into proper use of CUDA streams via Kokkos.