# SpatialOps issues
Source: https://gitlab.multiscale.utah.edu/common/SpatialOps/-/issues (feed last updated 2018-06-07)

---
**Issue #58: Take mask objects by reference in nebo to avoid repeated copies in NeboStencilBuilder**
https://gitlab.multiscale.utah.edu/common/SpatialOps/-/issues/58 | James Sutherland | updated 2018-06-07

See, for example: [here](spatialops/NeboStencilBuilder.h#L963)
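To make the requested change concrete, here is a minimal sketch of the two signatures, with a hypothetical stand-in mask type (the real builder lives in spatialops/NeboStencilBuilder.h):

```cpp
#include <vector>

// Hypothetical stand-in for the real nebo mask type, which wraps a
// potentially large list of flat indices.
struct NeboMask{ std::vector<size_t> flatIndices; };

// Current pattern: pass-by-value copies the mask (and its index vector)
// every time a stencil builder is constructed.
void build_stencil_by_value( NeboMask mask ){ /* ... use mask ... */ }

// Requested pattern: pass-by-const-reference avoids the repeated copies.
void build_stencil_by_ref( const NeboMask& mask ){ /* ... use mask ... */ }
```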
---
**Issue #57: Check build and testing once Kokkos supports GPU execution on multiuser systems**
https://gitlab.multiscale.utah.edu/common/SpatialOps/-/issues/57 | Siddartha Ravichandran | updated 2018-01-12

Issue raised with Kokkos: https://github.com/kokkos/kokkos/issues/1305.
This issue concerns the integration of Kokkos into Nebo. For GPU execution, device memory was completely cleared whenever `Kokkos::Finalize()` was called, which caused problems when running multiple Nebo programs: it could erase device memory needed by other programs, or even device memory not managed by the user's programs at all.
I believe this [CUDAIssuePatch.patch](/uploads/841584f03cd2bd8af55c8c772f0d71f3/CUDAIssuePatch.patch) will be a good starting point once the issue is resolved. Once this patch is applied, we can compile without needing to set NEBO_AUTO_INITIALIZE to OFF for CUDA.

---
**Issue #56: local() and mapped_value() operators (as part of mapped reduction) only work using native nebo backend**
https://gitlab.multiscale.utah.edu/common/SpatialOps/-/issues/56 | Siddartha Ravichandran | updated 2018-01-10

The `local()` and `mapped_value()` Nebo operators, introduced as part of the `NeboMappedReduction` operation, do not work on Kokkos because they depend on the outer index determined by the outer loop in the native nebo backend. Since we lose control of the outer loop when using Kokkos, the information needed to drive the `local()` and `mapped_value()` operators is no longer available.

---
**Issue #8: Field reductions are slow on GPU**
https://gitlab.multiscale.utah.edu/common/SpatialOps/-/issues/8 | James Sutherland | updated 2018-02-25

Nathan indicates that GPU performance of field reductions is very poor (possibly slower than a transfer to CPU and back):
> It is giving the correct answer. However, it was slower than copying to the CPU and then doing the reduction there. We should merge it for testing and verification, but it isn't ready for practical applications yet.
Here is an online tutorial on some reduction techniques: [reduction.pdf](https://software.crsim.utah.edu:8443/James_Research_Group/SpatialOps/uploads/f2c02ef24e5e5f2d072531f9f817ee06/reduction.pdf)
[And another link here](http://devblogs.nvidia.com/parallelforall/faster-parallel-reductions-kepler)
Note that Hao implemented some of this on the gpu-reductions branch, but this involved some additional syntax. He never saw this through...
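For reference, a minimal sketch of the warp-shuffle technique from the linked NVIDIA post, written here in modern CUDA syntax; this is illustrative only and is not the gpu-reductions branch code (note that `atomicAdd` on `double` requires compute capability 6.0+ and `__shfl_down_sync` requires CUDA 9+):

```cpp
#include <cuda_runtime.h>

// Fold the 32 lanes of a warp down to lane 0 without shared memory.
__inline__ __device__ double warp_reduce_sum( double val ){
  for( int offset = warpSize/2; offset > 0; offset /= 2 )
    val += __shfl_down_sync( 0xffffffff, val, offset );
  return val;
}

__global__ void field_sum( const double* in, double* out, const int n ){
  double sum = 0.0;
  // grid-stride loop: any launch configuration covers the whole field
  for( int i = blockIdx.x*blockDim.x + threadIdx.x; i < n; i += blockDim.x*gridDim.x )
    sum += in[i];
  sum = warp_reduce_sum( sum );
  // one atomicAdd per warp keeps contention low
  if( (threadIdx.x & (warpSize-1)) == 0 ) atomicAdd( out, sum );
}
```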
---
**Issue #3: Fix bug in using threads and GPU Nebo backends at the same time**
https://gitlab.multiscale.utah.edu/common/SpatialOps/-/issues/3 | James Sutherland | updated 2018-02-25

First reported by Chris Earl in May, 2014.
This bug only appears on certain systems (prism and a few laptops). To reproduce the bug, set `ENABLE_THREADS=ON` and `ENABLE_CUDA=ON` during configuration.
Example errors:
```
../libspatialops-structured.a(spatialops-structured_generated_CudaMemoryAllocator.cu.o): In function `_GLOBAL__sub_I_tmpxft_000016bb_00000000_3_CudaMemoryAllocator.cudafe1.cpp':
tmpxft_000016bb_00000000-3_CudaMemoryAllocator.cudafe1.cpp:(.text.startup+0x6b): undefined reference to `boost::system::generic_category()'
tmpxft_000016bb_00000000-3_CudaMemoryAllocator.cudafe1.cpp:(.text.startup+0x77): undefined reference to `boost::system::generic_category()'
tmpxft_000016bb_00000000-3_CudaMemoryAllocator.cudafe1.cpp:(.text.startup+0x83): undefined reference to `boost::system::system_category()'
collect2: error: ld returned 1 exit status
```
These errors imply there is a problem with how boost and CudaMemoryAllocator.cu interact; the undefined references suggest that Boost.System is not being linked into the nvcc-generated objects.
---
**Issue #38: Explore usage of C++11 thread tools rather than reliance on boost::thread**
https://gitlab.multiscale.utah.edu/common/SpatialOps/-/issues/38 | James Sutherland | updated 2018-02-25

- [ ] Ensure that there is no performance problem with std::mutex, std::thread, etc. relative to boost::thread.
---
**Issue #51: Consider using std::thread rather than boost::thread**
https://gitlab.multiscale.utah.edu/common/SpatialOps/-/issues/51 | James Sutherland | updated 2017-04-12

C++11 introduced [std::thread](http://en.cppreference.com/w/cpp/thread), which we should consider using instead of [boost::thread](http://www.boost.org/doc/libs/1_63_0/doc/html/thread.html).
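A minimal illustration of how mechanical the swap would be for this issue and #38: the std:: primitives mirror their boost counterparts almost exactly (a sketch, not SpatialOps code):

```cpp
#include <mutex>
#include <thread>
#include <vector>

// std::thread / std::mutex / std::lock_guard have the same shape as the
// boost versions, so the port is largely a namespace swap.
int main(){
  std::mutex m;
  long counter = 0;
  std::vector<std::thread> pool;
  for( int i=0; i<4; ++i ){
    pool.emplace_back( [&](){
      std::lock_guard<std::mutex> lock(m);  // boost::lock_guard -> std::lock_guard
      ++counter;
    });
  }
  for( auto& t : pool ) t.join();
  return counter == 4 ? 0 : 1;
}
```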
---
**Issue #12: Support dense linear algebra**
https://gitlab.multiscale.utah.edu/common/SpatialOps/-/issues/12 | James Sutherland | updated 2018-02-25

# Overall Goal
We need to support something like this:
```cpp
Matrix<FieldT> mat;
// matrix assembly:
for( size_t irow=0; irow<nrow; ++irow ){
  for( size_t icol=0; icol<ncol; ++icol ){
    mat[irow][icol] <<= ...
  }
}
// solve pointwise Ax=b problem
// and store the result in a field:
solution <<= mat.solve( rhs );
// alternatively: mat.solve( rhs, solution );
// eigenvalue decomposition
vector<FieldT*> eigVals;
mat.eigenvalues( eigVals );
```
This should dispatch to GPU or CPU as appropriate (similar to what nebo currently does for field operations).
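To pin down the pointwise semantics, here is a rough sketch of what `mat.solve( rhs, solution )` would do at each grid point, with raw arrays standing in for fields and naive Gaussian elimination standing in for whatever the backing library provides; the layout and names are hypothetical, not the SpatialOps API:

```cpp
#include <vector>

// At each of npts grid points, gather an n x n dense system from the
// matrix-of-fields, solve it, and scatter x back into the solution fields.
void pointwise_solve( const std::vector<std::vector<const double*>>& mat, // mat[i][j] = field data
                      const std::vector<const double*>& rhs,              // rhs[i]    = field data
                      std::vector<double*>& solution,                     // x[i]      = field data
                      const size_t n, const size_t npts )
{
  std::vector<double> A(n*n), b(n);
  for( size_t p=0; p<npts; ++p ){
    for( size_t i=0; i<n; ++i ){
      b[i] = rhs[i][p];
      for( size_t j=0; j<n; ++j ) A[i*n+j] = mat[i][j][p];
    }
    // forward elimination (no pivoting, for brevity)
    for( size_t k=0; k<n; ++k ){
      for( size_t i=k+1; i<n; ++i ){
        const double f = A[i*n+k] / A[k*n+k];
        for( size_t j=k; j<n; ++j ) A[i*n+j] -= f * A[k*n+j];
        b[i] -= f * b[k];
      }
    }
    // back substitution
    for( size_t i=n; i-- > 0; ){
      double s = b[i];
      for( size_t j=i+1; j<n; ++j ) s -= A[i*n+j] * solution[j][p];
      solution[i][p] = s / A[i*n+i];
    }
  }
}
```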
-------
# Milestones/SubTasks
- [ ] regression testing
- [ ] verify consistency of row and column indexes in all class members
- [ ] add support for eigenvalues
- [ ] use an actively developed library; uBlas was easy to add, but is 7 years old
- [ ] improve performance by removing unnecessary data transfers and function calls
- [ ] support parallel CPU execution
- [ ] support GPU execution
---
**Issue #37: SpatialField Operator= Right Hand GPU Synchronization**
https://gitlab.multiscale.utah.edu/common/SpatialOps/-/issues/37 | James Sutherland | updated 2018-02-25

There is likely a bug in the operator= of SpatialField whenever the right hand side of the operator= is a SpatialField that lives on the GPU.
Currently, we copy a field on the right hand side on the GPU via a CUDA memcpy call with the left hand side's stream passed into it. This does not consider the right hand side's stream and therefore likely does not wait for operations to finish on the right hand side before starting the copy into the left hand side.
To fix this, the right hand side's stream should be used during the memcpy instead of the left hand side's stream. If the left hand side is also on the GPU, then an additional step is needed to synchronize the left hand and right hand streams. An example of how multiple streams may be synchronized can be found in spatialops/NeboLhs.h:gpu_assign().
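A minimal sketch of that handshake using plain CUDA runtime calls (the function and variable names here are illustrative, not SpatialOps code; the in-tree pattern to follow is NeboLhs.h:gpu_assign()):

```cpp
#include <cuda_runtime.h>

// The copy into the lhs must not begin until all work queued on the
// rhs stream has completed.
void synced_device_copy( double* lhsPtr, const double* rhsPtr, size_t nbytes,
                         cudaStream_t lhsStream, cudaStream_t rhsStream )
{
  cudaEvent_t rhsReady;
  cudaEventCreateWithFlags( &rhsReady, cudaEventDisableTiming );
  cudaEventRecord( rhsReady, rhsStream );         // mark end of rhs work
  cudaStreamWaitEvent( lhsStream, rhsReady, 0 );  // lhs stream waits on it
  cudaMemcpyAsync( lhsPtr, rhsPtr, nbytes, cudaMemcpyDeviceToDevice, lhsStream );
  cudaEventDestroy( rhsReady );                   // safe: freed once it completes
}
```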
## Steps: ##
- [ ] Create test that verifies issue exists
- [ ] Synchronize GPU -> CPU operator=
- [ ] Synchronize GPU -> GPU operator=
- [ ] Verify issue no longer exists
---
**Issue #15: Consider using boost::atomic or boost::lockfree for multithreaded atomic operations**
https://gitlab.multiscale.utah.edu/common/SpatialOps/-/issues/15 | James Sutherland | updated 2018-02-25

C++11 provides language-level support for this, but [boost::atomic](http://www.boost.org/doc/libs/1_58_0/doc/html/atomic.html) provides a portable way to accomplish this. Doing this could reduce our usage of mutex in a few places (memory pool, for example).
Specifically, look at [spinlock](http://www.boost.org/doc/libs/1_58_0/doc/html/atomic/usage_examples.html#boost_atomic.usage_examples.example_spinlock), which should be a simple replacement for mutex.
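The linked spinlock example boils down to the following; it is written here with the C++11 `std::atomic_flag` equivalent rather than `boost::atomic`, as a sketch of the idea:

```cpp
#include <atomic>

// Minimal spinlock in the spirit of the boost usage example: lock by
// spinning on test_and_set, unlock by clearing the flag.
class Spinlock{
  std::atomic_flag flag_;
public:
  Spinlock(){ flag_.clear(); }
  void lock()  { while( flag_.test_and_set( std::memory_order_acquire ) ){ /* spin */ } }
  void unlock(){ flag_.clear( std::memory_order_release ); }
};
```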
Also look at [boost::lockfree](http://www.boost.org/doc/libs/1_58_0/doc/html/lockfree.html). This could be useful for memory pools as well, since it implements a [lock-free queue](http://www.boost.org/doc/libs/1_58_0/doc/html/boost/lockfree/queue.html) and [lock-free stack](http://www.boost.org/doc/libs/1_58_0/doc/html/boost/lockfree/stack.html).

---
**Issue #42: Issues with std::pow in Nebo for GPU**
https://gitlab.multiscale.utah.edu/common/SpatialOps/-/issues/42 | James Sutherland | updated 2018-02-25

Dav de St. Germain (dav@sci.utah.edu) reported this issue when building Wasatch.
## Description
It appears that in `include/spatialops/NeboOperators.h` there are 4 (ish) locations where `pow()` is used with an integer as the 2nd argument.
I believe that NVCC will replace the `std::pow(double, double)` with its own version (device version), but it will not do the same with `pow( double, int )`.
## Current workaround
The work-around I am currently using is to cast the 2nd argument of the `pow()` calls in `NeboOperators.h` to `double`. I believe that this makes the pow call much slower than raising to an integer power. What I don't know is how this issue should be fixed (or who the CUDA expert is that could do so), or where to fix it. Also, it is possible that `pow()` is not called very often, in which case fixing this may not really be necessary.
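A small illustration of the workaround; `pow_int` is a hypothetical helper, since in `NeboOperators.h` the cast would be applied directly at the `pow()` call sites:

```cpp
#include <cmath>

// Force the (double, double) overload so that nvcc substitutes its own
// device version of pow(), at the cost of a slower general-power path.
inline double pow_int( const double x, const int n ){
  return std::pow( x, static_cast<double>(n) );  // instead of std::pow(x, n)
}
```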
## Post-commit action
The wasatch3p build script currently hacks `NeboOperators.h` to get it compiling with CUDA on newer architectures (e.g., Titan). We should remove that hack once this issue is completed.

---
**Issue #11: Nebo Marks: slicing arrays**
https://gitlab.multiscale.utah.edu/common/SpatialOps/-/issues/11 | James Sutherland | updated 2018-02-25

Chris implemented most of the internals in the master branch.
Needs more testing and hardening, along with API implementation
There are really two parts to this:
1. "Marks" [PlannedNeboFeatures-Marks.pdf](https://software.crsim.utah.edu:8443/James_Research_Group/SpatialOps/uploads/d6456b4c7b4400ae8a2068946a6d90f2/PlannedNeboFeatures-Marks.pdf)
1. "Slices" [PlannedNeboFeatures-slices.pdf](https://software.crsim.utah.edu:8443/James_Research_Group/SpatialOps/uploads/05e4ee6ec3bcdcc5b432fbce05e0ad10/PlannedNeboFeatures-slices.pdf)
---
**Issue #16: Introspect core count in SpatialOps**
https://gitlab.multiscale.utah.edu/common/SpatialOps/-/issues/16 | James Sutherland | updated 2018-02-25

# Compile-time introspection:
CMake provides a way to [determine processor counts](http://www.cmake.org/cmake/help/v3.0/module/ProcessorCount.html). See also [this blog post](http://www.kitware.com/blog/home/post/63).
We could leverage this to help auto-populate the number of threads for SpatialOps. This could, in turn, be used in ExprLib.
# Runtime introspection
Several approaches are given [here](http://stackoverflow.com/questions/150355/programmatically-find-the-number-of-cores-on-a-machine).
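The simplest portable option from that discussion is built into C++11; note it reports hardware threads (which may include hyperthreads), so it can overcount physical cores:

```cpp
#include <iostream>
#include <thread>

// Runtime core-count query built into C++11. Returns the number of
// hardware threads, or 0 if the value cannot be determined.
int main(){
  const unsigned n = std::thread::hardware_concurrency();
  std::cout << "detected hardware threads: " << n << "\n";
  return 0;
}
```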
# Other considerations
Once the threadcommunicator branch is merged, we have a few things to note:
- The ExprLib and SpatialOps thread counts are multiplicative; their product should never exceed the physical core count on the machine.
- The core count per socket should be divisible by the SpatialOps thread count.
- Thread count should generally not exceed the number of cores per socket if ExprLib is built on top of SpatialOps.
*Note also that execution will halt in the threadcommunicator branch if the number of threads exceeds the number of cores. This could be fixed if we can guarantee that the threadpool is not sized to exceed the physical core count.*

---
**Issue #21: std::isnan is problematic for some NVCC versions**
https://gitlab.multiscale.utah.edu/common/SpatialOps/-/issues/21 | James Sutherland | updated 2018-02-25

## Problem description
In [FieldComparisons.h](spatialops/structured/FieldComparisons.h), we check for NaN in equality comparison. Nathan wanted this for better behavior.
However, it appears that some versions of NVCC do not support this. Notably, prism fails to compile ExprLib when CUDA builds are active.
| Machine | nvcc Version | Comments |
| :-----: | :----------: | :------: |
| prism | 6.0.1 | fails to compile std::isnan |
| aurora | 6.5.12 | compiles without problem |
## Sample compiler error
Here is a sample compiler error (from building ExprLib on prism):
```
/scratch/local/prism_fast/jcs/ExprLib/buildCuda/so/include/spatialops/structured/FieldComparisons.h(156): error: expected an identifier
detected during instantiation of "__nv_bool SpatialOps::field_equal(const FieldT &, const FieldT &, double) [with FieldT=FieldT]"
/scratch/local/prism_fast/jcs/ExprLib/buildCuda/test/FieldMgr/main.cpp.cu(40): here
```
## Possible workaround
One possible solution is to pull in the NVCC compiler version when compiling SpatialOps and then only perform the `isnan` checks if the compiler version is high enough.
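A sketch of that idea using the version macros nvcc defines (`__CUDACC_VER_MAJOR__` only exists in CUDA 7.5 and later, so older compilers such as prism's 6.0 would skip the check automatically); the cutoff below is a placeholder, not a tested value:

```cpp
#include <cmath>

// Only perform the NaN check when not compiling under nvcc, or when the
// nvcc version is new enough to handle std::isnan.
inline bool nan_safe_equal( const double a, const double b ){
#if !defined(__CUDACC__) || ( defined(__CUDACC_VER_MAJOR__) && __CUDACC_VER_MAJOR__ >= 7 )
  if( std::isnan(a) || std::isnan(b) ) return false;  // NaN never compares equal
#endif
  return a == b;
}
```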
---
**Issue #35: FieldVector norms**
https://gitlab.multiscale.utah.edu/common/SpatialOps/-/issues/35 | Mike Hansen | assignee: James Sutherland | updated 2018-02-25

Suppose we have a FieldVector with ten elements in the vector that sits on a grid of 1000 points. I want a function that takes a norm over the ten elements at each point, producing a single value at each of the 1000 grid points. This is different than taking a norm of each element over the whole grid, which would produce a single value for each of the ten elements.
I would like to have support for the L1 norm (sum of absolute values), L2 norm (square root of sum of squares), L-infinity norm (max of absolute values), as well as minimum and maximum.
**Example of L2 Norm:**
List:
```c++
Vector3(1, 0, 0)
Vector3(2, 0, 0)
Vector3(3, 0, 0)
Vector3(4, 0, 0)
```
`L2Norm`: List: 1, 2, 3, 4
**Example of min, max:**
List:
```c++
Vector3(1, -1, 0)
Vector3(2, -1, -2)
Vector3(10, -100, 20)
```
`VectorMax`: List: 1, 2, 20
`VectorMin`: List: -1, -2, -100
_Clarification from Michael Brown_: As you mentioned, you would like to take the norm over the elements of a FieldVector at each grid point. I believe this to mean that, thinking of the FieldVector as a list of vectors, you would like to compute the norm of each vector and have it returned as a list of scalar values (this would pop out as a Field, given the current layout of data in a FieldVector).
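A rough sketch of the requested pointwise reduction, with raw arrays standing in for the FieldVector's element fields (names and layout are hypothetical, not the SpatialOps API):

```cpp
#include <cmath>
#include <vector>

// 'elements' holds one pointer per FieldVector element, each over npts
// grid points; 'result' is a single field with one value per grid point.
void pointwise_l2_norm( const std::vector<const double*>& elements,
                        double* result, const size_t npts )
{
  for( size_t p=0; p<npts; ++p ){
    double sumsq = 0.0;
    for( const double* f : elements ) sumsq += f[p] * f[p];
    result[p] = std::sqrt( sumsq );  // L1/L-inf/min/max differ only in the accumulator
  }
}
```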
---
**Issue #45: Allow FieldMatrix and FieldVector to hold const SpatialFldPtr&lt;FieldT&gt;**
https://gitlab.multiscale.utah.edu/common/SpatialOps/-/issues/45 | James Sutherland | updated 2018-02-25

In some cases, we may only have `const SpatialFldPtr<FieldT>` that we want to use to build the `FieldMatrix` and `FieldVector` objects, but we still want to perform operations like a linear solve. We currently don't support this.
@mahanse will need this very soon.

---
**Issue #2: Invalidating ghost cells in Nebo**
https://gitlab.multiscale.utah.edu/common/SpatialOps/-/issues/2 | James Sutherland | updated 2018-02-25

We planned and designed Nebo to invalidate ghost cells that it cannot populate with valid results because of stencil operations. However, invalidating these ghost cells breaks regression tests for pretty much any test that uses stencils.
Thus Nebo currently does NOT invalidate ghost cells.
We need to change this and update all tests that fail as a result.
This is currently implemented on the `invalid-ghost` branch, but needs cleanup & merge.
See also #7, which is closely related to this issue.

---
**Issue #24: Support for GPU particle interpolants**
https://gitlab.multiscale.utah.edu/common/SpatialOps/-/issues/24 | Tony Saad | assignee: James Sutherland | updated 2018-02-25

Attached is a cuda file that Sahana developed to implement the apply_to_field particle interpolants on the GPU. However, Sahana was unable to incorporate this into the SpatialOps build system.
[ParticleOperators_gpu.cu](https://software.crsim.utah.edu:8443/James_Research_Group/SpatialOps/uploads/8c1ba768524db9a71690e5bbdb94cfcb/ParticleOperators_gpu.cu)
---
**Issue #9: Finish up stencil convention changes**
https://gitlab.multiscale.utah.edu/common/SpatialOps/-/issues/9 | James Sutherland | updated 2018-02-25

Chris had been working on this prior to his departure.
It changes the convention for how we specify extents/offsets when creating stencils.
There is a branch `new-stencil-convention` that implements this, but it needs to be tested against all downstream apps prior to merging.
Two things to be done here:
- [ ] Document the changes in Doxygen
- [ ] Test downstream apps (coordinate with app owner listed below)
- [ ] ExprLib (James)
- [ ] PoKiTT (Nathan)
- [ ] ODT (James or Josh)
- [ ] LBMS (James or Derek)
- [ ] Wasatch (Tony)
Basic workflow (apply for each downstream project):
1. Build a project that uses SpatialOps against the master branch. Run tests; everything should pass.
1. Build the same project against the new-stencil-convention branch of SpatialOps. Run tests; not everything will pass.
1. Discuss failing tests with developers of that project. Help them fix failing tests.
1. Repeat with a new project.

---
**Issue #36: Kokkos integration**
https://gitlab.multiscale.utah.edu/common/SpatialOps/-/issues/36 | James Sutherland | updated 2018-02-25

# Explore Kokkos backend for Nebo #
## Early exploration steps: ##
- [x] Roll out backend with support only for basic operations such as `+` `-` `/` `*`
- [x] Perform basic performance comparison between nebo and kokkos for these basic operations on serial, multithreaded and GPU platforms.
## Secondary steps: ##
- [ ] explore path forward for stencil integration (shouldn't be much more work than the first part above)
- [ ] consider how we can make `cond` work
## Road map: ##
- [ ] Determine why Kokkos integration fails with pow int on GPU (use NeboTest.cpp)
- [x] Fix CMake such that the Kokkos project is added correctly (built when necessary, no longer requiring two builds).
- [x] Change the header-guard check in KokkosIntegration.cpp to something better for determining whether Kokkos is included.
- [x] Kokkos does its own threading, and our code's threading library may interfere. We can probably remove boost threads. Currently the threadpool and related code is commented out and will need to be removed properly (commented-out threadpool code remains in ThreadPool.h, ThreadPool.cpp, and SpatialOpsTools.h at least).
- [x] There is a standing issue in that Kokkos requires an explicit call to KokkosInitialize() and Nebo has no such explicit initialization function. Auto-initialization works on CPU, but with CUDA it seems to clear GPU memory. This implies we cannot easily auto-initialize when using CUDA, since we do not know whether the user of Nebo has put important data into memory. We probably need to add an explicit initialization function to Nebo that must be called by user code. I did some work to get auto-initialization to compile with CUDA enabled; it can be found in the attached [AttemptNeboAutoInitializeCUDA.patch](/uploads/39f26522f41070b99e55381e5dbb0d9d/AttemptNeboAutoInitializeCUDA.patch). I do not suggest going down that route, though, as I have spent a lot of time on it and found no solution.
- [x] Merge in master and update the Nebo core with code that adds compile-time template options. There may be inlining performance issues.
- [ ] Change Nebo such that it doesn't have different modes for different backends by default; only one backend that runs with KOKKOS_INLINE should be needed (this may be too aggressive).
- [ ] Use a single Kokkos wrapper functor that calls Nebo code with Kokkos (a minimal illustration appears after this list). The code should work on device and host naturally. This can't be done yet, since the CUDA code and serial code are separate throughout Nebo and are not marked `__device__` and `__host__`.
- [ ] Figure out a way to allow custom device and host code if given (I think you can give `__host__` and `__device__` to different functions with the same name).
- [ ] Integrate Kokkos views into memory backend in SpatialField. Allow external code to pass in a Kokkos view.
- [ ] Use team and vector policies if they seem appropriate.
- [ ] Switch from flat index conversion to triple index provided by Kokkos.
- [ ] Deal with GPU and Threaded synchronization between fields used in consecutive Nebo statements.
- [ ] Look into proper use of Cuda streams via Kokkos
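As referenced in the wrapper-functor item above, here is a minimal self-contained example of the Kokkos pattern the road map points toward, using the public Kokkos API; the lambda body is a stand-in for an inlined nebo expression, and the explicit initialize/finalize calls relate to the auto-initialization item:

```cpp
#include <Kokkos_Core.hpp>

// With KOKKOS_LAMBDA the same source compiles for host and device
// backends, which is the goal of the single-wrapper-functor item.
int main( int argc, char* argv[] ){
  Kokkos::initialize( argc, argv );   // the explicit initialize nebo currently lacks
  {
    const int n = 1000;
    Kokkos::View<double*> a("a", n), b("b", n);
    Kokkos::parallel_for( "nebo_assign", n, KOKKOS_LAMBDA( const int i ){
      a(i) = 2.0 * b(i) + 1.0;        // stand-in for the nebo RHS expression
    });
    Kokkos::fence();                  // cf. the synchronization items above
  }
  Kokkos::finalize();
  return 0;
}
```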