Field reductions are slow on GPU
Nathan reports that field reductions perform very poorly on the GPU, to the point of being slower than copying the data to the CPU and reducing it there:
It is giving the correct answer. However, it was slower than copying to the cpu and then doing the reduction there. We should merge it for testing and verification, but it isn't ready for practical applications yet.
Here is an online tutorial on some reduction techniques: reduction.pdf
Note that Hao implemented some of these techniques on the gpu-reductions branch, but that work required some additional syntax and was never finished.
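For reference, here is a minimal sketch of one common reduction technique: a shared-memory tree reduction with a grid-stride loop. It is not the code on the gpu-reductions branch and not Nathan's implementation. The names `block_sum`, `field_sum`, and `kThreads` are hypothetical, the field is assumed to be a flat array of floats, and CUDA error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kThreads = 256;  // threads per block; must be a power of two

// Each block accumulates a strided slice of `in` and writes one partial
// sum to out[blockIdx.x].
__global__ void block_sum(const float* in, float* out, int n) {
    __shared__ float cache[kThreads];

    // Grid-stride loop: each thread sums its own elements in a register.
    float acc = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x) {
        acc += in[i];
    }
    cache[threadIdx.x] = acc;
    __syncthreads();

    // Tree reduction in shared memory: halve the active threads each step.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) {
            cache[threadIdx.x] += cache[threadIdx.x + s];
        }
        __syncthreads();
    }
    if (threadIdx.x == 0) {
        out[blockIdx.x] = cache[0];
    }
}

// Hypothetical host wrapper. In real use the field would already live on
// the device; the upload here only keeps the sketch self-contained.
float field_sum(const std::vector<float>& field) {
    const int n = static_cast<int>(field.size());
    if (n == 0) return 0.0f;

    int blocks = (n + kThreads - 1) / kThreads;
    if (blocks > 1024) blocks = 1024;  // cap; the grid-stride loop covers the rest

    float* d_in = nullptr;
    float* d_partial = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_partial, blocks * sizeof(float));
    cudaMemcpy(d_in, field.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    block_sum<<<blocks, kThreads>>>(d_in, d_partial, n);

    // Only the per-block partial sums come back to the host.
    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), d_partial, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_partial);

    float total = 0.0f;
    for (float p : partial) total += p;
    return total;
}
```

With this layout, only one partial sum per block (at most 1024 floats here) has to be copied back, rather than the whole field. Whether that beats the CPU round trip for our field sizes would need to be measured.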