Consider using boost::atomic or boost::lockfree for multithreaded atomic operations
C++11 provides standard-library support for atomics (std::atomic), but boost::atomic offers the same functionality portably, including on compilers without C++11 support. Using it could reduce our use of mutexes in a few places (the memory pool, for example).
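Specifically, look at a spinlock built on an atomic flag (the Boost.Atomic documentation includes one as a usage example). It should be a simple replacement for a mutex where critical sections are short.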
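For reference, a minimal sketch of what such a spinlock could look like. It is written against std::atomic so it compiles without Boost; the assumption is that the boost::atomic version would be the same code with boost::atomic<bool> and boost::memory_order_* substituted. The names Spinlock and count_with_spinlock are hypothetical and not taken from our code base.

```cpp
#include <atomic>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical test-and-set spinlock. Assumption: with Boost.Atomic the
// std:: names below are replaced by their boost:: counterparts.
class Spinlock {
public:
    void lock() {
        // exchange() returns the previous value: true means another thread
        // already holds the lock, so keep trying.
        while (locked_.exchange(true, std::memory_order_acquire)) {
            std::this_thread::yield();  // be polite while waiting
        }
    }

    void unlock() {
        locked_.store(false, std::memory_order_release);
    }

private:
    std::atomic<bool> locked_{false};
};

// Illustrative helper: several threads increment a plain counter that is
// protected only by the spinlock.
long count_with_spinlock(int num_threads, int increments_per_thread) {
    Spinlock lock;
    long counter = 0;
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < increments_per_thread; ++i) {
                // lock()/unlock() are enough for std::lock_guard to work.
                std::lock_guard<Spinlock> guard(lock);
                ++counter;
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    return counter;
}
```

Because the class exposes lock() and unlock(), it can be dropped into code that currently uses std::lock_guard with a mutex. Whether it is actually faster than the mutex in the memory pool would need to be measured.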
Also look at boost::lockfree. It could be useful for the memory pools as well, since it implements a lock-free queue (boost::lockfree::queue) and a lock-free stack (boost::lockfree::stack).