|Abstract||As the core density of future processors keeps increasing, MPI+Threads is becoming a promising program- ming model for large scale SMP clusters. Generally speaking, hybrid MPI+Threads runtime can largely improve intra-node parallelism and data sharing on shared-memory architectures. However, it does not help much on inter-node communication due to the inefficient integration of existing communication and threading libraries. More specifically, existing MPI+Threads runtime systems use coarse-grained locks to protect their thread safety, which leads to heavy lock contention and limit the scalability of the runtime. While kernel threads are efficient for intra-node parallelism, we found that they are too heavy for com- putation/communication overlap in an MPI+Threads runtime system. In this paper we propose a new way for asynchronous MPI communication with user-level threads (MPI+ULT). By enabling ULT context switching inside MPI, MPI communication in one ULT can overlap with computation or communication in other ULTs. MPI+ULT can be used for communication hiding in various scenarios, including MPI point-to-point, collective and one-sided calls. We use MPI+ULT in two applications, a high-performance conjugate gradient benchmark and a genome assembly application, to show how MPI+ULT can help effectively hide communication and reduce runtime overhead. Experiments show that our method helps improve the performance of these applications significantly.