Multiparadigm Parallel Acceleration for Reservoir Simulation
- Larry Siu-Kuen Fung (Saudi Aramco) | Mohamad Sindi (Saudi Aramco) | Ali H. Dogru (Saudi Aramco)
- Document ID
- Society of Petroleum Engineers
- SPE Journal
- Publication Date
- August 2014
- Document Type
- Journal Paper
- 716 - 725
- 2013. Society of Petroleum Engineers
- 7.2.1 Risk, Uncertainty and Risk Assessment, 5.3.4 Integration of geomechanics in models, 1.2.2 Geomechanics, 5.5 Reservoir Simulation, 1.2.3 Rock properties, 4.3.4 Scale, 1.6.9 Coring, Fishing
- Linear Solver, Parallel Acceleration, Reservoir Simulation, Multi-Paradigm, Heterogeneous Computing
With the advent of the multicore central-processing unit (CPU), today’s commodity PC clusters are effectively a collection of interconnected parallel computers, each with multiple multicore CPUs and large shared random-access memory (RAM), connected by high-speed networks. Each computer, referred to as a compute node, is a powerful parallel computer on its own. Each compute node can also be equipped with acceleration devices such as the general-purpose graphics processing unit (GPGPU) to further speed up computationally intensive portions of the simulator. Reservoir-simulation methods that can exploit this heterogeneous hardware system can be used to solve very-large-scale reservoir-simulation models and run significantly faster than conventional simulators. Because typical PC clusters are essentially distributed shared-memory computers, mixed-paradigm parallelism (distributed-shared memory), such as message-passing interface and open multiprocessing (MPI-OMP), should work well for computational efficiency and memory use. In this work, we compare and contrast the single-paradigm programming models, MPI or OMP, with the mixed-paradigm MPI-OMP programming model for a class of solver methods suited to the different modes of parallelism. The results show that the distributed-memory (MPI-only) model has superior multicompute-node scalability, whereas the shared-memory (OMP-only) model has superior parallel performance on a single compute node. The mixed MPI-OMP model and the OMP-only model are more memory-efficient for the multicore architecture than the MPI-only model because they require less or no halo-cell storage for the subdomains. To exploit the fine-grain shared-memory parallelism available on the GPGPU architecture, algorithms should be suited to single-instruction multiple-data (SIMD) parallelism, and any recursive operations are serialized.
In addition, solver methods and data storage need to be reworked to coalesce memory access and to avoid shared-memory bank conflicts. Wherever possible, the cost of data transfer through the peripheral component interconnect express (PCIe) bus between the CPU and GPGPU needs to be hidden by means of asynchronous communication. We applied multiparadigm parallelism to accelerate compositional reservoir simulation on a GPGPU-equipped PC cluster. On a dual-CPU-dual-GPU compute node, the parallelized solver running on the dual-GPGPU Fermi M2090Q achieved a speedup of up to 19 times over the serial CPU (1-core) results and up to 3.7 times over the parallel dual-CPU X5675 results in the mixed MPI+OMP paradigm for a 1.728-million-cell compositional model. Parallel performance depends strongly on the subdomain size. The parallel CPU solve performs better for smaller domain partitions, whereas the GPGPU solve requires large partitions on each chip for good parallel performance. This is related to improved cache efficiency on the CPU for small subdomains and the loading requirement for massive parallelism on the GPGPU. Therefore, for a given model, the multinode parallel performance of the GPGPU decreases relative to that of the CPU as the model is further subdivided into smaller subdomains to be solved on more compute nodes. To illustrate this, a modified SPE5 (Killough and Kossack 1987) model with various grid dimensions was run to generate comparative results. Parallel performances for three field compositional models of various sizes and dimensions are included to further elucidate and contrast CPU-GPGPU single-node and multiple-node performances. A PC cluster with the Tesla M2070Q GPGPU and the 6-core Xeon X5675 Westmere was used to produce the majority of the reported results. Another PC cluster with the Tesla M2090Q GPGPU was available for some cases, and the results are reported for the modified SPE5 (Killough and Kossack 1987) problems for comparison.
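The memory argument above (fewer subdomains per node means less halo-cell storage) can be made concrete with a small counting exercise. The sketch below is only an illustration and is not taken from the paper: it assumes a structured 120 x 120 x 120 grid (one way to obtain 1.728 million cells; the abstract gives only the total), equal-sized block subdomains, a one-cell-wide halo on each face shared with a neighbor, and a hypothetical rank layout for a dual 6-core node. The function name `halo_cells` and the chosen partitions are assumptions made for this example.

```python
from itertools import product


def halo_cells(nx, ny, nz, px, py, pz, width=1):
    """Total halo (ghost) cells summed over all blocks of a px*py*pz partition.

    Assumptions (illustrative only): structured nx*ny*nz grid, equal-sized
    blocks, and a halo of `width` cells on every block face that is shared
    with a neighboring block (face neighbors only).
    """
    if nx % px or ny % py or nz % pz:
        raise ValueError("grid dimensions must divide evenly among subdomains")
    bx, by, bz = nx // px, ny // py, nz // pz  # block size in each direction
    total = 0
    for i, j, k in product(range(px), range(py), range(pz)):
        # Number of faces of this block that touch a neighbor (0, 1, or 2).
        fx = (i > 0) + (i < px - 1)
        fy = (j > 0) + (j < py - 1)
        fz = (k > 0) + (k < pz - 1)
        total += width * (fx * by * bz + fy * bx * bz + fz * bx * by)
    return total


if __name__ == "__main__":
    n = 120  # assumed cubic grid: 120**3 = 1,728,000 cells
    layouts = {
        "OMP-only (1 subdomain, 12 threads)": (1, 1, 1),
        "MPI-OMP (2 ranks x 6 threads)": (1, 1, 2),
        "MPI-only (12 ranks)": (2, 2, 3),
    }
    for name, (px, py, pz) in layouts.items():
        h = halo_cells(n, n, n, px, py, pz)
        print(f"{name}: {h} halo cells ({100.0 * h / n**3:.2f}% of the grid)")
```

Under these assumptions the single shared-memory subdomain needs no halo cells, the two-rank hybrid layout needs 28,800, and the twelve-rank MPI-only layout needs 115,200, about four times as many. The same count grows as a model is cut into smaller subdomains, which is the surface-to-volume effect behind the memory comparison in the abstract; actual overheads depend on the grid, the stencil, and the partitioner used.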
|File Size||948 KB||Number of Pages||10|
Appleyard, J.R., Appleyard, J.D., Wakefield, M.A. et al. 2011. Accelerating Reservoir Simulators Using GPU Technology. Paper SPE 141402 presented at the 2011 SPE Reservoir Simulation Symposium, Woodlands, Texas, 21–23 February. http://dx.doi.org/10.2118/141402-MS.
Christie, M.A. and Blunt, M.J. 2001. Tenth SPE Comparative Solution Project: A Comparison of Upscaling Techniques. Presented at the SPE Reservoir Simulation Symposium, Houston, 11–14 February. SPE-66599-MS. http://dx.doi.org/10.2118/66599-MS.
Feldman, M. 2012. Researchers Squeeze GPU Performance from 11 Big Science Apps. HPCwire (18 July 2012). http://archive.hpcwire.com/hpcwire/2012-07-18/researchers_squeeze_gpu_performance_from_11_big_science_apps.html.
Fung, L.S.K. and Dogru, A.H. 2008a. Parallel Unstructured-Solver Methods for Simulation of Complex Giant Reservoirs. SPE J. 13 (4): 440–446. http://dx.doi.org/10.2118/106237-PA.
Fung, L.S.K. and Dogru, A.H. 2008b. Distributed Unstructured Grid Infrastructure for Complex Reservoir Simulation. Paper SPE 113906 presented at the SPE Europec/EAGE Annual Conference and Exhibition, Rome, Italy, 9–12 June. http://dx.doi.org/10.2118/113906-MS.
Fung, L.S.K. and Mezghani, M.M. 2013. Machine, Computer Program Product and Method to Carry Out Parallel Reservoir Simulation. US Patent 8,433,551.
Killough, J.E. and Kossack, C.A. 1987. Fifth Comparative Solution Project: Evaluation of Miscible Flood Simulators. Presented at the SPE Symposium on Reservoir Simulation, San Antonio, Texas, 1–4 February. SPE-16000-MS. http://dx.doi.org/10.2118/16000-MS.
Klie, H., Sudan, H., Li, R. et al. 2011. Exploiting Capabilities of Many Core Platforms in Reservoir Simulation. Paper SPE 141265 presented at the 2011 SPE Reservoir Simulation Symposium, Woodlands, Texas, 21–23 February. http://dx.doi.org/10.2118/141265-MS.
MPI: A Message-Passing Interface Standard. 1995. Message Passing Interface Forum, http://www.mpi-forum.org (June 12).
Network-Based Computing Laboratory. 2008. MVAPICH2 1.2 User Guide. Columbus, Ohio: Ohio State University. http://www.compsci.wm.edu/SciClone/documentation/software/communication/MVAPICH2-1.2/mvapich2-1.2rc2_user_guide.pdf.
NVIDIA. 2012a. CUDA C Best Practices Guide, Version 5.0, October.
NVIDIA. 2012b. CUDA C Programming Guide, Version 5.0, October.
NVIDIA. 2013. GPUDirect Technology, CUDA Toolkit, Version 5.5. (19 July 2013). https://developer.nvidia.com/gpudirect.
OpenACC. 2013. The OpenACC Application Programming Interface. OpenACC Standard Organization, Version 2.0, June. http://www.openaccstandard.org.
OpenMP Application Program Interface. 2011. Version 3.1, July, http://www.openmp.org.
Saad, Y. and Schultz, M.H. 1986. GMRES: A Generalized Minimal Residual Algorithm for Solving Nonsymmetric Linear Systems. SIAM J. Sci. Stat. Comput. 7 (3): 856–869.
The Portland Group. 2011. CUDA FORTRAN Programming Guide and Reference, Release 2011, Version 11.8, August.
Vinsome, P.K.W. 1976. Orthomin, An Iterative Method for Solving Sparse Sets of Simultaneous Linear Equations. Paper SPE 5729 presented at the 4th SPE Symposium of Numerical Simulation for Reservoir Performance, Los Angeles, California, 19–20 February. http://dx.doi.org/10.2118/5729-MS.
Vuduc, R., Chandramowlishwaran, A., Choi, J. et al. 2010. On the Limits of GPU Acceleration. In Proceedings of the 2010 USENIX Workshop on Hot Topics in Parallelism (HotPar), Berkeley, California, June.
Wallis, J.R., Kendall, R.P., and Little, T.E. 1985. Constrained Residual Acceleration of Conjugate Residual Methods. Paper SPE 13563 presented at the 8th SPE Reservoir Simulation Symposium, Dallas, 10–13 February. http://dx.doi.org/10.2118/13563-MS.
Zhou, Y. and Tchelepi, H.A. 2013. Multi-GPU Parallelization of Nested Factorization for Solving Large Linear Systems. Paper SPE 163588 presented at the 2013 SPE Reservoir Simulation Symposium, Woodlands, Texas, 18–20 February. http://dx.doi.org/10.2118/163588-MS.