Research on GPU parallelization of an implicit discontinuous Galerkin finite element algorithm

GPU-parallelized implicit discontinuous Galerkin finite element algorithm

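  • Summary: To improve the computational efficiency of the discontinuous Galerkin (DG) finite element method, an implicit DG algorithm accelerated in parallel on graphics processing units (GPUs) is constructed for solving the Euler equations. The algorithm uses the Roe scheme for spatial discretization, an artificial viscosity method to handle discontinuities such as shock waves, and the lower-upper symmetric Gauss-Seidel (LU-SGS) implicit scheme for time marching. To overcome the data dependency inherent in traditional implicit schemes, the LU-SGS scheme is first reformulated for parallel execution using an element coloring and grouping technique for arbitrary meshes proposed in this paper: implicit time marching proceeds through the color groups one after another, and because no data dependency remains within a single color group, the elements of each group can be processed in parallel. On this basis, and exploiting features of the DG method such as its local compactness, element-based kernel functions and the corresponding thread and data structures are designed under the compute unified device architecture (CUDA) programming model, yielding a GPU-parallel implicit DG finite element algorithm. Finally, the algorithm is validated and performance-tested on several typical two- and three-dimensional flow cases, which demonstrate the GPU acceleration of the implicit algorithm; the computed results are close to existing literature or experimental data.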

     

    Abstract: To improve the computational efficiency of the discontinuous Galerkin (DG) finite element method, a graphics processing unit (GPU) parallelized implicit DG algorithm is developed for solving the Euler equations with additional artificial viscosity terms. The classic Roe scheme is adopted for the numerical flux in the spatial discretization, and the implicit lower-upper symmetric Gauss-Seidel (LU-SGS) scheme is selected for time marching. The traditional LU-SGS algorithm has an inherent data dependency that, when executed in parallel, leads to race conditions between threads and destabilizes the computation. To resolve it, a coloring method for arbitrary meshes is presented that organizes the computational elements into color groups by assigning different colors to neighboring elements. The algebraic operations on elements of the same color group are independent of one another and can therefore be parallelized easily. Based on this coloring technique, the traditional LU-SGS algorithm is modified so that the calculations are performed color by color, in parallel within each color. Taking advantage of the local compactness of the DG finite element method, a GPU-parallelized implicit DG algorithm based on the modified LU-SGS algorithm is then implemented under the compute unified device architecture (CUDA) programming model. The time marching procedure, which is the most time-consuming part of the algorithm, is executed on the GPU. The computational work is split into a set of small tasks, and element-based kernels are designed for these tasks with corresponding thread hierarchies and data structures. The resulting algorithm is verified on a set of typical two- and three-dimensional flow test cases and assessed by performance analysis, which shows that the implicit algorithm is accelerated on the GPU and that the obtained solutions agree well with experimental data or other computed results reported in the literature.
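
    The coloring idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the greedy coloring rule, the function names (`color_elements`, `group_by_color`, `colored_sweep`), and the scalar Gauss-Seidel update standing in for the block LU-SGS sweep are all assumptions made for illustration. The point it shows is that, once neighboring elements carry different colors, every update within one color group reads only values owned by other colors, so each element of the group could be handled by its own GPU thread.

```python
from collections import defaultdict


def color_elements(neighbors):
    """Greedy coloring (an assumed strategy, not necessarily the paper's).

    neighbors[i] lists the elements that share a face with element i.
    Each element receives the smallest color not used by its neighbors.
    """
    colors = [-1] * len(neighbors)
    for elem, adjacent in enumerate(neighbors):
        used = {colors[n] for n in adjacent if colors[n] >= 0}
        c = 0
        while c in used:
            c += 1
        colors[elem] = c
    return colors


def group_by_color(colors):
    """Return the element indices of each color, ordered by color."""
    groups = defaultdict(list)
    for elem, c in enumerate(colors):
        groups[c].append(elem)
    return [groups[c] for c in sorted(groups)]


def colored_sweep(diag, offdiag, rhs, x, neighbors, groups):
    """One forward Gauss-Seidel sweep performed color by color.

    Scalar stand-in for the block LU-SGS sweep: within a group, no
    element reads a value written by another element of that group,
    so the inner loop is safe to run in parallel.
    """
    for group in groups:
        updated = {}
        for i in group:  # independent iterations: one thread per element
            coupling = sum(offdiag[(i, j)] * x[j] for j in neighbors[i])
            updated[i] = (rhs[i] - coupling) / diag[i]
        for i, value in updated.items():
            x[i] = value
    return x


# Hypothetical 2 x 2 mesh of quadrilaterals, numbered row by row.
neighbors = [[1, 2], [0, 3], [0, 3], [1, 2]]
colors = color_elements(neighbors)
groups = group_by_color(colors)

diag = [4.0] * 4
offdiag = {(i, j): -1.0 for i, adj in enumerate(neighbors) for j in adj}
rhs = [1.0] * 4
x = colored_sweep(diag, offdiag, rhs, [0.0] * 4, neighbors, groups)
print(colors, groups, x)
```

    On this small mesh the sketch needs two colors, which form a checkerboard pattern. An unstructured mesh generally needs more colors, and the number of elements per color determines how much parallel work each sweep step exposes.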

     
