Parallel acceleration methods for turbulent combustion numerical simulation on DCU-based heterogeneous architectures

王栋志; 陈坚强; 李彬; 韩熙; 张威龙; 张红卫; 刘丽丽

doi:10.7638/kqdlxxb-2026.0026


Abstract

Turbulent combustion in aero-engine combustors is multi-scale, strongly coupled, and highly nonlinear. Its high-fidelity numerical simulation depends not only on high-accuracy discretization but also places heavy demands on heterogeneous parallel computing architectures. This study designs and implements a high-fidelity turbulent combustion simulation framework for domestic heterogeneous platforms. Built on a node-centered finite volume discretization, the framework fully couples and deeply optimizes the incompressible SIMPLE algorithm, the SST turbulence model, and the steady flamelet combustion model. To match the computational characteristics of each stage of the simulation, an adaptive parallelization scheme is constructed that covers complete loop unrolling, grouped loop unrolling, and parallel reduction, enabling efficient execution of multiphysics workloads on a heterogeneous CPU-DCU (deep computing unit) many-core architecture. To address the communication bandwidth bottleneck, an optimization strategy based on minimizing data communication is proposed: node reordering, keeping data resident on the DCU, communication-computation overlap, and asynchronous Gauss-Seidel iteration together reduce communication overhead by 29.5% on average. For performance and architecture co-tuning, the effect of thread-block size on computational efficiency on the domestic Hygon DCU is characterized. A coloring-based grouping strategy achieves up to a 41-fold speedup over atomic operations, and a single DCU card ultimately reaches a peak computational throughput 16 times that of a single CPU core, providing a powerful computational tool for numerical simulation of aero-engine combustors.
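The coloring-based grouping idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes an edge-based flux scatter of the kind common in node-centered finite volume codes, uses a simple greedy coloring, and runs in plain Python as a stand-in for DCU kernels. The names `color_edges`, `group_by_color`, and `scatter_fluxes` are hypothetical. Edges in one color group share no node, so their updates never write to the same location and need no atomic operations. Groups are processed one after another.

```python
from collections import defaultdict

def color_edges(edges):
    """Greedy edge coloring: edges with the same color share no node."""
    used = defaultdict(set)  # node -> colors already taken by incident edges
    colors = []
    for a, b in edges:
        c = 0
        while c in used[a] or c in used[b]:
            c += 1
        colors.append(c)
        used[a].add(c)
        used[b].add(c)
    return colors

def group_by_color(edges, colors):
    """Collect edge indices into one list per color, ordered by color."""
    groups = defaultdict(list)
    for idx, c in enumerate(colors):
        groups[c].append(idx)
    return [groups[c] for c in sorted(groups)]

def scatter_fluxes(n_nodes, edges, flux, groups):
    """Accumulate edge fluxes into node residuals, one color group at a time."""
    residual = [0.0] * n_nodes
    for group in groups:   # groups run sequentially (one kernel launch each)
        for e in group:    # iterations in a group are conflict-free, so on a
            a, b = edges[e]        # device each could be its own thread
            residual[a] += flux[e]
            residual[b] -= flux[e]
    return residual

# Assumed toy mesh: a quadrilateral with one diagonal.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
flux = [1.0, 2.0, 3.0, 4.0, 5.0]
colors = color_edges(edges)
groups = group_by_color(edges, colors)
residual = scatter_fluxes(4, edges, flux, groups)
```

The trade-off in this sketch is more, smaller passes in exchange for conflict-free writes. Whether that beats atomic updates depends on the hardware and the mesh, which is the comparison the abstract reports for the Hygon DCU.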
