Port AMSS-NCKU code to GPU
  Zhoujian Cao
  Academy of Mathematics and System Science, CAS
  In collaboration with Zhihui Du, Steven Brandt, Frank Loeffler and Quan Yang
  2013-8-7, 2013 International School on Numerical Relativity and Gravitational Waves, Pohang, Korea

Outline
  Motivations from gravitational wave detection
  New parallel mesh refinement numerical scheme
  GPU acceleration for NR
  Summary

The most stringent test of GR
  Anomalous precession of the perihelion of Mercury (1915, v )
  Deflection of starlight (1919, v )
  Gravitational redshift (1965, v )
  Gravitational time delay effect (1968, v )
  Evidence of gravitational waves (1978, v )
  Frame-dragging effect (2010, v )
  Direct gravitational wave detection (?, v1)
  GR = Newtonian Gravity + PN(v) + PN(v^2) + ...

Gravitational wave astronomy
  Look back to the extremely early universe
  Hear the dark universe

Gravitational wave and its detection

Category of Black Holes
  Supermassive black hole: M: 10^5-10^9 Msun
  Stellar mass black hole: M: 1-10s Msun
  Intermediate mass black hole: M: 10s-10^5 Msun (mainly in globular clusters)
  Farrell, et al, Nature 460 (2009) 73; Feng, et al, New Astronomy Reviews 55 (2011) 166

Category of Black Hole Binaries
  IMBH
  ALIA, Xuefei Gong, et al, CQG 28, 094012 (2011)
  1:1000, 1:1
  Advanced LIGO, Abadie, et al, PRD 85, 102004 (2012)

IMBH and GW detection

Data analysis and template
  Ref. to Sang Hoon Oh's lecture

Template model for BBH
  ?
  Yi Pan's talk, 2013

Template model for BBH
  PN templates: for the early stage of the inspiral
  EOBNR (effective one body model together with numerical relativity): for the full inspiral + merger + ringdown stage; works well for mass ratio less than 1:8 and extreme mass ratio BBH, high spin, precession!
  But no reliable template for mass ratio 1:10 to 1:100

  From a given separation of the two BHs, when the mass ratio increases
  the number of orbits increases quickly. The cost of a numerical simulation with full GR therefore increases accordingly. Compared with 1:1, 1:100 needs 10 times more computational cost.
  PN estimation

Computational cost
  1:1, 9 days
  1:100, 20 days
  LSSC cluster II, 128 CPUs, for the last 2 orbits
  Computational cost 1 to 20!

Challenge of large mass ratio BBH to NR
  Compared to 1:1, the computational cost of a 1:100 BBH increases roughly 200 times!
  A typical simulation of a 1:1 BBH needs 14 days. So applying the straightforward method to 1:100 needs roughly 1 year!

Possible ways out
  1. Physical level: approximation methods, such as the self-force framework (but still only first order so far)
  2. Numerical algorithm level: implicit scheme, R. Lau et al, PRD 84, 084023 (2011); combine Cauchy evolution with null evolution
  3. Computer level: improve scalability to use more CPUs, use GPU, ...

Mesh refinement scheme
  High resolution mesh grids for the region near the BH, low resolution mesh grids for the far region

Mesh refinement in CFD
  Result based on PARAMESH
  PARAMESH, GrACE, JASMIN

Comparison of NR and CFD
  NR (only for BH): computation on a single grid point is expensive, but the functions are quite smooth; few grid points (hundreds), high order finite difference
  CFD: computation on a single point is cheap, but fluid dynamics is quite complex (compare the lectures on HD); the number of grid points is quite large (millions)

Mesh refinement scheme
  Scheme adopted by PARAMESH
  [Figure: Level 0 and Level 1 in the t-x plane]

Mesh refinement scheme
  Scheme for NR
  [Figure: Level 0 and Level 1]
  Distribute data along one level to the available processes

Mesh refinement scheme
  Scheme for NR: LS scheme
  F. Loeffler et al, CQG 29, 115001 (2012)
  [Figure: Level 0 and Level 1]

Mesh refinement scheme
  Parallelization limit: 200x200x200, 6th order finite difference (8 ghost points for two sides) processes
  How about distributing data on all levels and calculating them in parallel?

Parallel mesh level algorithm
  PX scheme: distribute data on all levels to all processes; calculate in parallel

Mesh refinement scheme
  Procs for lev0    Procs for lev1    Procs for lev2
  run               run               run
  wait              wait              run
  wait              run               run
  wait              wait              run
  run               run               run
  (rows are ordered in time)
  Strong scaling property due to more data to distribute
  Resource wasting (Lx procs of LS) due to waiting!
  Calculation speed: 2 times faster!

Parallel mesh level algorithm
  P2 scheme: distribute data on the finest level to half of the processes
  and distribute data on the other levels, one level at a time, to the other half of the processes; calculate in parallel between the finest level and the other levels, but sequentially among the other levels
  [Figure: lev0, lev2, lev1]

Mesh refinement scheme
  Procs for lower levels    Procs for lev2
  lev1                      run
  lev0                      run
  lev1                      run
  wait                      run
  lev1                      run
  (rows are ordered in time)
  Scaling property is weaker than PX
  Less waiting (2x procs LS)!
  Calculation speed: 2 times faster!

Comparison to LS scheme

More complicated case
  [Figure: lev0, lev1, lev2 in the t-x plane]
  Now, procs for the finest level have to wait!

GPU acceleration
  For system biology: Yamazaki, Igarashi, Neural Networks, 2013
  For GW data analysis: Zhihui Du, et al, CQG 29, 235018 (2012)

Put RHS calculation to GPU
  For the AMSS-NCKU code, the RHS calculation takes 80% of the time
  The RHS function involves so many variables that even passing their addresses is time consuming
  So pack these addresses and store them in constant memory (no further transfer during evolution); this saves shared memory at the same time

Put RHS calculation to GPU
  Keep the data on the GPU until MPI data transfer between different processes is needed
  Use the buffer point method to reduce MPI transfers for RK4 from 4 times to only 1 time; this also reduces the number of data transfers between GPU and CPU

Put RHS calculation to GPU
  Arrange shared memory
  Divide the RHS calculation into 8 parts, so that the memory requirement of each part fits in shared memory
  For one RHS calculation, copy data from global memory to shared memory once and use shared memory most of the time

Put restrict-prolong to GPU
  After putting the RHS on the GPU, the most time consuming part is the Restrict-Prolong interpolation
  How to treat this part? The work is ongoing

Test of GPU acceleration on desktop

OpenMP implementation
  AMSS-NCKU = Fortran90 + C++
  C++ is used for program flow control and memory administration
  Fortran90 is used for the main numerical calculation
  Add OpenMP directives in the Fortran90 segments

Structure of AMSS-NCKU GPU code
  Two groups of MPI processes, one for CPU and one for GPU
  MPI + OpenMP + CUDA

Test of AMSS-NCKU GPU code
  Titan: top 1 supercomputer in the world (now Tianhe 2)
  1024x16 cores + 1024 GPUs

Summary
  Challenge from GW detection: AdvLIGO 1:150, ALIA 1:1000
  Parallel mesh level calculation method: 2x speed up
  GPU implementation to NR: have got roughly 5x speed up; 30x speed up? In progress
  10x in all is ready for science simulation
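
Note on the PX scheme: the run/wait pattern listed for the PX scheme follows a simple rule if one assumes a refinement factor of 2 in time between neighbouring levels, which the slides do not state explicitly. The Python sketch below works under that assumption; the names `px_schedule` and `idle_fraction` are illustrative and not part of AMSS-NCKU. It generates the pattern and counts how much of the allocation waits.

```python
def px_schedule(num_levels, fine_steps):
    """Return one row per fine time step.

    row[l] is "run" or "wait" for the process group that owns level l,
    where level 0 is the coarsest level.
    """
    finest = num_levels - 1
    rows = []
    for step in range(fine_steps):
        row = []
        for level in range(num_levels):
            # Assumption: time refinement factor 2, so level l starts a new
            # step once every 2**(finest - l) steps of the finest level.
            period = 2 ** (finest - level)
            row.append("run" if step % period == 0 else "wait")
        rows.append(row)
    return rows


def idle_fraction(rows):
    """Fraction of (process group, time slot) cells spent waiting."""
    cells = [cell for row in rows for cell in row]
    return cells.count("wait") / len(cells)


if __name__ == "__main__":
    # Three levels and five fine steps, as in the PX table.
    for row in px_schedule(num_levels=3, fine_steps=5):
        print("  ".join(f"{cell:<4}" for cell in row))
    # One full coarse step corresponds to four fine steps.
    print("idle fraction per coarse step:", idle_fraction(px_schedule(3, 4)))
```

In this toy model the idle share grows with the number of levels, which is one way to read the "resource wasting due to waiting" remark; it ignores unequal grid sizes per level and communication time.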
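
Note on the buffer point method: the slides do not spell it out. One common reading, assumed here, is that each process receives a halo wide enough for all four RK4 substeps (substeps x stencil half-width points per side), so a single exchange per full step replaces one exchange per substep. The Python sketch below uses a toy integer stencil in place of the real RHS and four plain stencil applications in place of RK4; all function names are hypothetical.

```python
def apply_stencil(u, half_width=1):
    """Apply a centred second-difference stencil.

    The result is shorter by half_width points on each side, because the
    outermost points lack the neighbours the stencil needs.
    """
    h = half_width
    return [u[i - h] - 2 * u[i] + u[i + h] for i in range(h, len(u) - h)]


def advance(u, substeps, half_width=1):
    """Apply the stencil `substeps` times with no data exchange in between."""
    for _ in range(substeps):
        u = apply_stencil(u, half_width)
    return u


def advance_block_with_buffer(global_u, start, stop, substeps, half_width=1):
    """Advance the block [start, stop) owned by one process.

    The slice below stands in for the single halo exchange: it fetches
    substeps * half_width buffer points on each side, after which all
    substeps run on local data only.
    """
    buf = substeps * half_width
    if start - buf < 0 or stop + buf > len(global_u):
        raise ValueError("block is too close to the domain edge for this buffer")
    local = global_u[start - buf:stop + buf]
    return advance(local, substeps, half_width)


if __name__ == "__main__":
    substeps = 4  # stand-in for the four RK4 stages
    data = [(i * i) % 7 for i in range(40)]
    # Reference: whole domain on one process; output index j maps to
    # global index j + substeps (half_width is 1 here).
    reference = advance(data, substeps)
    block = advance_block_with_buffer(data, 10, 20, substeps)
    print(block == reference[10 - substeps:20 - substeps])
```

The trade-off in this reading is extra redundant computation in the buffer region in exchange for fewer, larger messages, which also means fewer GPU-to-CPU copies per step.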