Sidd Gosh's notes on his attempt to run CAM on NCAR's BlueGene in Fall 2005.
Part of the CCSM on BlueGene Project.
Eulerian
- T31 and T42 run almost without any changes; I think I only had to switch off the system call stuff, which are minor utilities for this model anyway. These work just fine even in 'virtual node' mode.
- T85 fails while trying to write the restart files at the end of integration. I had to change the restart module to use MPI-IO, and it worked with that even in 'vn' (virtual node) mode.
As you know, the maximum number of MPI tasks that can be created in the Eulerian
core is equal to the number of 'lat's, so it is 48 (T31), 64 (T42)
and 128 (T85).
FV
- Mostly it was the 1x1.25 resolution. In this case we had problems with restart once again, but only while trying in 'vn' mode; 'co' worked just fine. With the constraint of 3 lat/vert points per domain and # of yz domains == # of xy domains, we could run up to:
npr_yz = 60, 8, 8, 60, i.e. 480 MPI tasks.
- I tried the pnetcdf-enabled CAM as given by Yu-Heng, and that too worked fine. In both of these cases I just had to turn off those system calls.
- Since we can't go beyond 480 at this resolution, we tried a hack: there is a heavy OpenMP loop which we decomposed using MPI in a separate communicator, and that way we could run up to 960 nodes, with a gain in speed w.r.t. 480 of about 10%!
- pilgrim has a huge static allocation which exhausts memory in the 0.5x0.625 case. (since fixed - RJ)
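
A rough check of the 480-task limit quoted above for the 1x1.25 FV case, as a small Python sketch. The grid sizes (181 latitudes, 26 vertical levels) are my assumption for this resolution and are not stated in these notes, and the function name max_fv_tasks is made up for illustration. It only redoes the arithmetic behind the "3 lat/vert points per domain" constraint; it is not how CAM itself picks the decomposition.

```python
# Sketch only: the grid sizes are assumed, not taken from the notes.
NLAT = 181      # assumed number of latitudes on the 1x1.25 FV grid
NLEV = 26       # assumed number of vertical levels
MIN_POINTS = 3  # from the notes: at least 3 lat/vert points per domain


def max_fv_tasks(nlat, nlev, min_points=MIN_POINTS):
    """Return (lat_domains, vert_domains, tasks) for the largest yz decomposition."""
    lat_domains = nlat // min_points    # most latitude strips that still hold 3 rows
    vert_domains = nlev // min_points   # most vertical slabs that still hold 3 levels
    return lat_domains, vert_domains, lat_domains * vert_domains


lat_domains, vert_domains, tasks = max_fv_tasks(NLAT, NLEV)
# The xy decomposition mirrors the yz one so both have the same number of domains.
print(f"npr_yz = {lat_domains}, {vert_domains}, {vert_domains}, {lat_domains}"
      f" -> {tasks} MPI tasks")
```

With the assumed sizes this gives 60 latitude domains and 8 vertical domains, i.e. the 480 tasks noted above.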