ABOUT
We looked at three different I/O benchmarks on two systems, a Cray XT5m and a commodity Linux cluster.
GENERAL BENCHMARK OVERVIEW
We benchmarked the two systems using PNetCDF, PIOVDC, and IOR. Parallel netCDF (PnetCDF) is a library providing high-performance parallel I/O while maintaining file-format compatibility with Unidata's NetCDF. PIOVDC is an extension to NCAR's Parallel IO (PIO) library, used by NCAR and various other organizations to write massive data sets in an optimized, parallel manner. The IOR HPC Benchmark is a package for benchmarking parallel file systems through the POSIX, MPIIO, or HDF5 interfaces.
General Procedure
To gather the benchmark data, we submit everything in a single PBS job script. All benchmarks are wrapped in a loop that allows the IOR benchmark to execute with both the MPIIO and POSIX interfaces. Within this loop IOR, the PNetCDF benchmark, and the PIOVDC benchmark are run, which results in twice as many PNetCDF and PIOVDC runs as runs of each individual IOR interface. The reason for doing this is to run all benchmarks under similar system-wide conditions: if the system is bogged down, all runs should be similarly affected.
We use two interfaces for IOR, POSIX and MPIIO. The POSIX interface is included with all IOR gmake options, so to get both POSIX and MPIIO you need to build with the mpiio option, 'gmake mpiio'. IOR is run with a transfer (xfer) size of 128k and a block size of 128m. You can of course change the block size and transfer size as you please. One reason for choosing 128k and 128m is that the file produced is relatively small (4GB) and the bandwidth was about the same as with 256k and 256m (a 16GB file). The general run order inside the script is listed below.
for iotype in POSIX MPIIO
do
    mpirun ./IOR -A 001 -a ${iotype} -w -o $TESTDIR/testfile -Q 10 -e -s 1 -t 4096k -b 128m -d 0.5 > IOR-4096k-128m-${iotype}.out   # This is the IOR test
    mpirun ./piovdc/pio/pio/TESTLIB > PIOVDC_TEST-4096k-128m-${iotype}.out   # This is the PIOVDC test
    mpirun ./ptest > PTEST-4096k-128m-${iotype}.out   # This is the PNetCDF test
done
What is Being Calculated
IOR calculates performance by taking a START time stamp, after which all participating tasks open a file, transfer data, and close the file, and then a STOP time stamp is taken. IOR checks the result by calling either stat() or MPI_File_get_size() on the file and comparing that against the aggregate amount of data transferred. The elapsed time is then used together with the aggregate amount of data transferred to calculate the rate in MiB/sec. Note that in our IOR runs we are not using the -r (read) parameter; we are only measuring the time it takes to write data to a file.
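As a rough sanity check on these numbers, writing an aggregate of 4096 MiB (the roughly 4GB file mentioned above) in 20 seconds works out to about 205 MiB/sec, which is the ballpark of the Lynx IOR rates reported in the results below.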
The PNetCDF benchmark is timed in a similar way. We take a START time using MPI_Wtime(); then all participating tasks open a file, make the needed define calls for the dimensions, write out the data, and close the file; then a STOP time is taken, again using MPI_Wtime(). Each process reports its own aggregate time as well as a few other times, such as its individual write time. In addition to reporting its own times, each process reports its times to a single process, which calculates the average run times.
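The sketch below illustrates this timing pattern using PnetCDF's C interface. It is not the actual benchmark source: the file name, dimension names, slab decomposition, and the omission of error checking are all simplifications for illustration.

/* Minimal sketch of the PNetCDF benchmark timing pattern (illustrative only).
 * Assumes a 1D slab decomposition of a 1024^3 float grid and that the
 * number of ranks divides 1024 evenly; names are hypothetical. */
#include <mpi.h>
#include <pnetcdf.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int rank, nprocs, ncid, dimids[3], varid;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const MPI_Offset NX = 1024, NY = 1024, NZ = 1024;
    MPI_Offset nx = NX / nprocs;                       /* slab owned by this rank */
    float *buf = (float *)malloc((size_t)(nx * NY * NZ) * sizeof(float));
    for (MPI_Offset i = 0; i < nx * NY * NZ; i++) buf[i] = 53.53f;

    double t0 = MPI_Wtime();                           /* START */

    ncmpi_create(MPI_COMM_WORLD, "ptest.nc", NC_CLOBBER | NC_64BIT_OFFSET,
                 MPI_INFO_NULL, &ncid);
    ncmpi_def_dim(ncid, "x", NX, &dimids[0]);          /* define the dimensions */
    ncmpi_def_dim(ncid, "y", NY, &dimids[1]);
    ncmpi_def_dim(ncid, "z", NZ, &dimids[2]);
    ncmpi_def_var(ncid, "data", NC_FLOAT, 3, dimids, &varid);
    ncmpi_enddef(ncid);

    MPI_Offset start[3] = {rank * nx, 0, 0};
    MPI_Offset count[3] = {nx, NY, NZ};
    ncmpi_put_vara_float_all(ncid, varid, start, count, buf);  /* collective write */
    ncmpi_close(ncid);

    double elapsed = MPI_Wtime() - t0;                 /* STOP */

    /* Each rank reports its own time; rank 0 also computes the average. */
    double sum = 0.0;
    MPI_Reduce(&elapsed, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    printf("rank %d: %.2f sec\n", rank, elapsed);
    if (rank == 0) printf("average: %.2f sec\n", sum / nprocs);

    free(buf);
    MPI_Finalize();
    return 0;
}

A sketch like this would be compiled with the system's MPI compiler wrapper and linked against -lpnetcdf, analogous to the ptest build commands shown later.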
The PIOVDC benchmark times three main things: the time it takes to write the uncompressed data in the normal CDF format, which is similar to our PNetCDF run; the time it takes to write the data in VDF format when compression is applied, using libraries built from the VAPOR package; and the total time it takes to perform the compression. This last time is used to calculate the equivalent I/O time of the compressed data so it can be compared with the uncompressed data. The same method as above is used to measure the times: we take a start time with MPI_Wtime(), perform the various calls to open the files for writing, write, close the files, and make a final call to MPI_Wtime(). There is also a build option that produces fine-grained timing data; this is mentioned below in the individual system build sections.
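To make the compressed/uncompressed comparison concrete, the short sketch below shows how the reported rates follow from the measured times, using the definitions given in the results section (compression time excluded for the I/O rate, included for the aggregate rate). The numeric values are hypothetical placeholders, not measurements.

/* Illustrative only: deriving the PIOVDC rates from the measured times.
 * The values below are hypothetical placeholders. */
#include <stdio.h>

int main(void) {
    double data_mib      = 4096.0;  /* uncompressed problem size in MiB (assumed basis for both rates) */
    double total_time    = 48.0;    /* open + compress + write + close, in seconds */
    double compress_time = 10.0;    /* time spent in the wavelet transform, in seconds */

    double io_rate        = data_mib / (total_time - compress_time);  /* "PIOVDC (I/O)" */
    double aggregate_rate = data_mib / total_time;                    /* "PIOVDC aggregate" */

    printf("PIOVDC (I/O):     %.2f MiB/sec\n", io_rate);
    printf("PIOVDC aggregate: %.2f MiB/sec\n", aggregate_rate);
    return 0;
}

Because the aggregate rate includes the compression time in its denominator, it is always somewhat lower than the I/O-only rate in the tables below.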
SYSTEM SPECIFICS
Lynx
Lynx is a single-cabinet Cray XT5m supercomputer. It comprises seventy-six (76) compute nodes, each with twelve (12) processor cores (two (2) hex-core 2.2 GHz AMD Opterons) and 16GB of memory, plus ten (10) I/O nodes, each with a single dual-core 2.6 GHz AMD Opteron chip. The I/O nodes break down as follows:
Two (2) login nodes (each with dual-port gigabit Ethernet adapters) for interactive user login sessions.
Four (4) nodes reserved for system functions (each equipped with a dual-port Fibre-Channel adapter and dual-port gigabit Ethernet adapter), two (2) nodes for system management and two (2) nodes for managing the Cray’s Lustre filesystems (deployed on 32 terabytes of LSI RAID disk).
Four (4) nodes (each equipped with a single-port Myricom 10-gigabit Ethernet optical adapter) are used for external Lustre filesystem testing and external GPFS filesystem testing (a la GLADE) via Cray's DVS.
On Lynx all code was built and run using the Intel compiler suite. By default Lynx starts with the PGI Programming Environment; to switch to the Intel compiler suite, enter:
module remove PrgEnv-pgi
module load PrgEnv-intel
This module uses Intel version 11.1.059.
The version of IOR used is 2.10.2. To build it on a Cray XT5 system, you need to modify Makefile.config to set CC=cc. All of this information can be found in the USER_GUIDE in IOR's top-level directory.
The version of PNetCDF is 1.2.0, which can be found under /contrib/pnetcdf/1.2.0. As everything was built with the Intel compiler suite, the Intel build of PNetCDF was used, '/contrib/pnetcdf/1.2.0/intel/'.
To build the PNetCDF benchmark, the following command was used:
CC parallel_test.cpp -o ptest -I/contrib/pnetcdf/1.2.0/intel/include -L/contrib/pnetcdf/1.2.0/intel/lib -lpnetcdf
For PIOVDC it is recommended to first look at the PIOVDC wiki, as it provides useful information on where to get the code and how to build it against the VAPOR project. The current version of VAPOR is 2.1.0, and the version of PIOVDC used is 1.4, from the branch currently located at:
svn co https://parallelio.googlecode.com/svn/branches/pio1_4_0_vdc
Below are the settings used to build the code on Lynx; my results are based on this build.
Inside the vapor/lib directory (~/piovdc/vapor/vapor/lib):
./configure --with-pnetcdf=/contrib/pnetcdf/1.2.0/intel --with-expat=/usr --enable-debug
make clean && make
Inside the pio sub-directory (~/piovdc/pio/pio):
make clean
cc -c topology.c
./configure --enable-pnetcdf=yes PNETCDF_PATH=/contrib/pnetcdf/1.2.0/intel --enable-netcdf=no --enable-compression=yes
make
ftn -DDEBUG -c test_lib.F90
cp ~/piovdc/vapor/vapor/lib/common/libpiocommon.a .
cp ~/piovdc/vapor/vapor/lib/vdf/libpiovdc.a .
ftn test_lib.o -o TESTLIB -cxxlib -L. -L/contrib/pnetcdf/1.2.0/intel/lib -lpio -lpiovdc -lpiocommon -lpnetcdf -lexpat
cd
For the PIOVDC and PNetCDF benchmarks a problem size of 1024^3 was used (a 1024^3 grid of 4-byte floats is 4 GiB of data). As these benchmarks measure write time, the grid was filled with a floating point value of 53.53 before any timing measurements were taken. Due to memory issues, larger sizes were not tested. There were multiple memory issues on this test system, and these caused some initial problems in the run script. For the PNetCDF code two MPICH environment variables need to be changed: MPICH_UNEX_BUFFERSIZE=629145600 and MPICH_MAX_SHORT_MSG_SIZE=12800. Without these changes you will run into out-of-memory (OOM) errors; these errors were encountered even on smaller problem sizes such as 256^3. However, if you set these MPICH variables too early in the script, they cause issues for the PIOVDC benchmark. The benchmarks were successfully run with an mppwidth of 64 and an mppnppn of 8, that is, 64 processes across 8 nodes.
It is important to note that on Lynx, anytime you make calls to the MPI-IO collective write MPI_FILE_WRITE_AT_ALL, you need to use /ptmp (Lustre).
Janus Cluster
Janus is a Dell Linux cluster with 1,368 nodes. Each node has two (2) hex-core 2.8 GHz Intel Westmere processors and 24GB of memory, and the nodes are connected by a fully non-blocking quad-data-rate (QDR) InfiniBand interconnect. The system uses a Lustre parallel file system.
On Janus all code was built using the Intel compiler suite version 12.0.0 along with OpenMPI version 1.4. To use these you need to load the appropriate dotkits.
use ICS
use OpenMPI-1.4-ICS
The same versions of IOR, PNetCDF, PIOVDC, and VAPOR were used on Janus as on Lynx.
To build the PNetCDF benchmark on Janus:
mpicc parallel_test.cpp -o ptest -I/curc/tools/free/redhat_5_x86_64/parallel-netcdf-1.2.0_ics-2011.0.013_openmpi-1.4.3_torque-2.5.8_ib/include -L/curc/tools/free/redhat_5_x86_64/parallel-netcdf-1.2.0_ics-2011.0.013_openmpi-1.4.3_torque-2.5.8_ib/lib -lpnetcdf
The paths used to configure things on Janus are a bit longer than on Lynx, but again, here are the commands I used to build PIOVDC.
Inside the vapor/lib directory (~/piovdc/vapor/vapor/lib):
./configure --with-pnetcdf=/curc/tools/free/redhat_5_x86_64/parallel-netcdf-1.2.0_ics-2011.0.013_openmpi-1.4.3_torque-2.5.8_ib --with-expat=/usr --enable-debug
make clean && make
Inside the pio sub-directory (~/piovdc/pio/pio):
make clean
cc -c topology.c
./configure --enable-pnetcdf=yes PNETCDF_PATH=/curc/tools/free/redhat_5_x86_64/parallel-netcdf-1.2.0_ics-2011.0.013_openmpi-1.4.3_torque-2.5.8_ib --enable-netcdf=no --enable-compression=yes
make
mpif90 -DDEBUG -c test_lib.F90
cp ~/piovdc/vapor/vapor/lib/common/libpiocommon.a .
cp ~/piovdc/vapor/vapor/lib/vdf/libpiovdc.a .
mpif90 test_lib.o -o TESTLIB -cxxlib -L. -L/curc/tools/free/redhat_5_x86_64/parallel-netcdf-1.2.0_ics-2011.0.013_openmpi-1.4.3_torque-2.5.8_ib/lib -lpio -lpiovdc -lpiocommon -lpnetcdf -lexpat
cd
The PIOVDC run on Janus used 256 I/O processes due to memory constraints with a 2048^3 grid (a 2048^3 grid of 4-byte floats is 32 GiB of data). The PNetCDF benchmark used the same 2048^3 problem size but could be run with far fewer processors; even though it had no memory issues, it was also run on 256 processes. As these benchmarks measure write time, the 2048^3 grid was filled with a floating point value of 53.53 before any timing measurements were taken.
These runs require a lot of memory. Using the output of qstat -f JOBID and looking at memory_usage.vmem for a given run (a run includes two of each benchmark) returns memory_usage.vmem = 395878072kb, roughly 400 GB of virtual memory across the job.
RESULTS
IOR-POSIX is IOR running with the POSIX interface.
IOR-MPIIO is IOR running with the MPIIO interface.
PNetCDF is a simple benchmark that writes an array of floats out to disk using the Parallel NetCDF file format.
PNetCDF from PIOVDC is the PIOVDC benchmark writing uncompressed data out to disk using the Parallel NetCDF file format.
PIOVDC (I/O) is the PIOVDC benchmark with a wavelet transform compression applied to the data. The time it takes to perform the compression is removed from the overall time.
PIOVDC aggregate is the PIOVDC benchmark with a wavelet transform compression applied to the data. This time includes the time it took to compress the data.
Lynx
| | IOR-POSIX | IOR-MPIIO | PNetCDF | PNetCDF from PIOVDC | PIOVDC (I/O) | PIOVDC aggregate |
|---|---|---|---|---|---|---|
| Maximum | 209.40 MiB/sec (20.51 sec) | 212.19 MiB/sec (20.24 sec) | 137.44 MiB/sec (31.24 sec) | 88.59 MiB/sec (48.48 sec) | 129.6 MiB/sec (33.14 sec) | 113.92 MiB/sec (37.70 sec) |
| Average | 206.67 MiB/sec (20.78 sec) | 211.48 MiB/sec (20.30 sec) | 125.55 MiB/sec (34.21 sec) | 68.27 MiB/sec (62.91 sec) | 97.59 MiB/sec (44.01 sec) | 87.77 MiB/sec (48.93 sec) |
| Minimum | 204.32 MiB/sec (21.02 sec) | 209.37 MiB/sec (20.51 sec) | 112.69 MiB/sec (38.11 sec) | 63.68 MiB/sec (67.44 sec) | 60.65 MiB/sec (70.81 sec) | 56.97 MiB/sec (75.39 sec) |
A second set of runs on Lynx returned results very similar to the first set. Note that the IOR numbers in the LynxIO.pdf file are incorrect; the values listed in the table above give a better idea of where the IOR rates sit.
Janus Cluster
Notice the IOR times do not match what is in the file containing the plot.
| | IOR-POSIX | IOR-MPIIO | PNetCDF | PNetCDF from PIOVDC | PIOVDC (I/O) | PIOVDC aggregate |
|---|---|---|---|---|---|---|
| Maximum | 232.86 MiB/sec (147.55 sec) | 272.48 MiB/sec (126.10 sec) | 372.38 MiB/sec (92.27 sec) | 312.81 MiB/sec (109.84 sec) | 254.66 MiB/sec (134.92 sec) | 247.05 MiB/sec (139.08 sec) |
| Average | 218.20 MiB/sec (157.46 sec) | 250.60 MiB/sec (137.10 sec) | 298.95 MiB/sec (114.93 sec) | 250.71 MiB/sec (137.05 sec) | 200.38 MiB/sec (171.47 sec) | 195.20 MiB/sec (176.02 sec) |
| Minimum | 206.48 MiB/sec (166.40 sec) | 212.33 MiB/sec (148.04 sec) | 232.09 MiB/sec (148.04 sec) | 192.27 MiB/sec (178.70 sec) | 123.44 MiB/sec (278.35 sec) | 121.51 MiB/sec (282.77 sec) |
A second set of runs is listed below. In this set the PNetCDF writes from the PIOVDC benchmark code performed better. Looking at individual runs, the times tend to be similar, but there are runs where one is significantly better or worse. Again, the IOR data is off in the file containing the plot.
| | IOR-POSIX | IOR-MPIIO | PNetCDF | PNetCDF from PIOVDC | PIOVDC (I/O) | PIOVDC aggregate |
|---|---|---|---|---|---|---|
| Maximum | 258.29 MiB/sec (133.02 sec) | 227.25 MiB/sec (151.19 sec) | 324.19 MiB/sec (105.98 sec) | 402.81 MiB/sec (85.29 sec) | 266.83 MiB/sec (128.77 sec) | 258.44 MiB/sec (132.94 sec) |
| Average | 250.11 MiB/sec (137.37 sec) | 216.84 MiB/sec (158.45 sec) | 260.22 MiB/sec (132.04 sec) | 338.48 MiB/sec (101.51 sec) | 241.05 MiB/sec (142.54 sec) | 234.20 MiB/sec (146.71 sec) |
| Minimum | 236.77 MiB/sec (145.11 sec) | 206.40 MiB/sec (166.47 sec) | 182.17 MiB/sec (185.55 sec) | 299.23 MiB/sec (114.82 sec) | 203.11 MiB/sec (169.16 sec) | 198.22 MiB/sec (173.34 sec) |