Trying to wrap things up and get something working, without much success. I think the biggest problem is that I'm not a particularly good debugger anyway and doing it on multiple processors just hurts my head (that and I'm pretty tired now). I'm posting the source I've generated now, even though I mauled it a bit trying to figure things out. I've tried pretty hard to explain the details of what I'm trying to do, even if my execution isn't working correctly. The source is source.tar.gz.
I'm doing my exit paperwork, etc. in a little bit and I was unable to figure out this problem. I'm not updating the code posted here because it's still the cleanest iteration (I made quite a few changes since, but most of them were hack-ish and didn't help the situation anyway). I've verified, as best I can, that the data in the sending "chunks" is what it should be (for each node, it is the rank, or rank+1, I don't remember). The problem is that it never seems to be received successfully at the other end. Sometimes some of the received entries are correct, but never the entire array. I think the receive isn't completing before I try to access the data, but MPI_Waitall should prevent this. I tried a blocking receive (MPI_Recv) as well, but that didn't help either. I even attempted to hack in the same transfer mechanism used in mem_xfer in Toy Programs, but I wasn't able to get that working (I verified that it works fine within the mem_xfer context). I actually remember having similar issues when writing mem_xfer, but, since I'm copying the send/receive commands, I don't understand why they would be showing up again. It'd be nice if I had better notes at this point...
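Since every entry of a chunk is supposed to hold one value (the sender's rank, or rank+1), a small check on the receive buffer could show whether a chunk arrived partially filled or not at all. The following is only a sketch: ChunkCheck and checkChunk are hypothetical names that do not appear in source.tar.gz, and it assumes the chunk has been copied into a std::vector<int>.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of source.tar.gz): summarizes how a received
// chunk differs from the single value every entry is expected to hold.
struct ChunkCheck {
    std::size_t mismatches;   // number of entries that differ from expected
    std::ptrdiff_t firstBad;  // index of the first bad entry, or -1 if none
};

ChunkCheck checkChunk(const std::vector<int>& chunk, int expected) {
    ChunkCheck result{0, -1};
    for (std::size_t i = 0; i < chunk.size(); ++i) {
        if (chunk[i] != expected) {
            if (result.firstBad < 0) {
                result.firstBad = static_cast<std::ptrdiff_t>(i);
            }
            ++result.mismatches;
        }
    }
    return result;
}
```

Calling this on each receive buffer after MPI_Waitall returns, and printing mismatches and firstBad per rank, would show whether the bad entries form a tail (which might point at a count or size mismatch) or are scattered through the array.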
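MPI does pass error information back, which I barely looked at. Almost all the MPI functions return an int that is MPI_SUCCESS on success and != 0 on an error. Note that by default MPI aborts the job on an error, so these return values are only useful once the communicator's error handler has been set to MPI_ERRORS_RETURN. There is also a status object (MPI_Status) that can possibly be accessed for more information, but I wasn't able to look at that very closely. I'm not particularly good at debugging and MPI just complicates the problem more.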
The code as written doesn't have a node send to or receive from itself. I thought that a self-send/receive might be causing the problems. In some test cases I wrote, this does not seem to be the issue, so the special case may be unnecessary. Whatever the case, it can be handled more cleanly than how I did it.
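One cleaner way to skip the self-send/receive might be to build the list of partner ranks once and loop over that list for both the sends and the receives. This is a minimal sketch, not what SimpleMPI.cpp does: peersExcludingSelf is a hypothetical name, and rank and size are assumed to come from MPI_Comm_rank and MPI_Comm_size.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of source.tar.gz): returns every rank in
// [0, size) except `rank` itself, in ascending order.
std::vector<int> peersExcludingSelf(int rank, int size) {
    assert(size > 0 && rank >= 0 && rank < size);
    std::vector<int> peers;
    peers.reserve(static_cast<std::size_t>(size - 1));
    for (int r = 0; r < size; ++r) {
        if (r != rank) {
            peers.push_back(r);
        }
    }
    return peers;
}
```

With this, the self-skip lives in one place, and, assuming one nonblocking send and one nonblocking receive per peer, the number of requests handed to MPI_Waitall would be 2 * peers.size().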
Some information on the source file contents:
- Bounds.cpp & Bounds.h contain the classes I wrote to facilitate (or try to anyway...) the routines. Documentation of "what the members should do" is found in the header portion.
- HaarBlock3D.cpp & HaarBlock3D.h contain the class to do a forward Haar transform. The Transform Code page has a little more information. Basically, this is just reworked from the Vapor source.
- Makefile will build the different applications. "make all" compiles SimpleHaar and SimpleMPI. The former is a simple example of doing a transform with the HaarBlock3D class. The latter is the botched MPI application which will not send correctly. "make runSimpleHaar" or "make runSimpleMPI" will submit the jobs using the batch submit process on bluevista. The project number (-P option) may need to be changed on the bsub line for another user. Also, this job submit command should work with bluefire, but should be reviewed (e.g. implement the measures discussed in the section meeting today).
- SimpleHaar.cpp is the source for SimpleHaar described above.
- SimpleMPI.cpp is the source for SimpleMPI described above.