The data sent by MPI must be a contiguous array (for any single send). The blocks we want to send will often not be contiguous. For instance, if we have a (2D for simplicity) 10 x 10 matrix and we want to send the upper-left 5 x 5 block, that block is not contiguous in memory. The only time our memory is contiguous is when we want to send successive rows or columns (and the data is stored row-major or column-major, respectively). In general, we'll need to copy the data into a send buffer before MPI ships it. It may or may not be worth testing for already-contiguous data; I doubt it is, since compared to the MPI send, the local memory copy should be negligible.
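As a concrete illustration of the packing step, here is a minimal sketch in C (assuming a row-major 10 x 10 matrix of doubles; the sizes and function name are mine, purely for illustration):

    #include <mpi.h>

    #define N 10   /* full matrix is N x N, stored row-major        */
    #define B  5   /* we want to send the upper-left B x B block    */

    /* Pack the upper-left B x B block of a row-major N x N matrix into a
     * contiguous buffer, then send it.  Each row of the block is contiguous,
     * but successive rows are N (not B) doubles apart, so handing the matrix
     * itself to MPI_Send would ship the wrong data. */
    void send_upper_left_block(double matrix[N][N], int dest, int tag)
    {
        double sendbuf[B * B];

        for (int i = 0; i < B; i++)
            for (int j = 0; j < B; j++)
                sendbuf[i * B + j] = matrix[i][j];

        MPI_Send(sendbuf, B * B, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD);
    }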
For similar reasons, received data must be placed in a receive buffer before it can be copied into its correct (generally non-contiguous) locations in local memory.
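The receiving side is the mirror image; a sketch under the same assumptions:

    #include <mpi.h>

    #define N 10   /* full matrix is N x N, stored row-major */
    #define B  5   /* the incoming block is B x B            */

    /* Receive a contiguous B x B block and scatter it into its final
     * (non-contiguous) position in the local N x N matrix.  The receive
     * buffer can be reused or freed as soon as the copy-out is done. */
    void recv_upper_left_block(double matrix[N][N], int src, int tag)
    {
        double recvbuf[B * B];

        MPI_Recv(recvbuf, B * B, MPI_DOUBLE, src, tag,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        for (int i = 0; i < B; i++)
            for (int j = 0; j < B; j++)
                matrix[i][j] = recvbuf[i * B + j];
    }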
If we were to send all the data from the data set at once, our memory requirement is approximately triple the size of the original dataset. For instance, if we have 2 gigs of data, we would have 2 gigs in send buffers and 2 gigs in receive buffers. We can clear the send buffers before we write to the final memory location, which keeps the requirement from growing to quadruple the original dataset. Similarly, once the data is in the final memory location, we clear the receive buffers. We will also need another copy of the (transformed) data when we perform the transform. This is another reason that sending, receiving, and transforming all the data at once will require approximately triple the original dataset in memory.
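A small sketch of that accounting, stage by stage; the stage breakdown is my reading of the argument above, and the 2 GB figure is just the example number:

    #include <stdio.h>

    /* Footprint of the all-at-once approach at each stage, in units of the
     * original dataset size, assuming each buffer is cleared as soon as its
     * contents have been copied onward. */
    static const double stage_footprint[] = {
        2.0,  /* pack:      original + send buffers                   */
        3.0,  /* exchange:  original + send buffers + receive buffers */
        3.0,  /* unpack:    original + receive buffers + final copy   */
        3.0,  /* transform: original + final copy + transformed copy  */
    };

    int main(void)
    {
        double original_gb = 2.0;   /* the 2 GB example above */
        double peak = 0.0;
        for (int i = 0; i < 4; i++)
            if (stage_footprint[i] > peak)
                peak = stage_footprint[i];
        printf("peak ~ %.0f GB (%.0fx the original)\n",
               peak * original_gb, peak);
        return 0;
    }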
For this reason, it is probably best to send and transform one set of superblocks at a time (by set, I mean enough superblocks so that every processor has something to do). The memory requirement for this is something like Original_Data * 1.125 + Number_Processors * Size_Of_Superblock * 2. The Original_Data * 1.125 term comes from the original data plus the lambda blocks, which we must keep in memory to do successive transforms. The Number_Processors * Size_Of_Superblock * 2 term comes from the send buffer, receive buffer, final data location, and transformed data location (of which at most two are present at a given time). The second level of the transform may actually require more memory than this (the formula will be similar, with the additional requirement of storing the second set of lambda blocks). After the second level, we can discard the original lambda blocks and memory usage will decrease.
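A small sketch of that estimate; the variable names and the example numbers are mine, purely for illustration:

    #include <stdio.h>

    /* Rough memory estimate for processing one set of superblocks at a time:
     *   original data plus lambda blocks (~12.5% extra)
     * + one superblock-sized slot per processor, with at most two of the
     *   four per-superblock buffers (send, receive, final, transformed)
     *   alive at any given time. */
    static double superblock_estimate_bytes(double original_bytes,
                                            int    num_processors,
                                            double superblock_bytes)
    {
        return original_bytes * 1.125
             + (double)num_processors * superblock_bytes * 2.0;
    }

    int main(void)
    {
        /* Made-up example: 2 GB dataset, 16 processors, 16 MB superblocks. */
        double est = superblock_estimate_bytes(2.0e9, 16, 16.0e6);
        printf("estimated peak: %.2f GB\n", est / 1.0e9);
        return 0;
    }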
There is probably a memory space vs. runtime trade-off here, and I doubt that this is the optimum solution. However, it seems to be a straightforward implementation, so it is what I'm working on as a starting point.