Design notes for the parallel/multiprocessor algorithms and data structures
===========================================================================

$Header$

The Problem
-----------

Doing horizon finding on multiple processors is difficult, because when
we interpolate the geometry to the trial horizon surface, we must use a
global interpolator (CCTK_InterpGridArrays()), which is (necessarily) a
collective operation, i.e. the interpolator must be called synchronously
on every processor, with identical arguments except for the number of
interpolation points, the interpolation coordinates, and the output
array pointers.  It's quite tricky to arrange this synchronicity when
different processors are locating different horizons.

Newton's Method
---------------

We find the apparent horizon by solving Theta(h) = 0 by a global
Newton's-method iteration.  ("global" meaning that we update all the
horizon surface points simultaneously at each iteration.)  That is,
the basic uniprocessor solution algorithm is

        while (true)
        {
        compute Theta(h)
        if (||Theta(h)|| small or too many iterations)
           exit
        compute Jacobian[Theta(h)]
        solve linear system J.dh = -Theta
        update h += dh
        }

As mentioned above, the computation of Theta(h) must be done
synchronously (across all processors).  It turns out that the
evaluation of the Jacobian matrix must also be done synchronously --
it too involves a geometry interpolation.

High-Level Outline of the Solution
----------------------------------

We use the following design to do this:

Horizons are identified by an integer "horizon number".  This runs from
1 to N_horizons for the "genuine" horizons we want to find, or 0 for a
"dummy" horizon used if there are more processors than horizons to find.
[We actually implement the dummy horizon by passing NULL horizon and/or
Jacobian pointers to expansion() and expansion_Jacobian(), but this is
an implementation detail which doesn't matter here.]

Each processor is assigned the dummy horizon, and zero or more genuine
horizons; each genuine horizon is assigned to exactly one processor.
A processor with one or more genuine horizons is called an "active"
processor; a processor with only the dummy horizon is called a "dummy"
processor.  The assignment of horizons to processors is static for the
whole Cactus run.  [A dynamic assignment might be more efficient, but
then we'd have to move our state for a horizon from one processor to
another.]

All the processors do their Newton iteration synchronously, working on
genuine horizons if they have any, or the dummy horizon if not.  If any
error occurs when computing Theta or the Jacobian, or in solving the
linear equations, we treat this as failing to find that horizon.

For example, suppose we have 3 horizons, which are found (the Newton
iteration converges) after respectively 3, 5, and 4 iterations.  Then
with 2 processors the iterations would look like this (where h1/2/3
means horizon 1/2/3, -- means the dummy horizon, and a * after Theta
means convergence):

             processor #0     processor #1
             ------------     ------------
         1   h1 Theta         h2 Theta
         2   h1 Jacobian      h2 Jacobian
         3   h1 Theta         h2 Theta
         4   h1 Jacobian      h2 Jacobian
         5   h1 Theta*        h2 Theta
         6   -- Jacobian      h2 Jacobian
         7   h3 Theta         h2 Theta
         8   h3 Jacobian      h2 Jacobian
         9   h3 Theta         h2 Theta*
        10   h3 Jacobian      -- Jacobian
        11   h3 Theta         -- Theta
        12   h3 Jacobian      -- Jacobian
        13   h3 Theta*        -- Theta

(Notice that at line 6, processor #0 does a dummy-horizon Jacobian
computation before starting its next genuine horizon.  This is to keep
the Theta and Jacobian computations synchronized across all processors.
In a single-processor run we'd skip this and go directly to the next
genuine horizon.)

With 3 processors this same example would look like this:

             processor #0     processor #1     processor #2
             ------------     ------------     ------------
         1   h1 Theta         h2 Theta         h3 Theta
         2   h1 Jacobian      h2 Jacobian      h3 Jacobian
         3   h1 Theta         h2 Theta         h3 Theta
         4   h1 Jacobian      h2 Jacobian      h3 Jacobian
         5   h1 Theta*        h2 Theta         h3 Theta
         6   -- Jacobian      h2 Jacobian      h3 Jacobian
         7   -- Theta         h2 Theta         h3 Theta*
         8   -- Jacobian      h2 Jacobian      -- Jacobian
         9   -- Theta         h2 Theta*        -- Theta

With 4 processors it would look like this:

             processor #0     processor #1     processor #2     processor #3
             ------------     ------------     ------------     ------------
         1   h1 Theta         h2 Theta         h3 Theta         -- Theta
         2   h1 Jacobian      h2 Jacobian      h3 Jacobian      -- Jacobian
         3   h1 Theta         h2 Theta         h3 Theta         -- Theta
         4   h1 Jacobian      h2 Jacobian      h3 Jacobian      -- Jacobian
         5   h1 Theta*        h2 Theta         h3 Theta         -- Theta
         6   -- Jacobian      h2 Jacobian      h3 Jacobian      -- Jacobian
         7   -- Theta         h2 Theta         h3 Theta*        -- Theta
         8   -- Jacobian      h2 Jacobian      -- Jacobian      -- Jacobian
         9   -- Theta         h2 Theta*        -- Theta         -- Theta

Any additional processors would similarly do all their Theta and
Jacobian computations on the dummy horizon.

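To make the scheduling above concrete, here is a minimal sketch of the
per-processor driver loop, written as compilable C++ with stub helpers.
It is NOT the actual AHFinderDirect code: the helpers (compute_Theta(),
compute_Jacobian(), Theta_has_converged(), solve_and_update_h(), and
anyone_needs_more_iterations()) are hypothetical stand-ins for the
operations described in this file, and error handling (treating a failed
Theta/Jacobian/linear-solve as "horizon not found") and the iteration
limit are omitted.

        #include <cstddef>
        #include <vector>

        // Stub collective operations: every processor must call these at every
        // iteration, whether it is working on a genuine horizon or the dummy one.
        // (In the real code these boil down to expansion() / expansion_Jacobian(),
        // with NULL pointers passed for the dummy horizon.)
        void compute_Theta      (int /*hn*/) { /* geometry interpolation + Theta(h) */ }
        void compute_Jacobian   (int /*hn*/) { /* geometry interpolation + Jacobian  */ }
        void solve_and_update_h (int /*hn*/) { /* solve J.dh = -Theta, then h += dh  */ }
        bool Theta_has_converged(int /*hn*/) { return true; /* placeholder */ }

        // Hypothetical global synchronization: true iff *some* processor still
        // needs more Newton iterations (see "Interprocessor Synchronization"
        // below).  This placeholder is only correct for a single-processor run.
        bool anyone_needs_more_iterations(bool I_need_more) { return I_need_more; }

        const int DUMMY_HORIZON = 0;

        // my_horizons = the genuine horizons statically assigned to this
        // processor (empty on a "dummy" processor)
        void Newton_solve(const std::vector<int>& my_horizons)
        {
          std::size_t next = 0;         // index into my_horizons
          int hn = my_horizons.empty() ? DUMMY_HORIZON : my_horizons[next];

          for (;;)
          {
            compute_Theta(hn);                  // synchronous across all processors

            const bool just_converged
               = (hn != DUMMY_HORIZON) && Theta_has_converged(hn);

            // will this processor still have genuine work after this iteration?
            const bool I_need_more
               = just_converged ? (next+1 < my_horizons.size())
                                : (hn != DUMMY_HORIZON);

            if (! anyone_needs_more_iterations(I_need_more))
               break;                           // every genuine horizon is done

            if (just_converged)
            {
              // do a dummy-horizon Jacobian to stay in sync with the other
              // processors, then start our next genuine horizon (if any) at
              // the next Theta evaluation
              compute_Jacobian(DUMMY_HORIZON);
              ++next;
              hn = (next < my_horizons.size()) ? my_horizons[next] : DUMMY_HORIZON;
            }
            else
            {
              compute_Jacobian(hn);             // synchronous, even for the dummy horizon
              if (hn != DUMMY_HORIZON)
                 solve_and_update_h(hn);
            }
          }
        }

Tracing this loop for the 2-processor example above reproduces that
schedule, including the dummy-horizon Jacobian at line 6.
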
Interprocessor Synchronization
------------------------------

To implement this algorithm, after each Theta evaluation each active
processor computes an "I need more iterations (either on this or another
genuine horizon)" flag, and all the processors do an inclusive-or
reduction of these flags, with the result broadcast to all processors.
Each processor then uses the reduction result to determine whether or
not to continue iterating.

Broadcasting the Iteration Status
---------------------------------

After each Theta evaluation each active processor also sends
* which horizon it's working on
* the iteration number
* the expansion() status
* its error norms
to processor 0, which uses this information to print a status report
of how the iterations are converging.

If an (active) processor finds a horizon, it computes the BH diagnostics
and broadcasts them, and optionally the horizon shape, to all processors.
Processor 0 uses this to print the BH diagnostics, and (optionally) all
processors use the diagnostics and the horizon shape to set the drift
correction information and/or the BH excision mask.
[All the processors actually need the ghosted horizon shape in order to
be able to do angular interpolations to test whether a given point is
inside/outside the horizon.  But since interprocessor communication is
probably expensive, we only broadcast the horizon shape on the nominal
grid, then recover the ghost-zone values locally on each processor.]

Since all processors must participate in a broadcast, all processors
must somehow know that a horizon has just been found, and by which
processor.  Thus, after each Theta iteration, we broadcast a "which
horizon I've just found, or none" integer from each active processor
to all processors.

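The per-horizon broadcast can be pictured as in the following sketch.
It uses raw MPI purely for illustration (the actual code need not call
MPI directly); the buffer names are hypothetical, every processor is
assumed to already know matching buffer sizes, the found_by[] array is
assumed to have come from the "which horizon I've just found, or none"
exchange just described, and the optional horizon-shape broadcast is
shown unconditionally for simplicity.

        #include <mpi.h>
        #include <vector>

        // found_by[p] = the horizon number processor p has just found, or 0 for
        // "none"; known identically on all processors after the per-iteration
        // status exchange.
        void broadcast_found_horizons(MPI_Comm comm,
                                      const std::vector<int>& found_by,
                                      std::vector<double>& BH_diagnostics,
                                      std::vector<double>& horizon_shape)
        {
          const int N_procs = static_cast<int>(found_by.size());

          for (int p = 0; p < N_procs; ++p)
          {
            if (found_by[p] == 0)
               continue;                // processor p found nothing this iteration

            // Collective calls: *every* processor must execute these with root = p.
            // The finder (rank p) supplies the data; everyone else receives it.
            MPI_Bcast(BH_diagnostics.data(),
                      static_cast<int>(BH_diagnostics.size()), MPI_DOUBLE, p, comm);

            // The horizon shape is broadcast on the nominal grid only; each
            // processor recovers the ghost-zone values locally afterwards.
            MPI_Bcast(horizon_shape.data(),
                      static_cast<int>(horizon_shape.size()), MPI_DOUBLE, p, comm);

            // ... processor 0 then prints the BH diagnostics; all processors may
            //     use the (re-ghosted) shape for the excision mask and/or the
            //     drift correction ...
          }
        }

In the optimized scheme described next, the found_by[] information
arrives as part of a single combined per-iteration message.
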
Optimizing the Interprocessor Communication
-------------------------------------------

In practice interprocessor communication is expensive.  Moreover, given
that we're doing synchronous operations across all processors (i.e. the
inclusive-or reduction with the result broadcast to all processors),
latency is probably more important than bandwidth.

Thus for the implementation we group the different interprocessor
communications described above into two operations:

After each Theta evaluation each active processor broadcasts
* which horizon it's working on
* the iteration number
* the expansion() status
* its error norms
* a Boolean "I have just found a horizon" flag
* a Boolean "I need more iterations" flag
to all processors.

All processors then do the inclusive-or reduction of the "I need more
iterations" flags locally, on the values received from the broadcast.

All processors also go through the "I have just found a horizon" flags
received from the broadcast; for each flag which is true, all processors
then know to participate in the broadcast of the BH diagnostics and
(optionally) the horizon shape from the processor which just found that
horizon, to all processors.

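The grouped per-iteration exchange might be sketched as follows.  Again,
this is only an illustration, not the actual AHFinderDirect message
format: the record layout and field names, the use of raw MPI, and
packing the record as MPI_BYTE (which assumes an identical struct layout
on every processor) are all simplifying assumptions.

        #include <mpi.h>
        #include <vector>

        // One fixed-size status record per processor per Newton iteration.
        // (Dummy processors contribute a record too, with horizon_number = 0
        // and both flags false, so the collective call matches everywhere.)
        struct iteration_status
        {
          int    horizon_number;        // genuine horizon being worked on, or 0 (dummy)
          int    iteration_number;
          int    expansion_status;      // status code from the Theta evaluation
          double Theta_infinity_norm;   // error norms, for the status report
          double Theta_two_norm;        //    printed by processor 0
          int    found_horizon;         // horizon number just found, or 0 for "none"
          int    need_more_iterations;  // Boolean: this processor still has work to do
        };

        // Returns true iff any processor still needs more Newton iterations.
        // On return all_status[] holds every processor's record, from which each
        // processor can also see which horizons were just found, and hence which
        // per-horizon broadcasts (sketched earlier) it must participate in.
        bool exchange_iteration_status(MPI_Comm comm,
                                       const iteration_status& my_status,
                                       std::vector<iteration_status>& all_status)
        {
          int N_procs = 0;
          MPI_Comm_size(comm, &N_procs);
          all_status.resize(N_procs);

          // A single collective call per Newton iteration: since latency matters
          // more than bandwidth, everything is packed into one small record.
          iteration_status send_buffer = my_status;
          const int record_size = static_cast<int>(sizeof(iteration_status));
          MPI_Allgather(&send_buffer,      record_size, MPI_BYTE,
                        all_status.data(), record_size, MPI_BYTE,
                        comm);

          // The inclusive-or "reduction" is now done locally on the gathered flags.
          bool anyone_needs_more = false;
          for (int p = 0; p < N_procs; ++p)
             if (all_status[p].need_more_iterations)
                anyone_needs_more = true;

          return anyone_needs_more;
        }

This single exchange replaces the separate reduction, status report, and
"which horizon I've just found" messages described in the previous
sections.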