Design notes for the parallel/multiprocessor algorithms and data structures
===========================================================================

$Header$

The Problem
-----------
Doing horizon finding on multiple processors is difficult, because when
we interpolate the geometry to the trial horizon surface, we must use a
global interpolator (CCTK_InterpGridArrays()), which is (necessarily) a
collective operation, i.e. the interpolator must be called synchronously
on every processor, with identical arguments except for the number of
interpolation points, the interpolation coordinates, and the output
array pointers.  It's quite tricky to arrange this synchronicity when
different processors are locating different horizons.
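
To make this concrete, here is a rough sketch (not the thorn's actual
code) of what one such collective geometry-interpolation call might look
like.  The variable names (x_coords, gxx_at_points, and the various
handles) are hypothetical; the handles are assumed to have been obtained
beforehand from CCTK_InterpHandle(), Util_TableCreateFromString(), and
CCTK_CoordSystemHandle().

	// every processor must make this call at the same point in the
	// algorithm; only N_interp_points, interp_coords[], and
	// output_arrays[] may differ from processor to processor
	const void *const interp_coords[3]
		= { (const void *) x_coords,	// this processor's interpolation
		    (const void *) y_coords,	// point coordinates
		    (const void *) z_coords };
	const CCTK_INT input_array_indices[1]
		= { CCTK_VarIndex("ADMBase::gxx") };	// likewise the other g_ij, K_ij
	const CCTK_INT output_array_types[1]
		= { CCTK_VARIABLE_REAL };
	void *const output_arrays[1]
		= { (void *) gxx_at_points };

	const int status
		= CCTK_InterpGridArrays(cctkGH,
					3,			// number of dimensions
					local_interp_handle,
					param_table_handle,
					coord_system_handle,
					N_interp_points,	// may differ across processors
					CCTK_VARIABLE_REAL,	// coordinate data type
					interp_coords,
					1, input_array_indices,
					1, output_array_types,
					output_arrays);
	if (status < 0)
		CCTK_WARN(1, "geometry interpolation failed");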


Newton's Method
---------------
We find the apparent horizon by solving   Theta(h) = 0   by a global
Newton's-method iteration.  ("global" meaning we update all the horizon
surface points simultaneously at each iteration.)  That is, the basic
uniprocessor solution algorithm is

	while (true)
	{
	compute Theta(h)
	if (||Theta(h)|| small or too many iterations)
		exit
	compute Jacobian[Theta(h)]
	solve linear system J.dh = -Theta
	update h += dh
	}

As mentioned above, the computation of Theta(h) must be done synchronously
(across all processors).  It turns out that the evaluation of the Jacobian
matrix must also be done synchronously -- it too involves a geometry
interpolation.


High-Level Outline of the Solution
----------------------------------
We use the following design to do this:

Horizons are identified by an integer "horizon number".  This runs from
1 to N_horizons for the "genuine" horizons we want to find, or 0 for a
"dummy" horizon used if there are more processors than horizons to find.
	[We actually implement the dummy horizon as
	passing NULL horizon and/or Jacobian pointers
	to expansion() and expansion_Jacobian(), but
	this is an implementation detail which doesn't
	matter here.]

Each processor is assigned the dummy horizon, and zero or more genuine
horizons; each genuine horizon is assigned to exactly one processor.
A processor with one or more genuine horizons is called an "active"
processor; a processor with only the dummy horizon is called a "dummy"
processor.  The assignment of horizons to processors is static for the
whole Cactus run.
	[A dynamic assignment might be more efficient,
	but then we'd have to move our state for a horizon
	from one processor to another.]
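
For illustration, here is a minimal sketch of one possible static
assignment, dealing the genuine horizons out to the processors
round-robin.  The variable names are hypothetical, and the driver's
actual assignment strategy may differ.

	// every processor can always fall back on the dummy horizon (number 0);
	// here we just record which genuine horizons (1 ... N_horizons) are ours
	const int N_procs = CCTK_nProcs(cctkGH);
	const int my_proc = CCTK_MyProc(cctkGH);

	std::vector<int> my_genuine_horizons;		// needs #include <vector>
	for (int hn = 1 ; hn <= N_horizons ; ++hn)
		if ((hn - 1) % N_procs == my_proc)
			my_genuine_horizons.push_back(hn);

	// a processor whose list is empty is a "dummy" processor: it will do
	// all of its Theta and Jacobian computations on the dummy horizon
	const bool I_am_active = ! my_genuine_horizons.empty();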

All the processors do their Newton iteration synchronously, working on
their genuine horizons if they have any, or on the dummy horizon if not.

For example, suppose we have 3 horizons, which are found (the Newton
iteration converges) after respectively 3, 5, and 4 iterations.  Then
with 2 processors the iterations would look like this (where h1/2/3
means horizon 1/2/3, h- means the dummy horizon, and a * after Theta
means convergence):

	processor #0	processor #1
	------------	------------
1	h1 Theta	h2 Theta
2	h1 Jacobian	h2 Jacobian
3	h1 Theta	h2 Theta
4	h1 Jacobian	h2 Jacobian
5	h1 Theta*	h2 Theta
6	h- Jacobian	h2 Jacobian
7	h3 Theta	h2 Theta
8	h3 Jacobian	h2 Jacobian
9	h3 Theta	h2 Theta*
10	h3 Jacobian	h- Jacobian
11	h3 Theta	h- Theta
12	h3 Jacobian	h- Jacobian
13	h3 Theta*	h- Theta

(Notice that at line 6, processor #0 does a dummy-horizon Jacobian
computation before starting its next genuine horizon.  This is to keep
the Theta and Jacobian computations synchronized across all processors.
In a single-processor run we'd skip this and go directly to the next
genuine horizon.)

With 3 processors this same example would look like this:

	processor #0	processor #1	processor #2
	------------	------------	------------
1	h1 Theta	h2 Theta	h3 Theta
2	h1 Jacobian	h2 Jacobian	h3 Jacobian
3	h1 Theta	h2 Theta	h3 Theta
4	h1 Jacobian	h2 Jacobian	h3 Jacobian
5	h1 Theta*	h2 Theta	h3 Theta
6	h- Jacobian	h2 Jacobian	h3 Jacobian
7	h- Theta	h2 Theta	h3 Theta*
8	h- Jacobian	h2 Jacobian	h- Jacobian
9	h- Theta	h2 Theta*	h- Theta

With 4 processors it would look like this:

	processor #0	processor #1	processor #2	processor #3
	------------	------------	------------	------------
1	h1 Theta	h2 Theta	h3 Theta	h- Theta
2	h1 Jacobian	h2 Jacobian	h3 Jacobian	h- Jacobian
3	h1 Theta	h2 Theta	h3 Theta	h- Theta
4	h1 Jacobian	h2 Jacobian	h3 Jacobian	h- Jacobian
5	h1 Theta*	h2 Theta	h3 Theta	h- Theta
6	h- Jacobian	h2 Jacobian	h3 Jacobian	h- Jacobian
7	h- Theta	h2 Theta	h3 Theta*	h- Theta
8	h- Jacobian	h2 Jacobian	h- Jacobian	h- Jacobian
9	h- Theta	h2 Theta*	h- Theta	h- Theta

Any additional processors would similarly do all Theta and Jacobian
computations on the dummy horizon.
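
The per-processor control flow implied by these tables might look
roughly like the following sketch.  The helper functions shown here
(or_reduce_over_all_processors(), next_genuine_horizon_or_dummy(),
more_work_remains(), etc.) are hypothetical, and the
expansion()/expansion_Jacobian() signatures are simplified.

	int hn = next_genuine_horizon_or_dummy();	// 0 ==> dummy horizon

	while (true)
	{
	const bool ok = expansion(hn);			// compute Theta(h) (collective)
	const bool converged = ok && Theta_norm_is_small(hn);
	const bool I_need_more_iterations
		= more_work_remains(hn, converged, ok);	// always false on a dummy processor

	// collective: keep going iff *some* processor still needs iterations
	if (! or_reduce_over_all_processors(I_need_more_iterations))
		break;

	if (converged || !ok)
		{
		// do this Jacobian on the dummy horizon to stay synchronized
		// with the other processors, then move on to our next genuine
		// horizon (or to the dummy horizon if we have none left)
		expansion_Jacobian(0);			// collective
		hn = next_genuine_horizon_or_dummy();
		continue;
		}

	expansion_Jacobian(hn);				// collective
	solve_Newton_step_and_update(hn);		// solve J.dh = -Theta, then h += dh
	}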


How and What to Synchronize
---------------------------
To implement this algorithm, after each Theta evaluation each active
processor computes a "we need more iterations (either on this or another
genuine horizon)" flag, and all the processors do an inclusive-or-reduction
of these flags, with the result broadcast to all processors.

(If any error occurs when computing Theta or the Jacobian, or in solving
the linear equations, we treat this as failing to find this horizon.)

Each processor then uses the inclusive-or-reduction result to determine
whether or not to continue iterating.
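
For illustration, the or-reduce-and-broadcast could be written directly
against MPI, roughly as in the sketch below (this is the hypothetical
helper used in the scheduling sketch above); the thorn could equally
well go through the Cactus reduction interface.

	#include <mpi.h>

	// returns the inclusive or of my_flag over all processors,
	// with the result available on every processor
	bool or_reduce_over_all_processors(bool my_flag)
	{
	int local_flag = my_flag ? 1 : 0;
	int global_flag = 0;
	MPI_Allreduce(&local_flag, &global_flag,
		      1, MPI_INT, MPI_LOR, MPI_COMM_WORLD);
	return global_flag != 0;
	}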

After each Theta evaluation each active processor also sends
* which horizon it's working on
* the iteration number
* its error norms
to processor 0, which uses this information to print a status report
of how the iterations are converging.
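
For illustration, here is a rough sketch of how that per-iteration
status information could be gathered onto processor 0.  The struct
layout and the print_status_report() helper are hypothetical; the
details of what is actually sent may differ.

	struct iteration_status
		{
		int horizon_number;		// 0 ==> dummy horizon
		int iteration_number;
		double Theta_norm;		// ||Theta|| error norm
		};

	iteration_status my_status;
	my_status.horizon_number = hn;
	my_status.iteration_number = iteration_number;
	my_status.Theta_norm = Theta_norm;

	// gather every processor's status onto processor 0
	std::vector<iteration_status> all_status(N_procs);
	MPI_Gather(&my_status, (int) sizeof(my_status), MPI_BYTE,
		   all_status.data(), (int) sizeof(my_status), MPI_BYTE,
		   0, MPI_COMM_WORLD);

	if (my_proc == 0)
		print_status_report(all_status);	// print convergence status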