% *======================================================================*
%  Cactus Thorn template for ThornGuide documentation
%  Author: Ian Kelley
%  Date: Sun Jun 02, 2002
%  $Header: /home/eschnett/C/carpet/Carpet/Carpet/doc/documentation.tex,v 1.2 2002/11/29 10:43:43 schnetter Exp $
%
%  Thorn documentation in the latex file doc/documentation.tex
%  will be included in ThornGuides built with the Cactus make system.
%  The scripts employed by the make system automatically include
%  pages about variables, parameters and scheduling parsed from the
%  relevant thorn CCL files.
%
%  This template contains guidelines which help to assure that your
%  documentation will be correctly added to ThornGuides.  More
%  information is available in the Cactus UsersGuide.
%
%  Guidelines:
%   - Do not change anything before the line
%       % START CACTUS THORNGUIDE
%     except for filling in the title, author, date, etc. fields.
%        - Each of these fields should only be on ONE line.
%        - Author names should be separated with a \\ or a comma.
%   - You may define your own macros, but they must appear after
%     the START CACTUS THORNGUIDE line, and must not redefine standard
%     latex commands.
%   - To avoid name clashes with other thorns, 'labels', 'citations',
%     'references', and 'image' names should conform to the following
%     convention:
%       ARRANGEMENT_THORN_LABEL
%     For example, an image wave.eps in the arrangement CactusWave and
%     thorn WaveToyC should be renamed to CactusWave_WaveToyC_wave.eps.
%   - Graphics should only be included using the graphicx package,
%     more specifically with the "includegraphics" command.  Do
%     not specify any graphic file extensions in your .tex file.  This
%     will allow us (later) to create a PDF version of the ThornGuide
%     via pdflatex.
%   - References should be included with the latex "bibitem" command.
%   - Use \begin{abstract}...\end{abstract} instead of \abstract{...}.
%   - For the benefit of our Perl scripts, and for future extensions,
%     please use simple latex.
%
% *======================================================================*
%
%  Example of including a graphic image:
%    \begin{figure}[ht]
%      \begin{center}
%        \includegraphics[width=6cm]{MyArrangement_MyThorn_MyFigure}
%      \end{center}
%      \caption{Illustration of this and that}
%      \label{MyArrangement_MyThorn_MyLabel}
%    \end{figure}
%
%  Example of using a label:
%    \label{MyArrangement_MyThorn_MyLabel}
%
%  Example of a citation:
%    \cite{MyArrangement_MyThorn_Author99}
%
%  Example of including a reference:
%    \bibitem{MyArrangement_MyThorn_Author99}
%    {J. Author, {\em The Title of the Book, Journal, or periodical}, 1 (1999),
%    1--16.
%    {\tt http://www.nowhere.com/}}
%
% *======================================================================*

% If you are using CVS use this line to give version information
% $Header: /home/eschnett/C/carpet/Carpet/Carpet/doc/documentation.tex,v 1.2 2002/11/29 10:43:43 schnetter Exp $

\documentclass{article}

% Use the Cactus ThornGuide style file
% (Automatically used from the Cactus distribution; if you have a
% thorn without the Cactus Flesh, download this from the Cactus
% homepage at www.cactuscode.org)
\usepackage{../../../doc/ThornGuide/cactus}
\usepackage{hyperref}

\begin{document}

% The author of the documentation
\author{Erik Schnetter \textless schnetter@uni-tuebingen.de\textgreater}

% The title of the document (not necessarily the name of the Thorn)
\title{Carpet}

% the date your document was last changed; if your document is in CVS,
% please use:
\date{$ $Date: 2002/11/29 10:43:43 $ $}

\maketitle

% Do not delete next line
% START CACTUS THORNGUIDE

% Add all definitions used in this documentation here
%   \def\mydef etc

% Add an abstract for this thorn's documentation
\begin{abstract}
This text describes the Carpet arrangement.  Carpet is a mesh
refinement driver for Cactus that can replace PUGH, the standard
unigrid driver.  Carpet supports multiple refinement levels and
multiple grid patches.  Carpet can run in parallel, but not yet very
efficiently.  Carpet does not yet support multiple grid hierarchies,
i.e.\ shadow hierarchies or automatic convergence tests.
\end{abstract}

\section{Overview}

\subsection{Fixed Mesh Refinement, aka Box-in-Box}

Fixed Mesh Refinement (FMR), also known as box-in-box, is a way to
increase the local resolution in unigrid applications while retaining
the basic unigrid character of an application.  A small number (maybe
two or three) of grids with varying resolution overlay each other,
where the coarsest grid has the largest extent.  This allows the
application to benefit from the higher resolution of the smaller grids
while at the same time keeping the outer boundary far out.  The main
advantage of FMR is that it needs far fewer resources than globally
increasing the resolution.

\subsection{Carpet}

Carpet is the name of an FMR driver, i.e.\ the back end that handles
storage allocation for the grid functions, parallelism, I/O, and the
various inter-grid operations.  Carpet was developed in the early
summer of 2000 by Erik Schnetter \cite{Carpet__erik-schnetter}, then a
research scholar in the Department of Astronomy and Astrophysics
\cite{Carpet__astro-psu-edu} of Penn State University
\cite{Carpet__psu-edu}.  In spring 2001, Carpet was coupled to Cactus
as a drop-in enhancement for the standard unigrid Cactus driver PUGH.

\subsection{Cactus}

From the main Cactus web pages \cite{Carpet__cactuscode-org}:

\begin{quote}
Cactus is an open source problem solving environment designed for
scientists and engineers.  Its modular structure easily enables
parallel computation across different architectures and collaborative
code development between different groups.  Cactus originated in the
academic research community, where it was developed and used over many
years by a large international collaboration of physicists and
computational scientists.
\end{quote}

\section{Introduction}

\subsection{Fixed Mesh Refinement}

A standard way of solving partial differential equations is finite
differences on a regular grid.  This is also called \emph{unigrid}.
Such an application discretises its problem space onto a single,
rectangular grid which has the same grid spacing everywhere.
This grid might be broken up into several parts for parallelisation
purposes, but parallelisation should be transparent to the physics part
of the application.

Increasing the resolution in a unigrid application is somewhat
expensive.  For example, increasing the resolution by a factor of two
requires a factor of eight more storage in three dimensions.  Given a
constant Courant factor, the calculation time will even go up by a
factor of sixteen.  This behaviour makes it easy to find problems that
cannot be solved on contemporary supercomputers, no matter how big and
fast those computers are.

Apart from physical insight, which often has to be used to decrease the
problem size until it fits the current hardware, there are also
numerical and algorithmic methods to decrease the resource requirements
of the application.  Most applications need the high resolution only in
a part of the simulation domain.  Discretisation methods that don't
require a uniform resolution, such as finite elements, can implement
non-uniform resolutions very naturally.  One problem with finite
elements is that many physicists today are not familiar with them, shy
away from their perceived complexity, or are not willing to adapt
existing finite difference code.

Fixed Mesh Refinement (FMR) is a poor man's way of implementing a
non-uniform resolution in a unigrid application with minimal changes to
its structure.  Instead of only one grid, there are several grids or
grid patches with different resolutions.  The coarsest grid usually
encloses the whole simulation domain.  Successively finer grids overlay
the coarse grid at those locations where a higher resolution is needed.
The coarser grids provide boundary conditions to the finer grids
through interpolation.

Instead of updating only one grid, the application has to update all
grids.  The usual approach is to first take a step on the coarsest
grid, and then recursively take several smaller steps on the finer
grids.  The Courant criterion requires that the step sizes on the finer
grids be smaller than on the coarse grid.  The boundary values for the
finer grids are found through interpolation in space and time from the
coarser grid.  In the end, the information on the finer grids is
injected into the coarse grids.

Strictly speaking there is no need for a coarse grid in the regions
covered by the finer grids.  But as stated above, the resources
required for treating the overlapping region on the coarse grid are
only minimal compared to treating the finer grids.  And because a
coarse grid with a hole often creates complications, this obvious
optimisation is usually left out.

\subsection{Carpet}

Carpet is a C++ library that provides infrastructure to describe
regions of varying resolution in a convenient and efficient way.
Carpet contains routines to manage grid hierarchies, including the
relationships between the components of the grid on the different
refinement and convergence levels.  Carpet has a notion of simulation
time and grid spacing, which are necessary for interpolation, and
contains efficient interpolators.

Carpet can run on several processors in parallel using MPI for
communication.  Each grid can be broken down into several components,
and every component has a home processor.  Carpet also contains
operators to move certain regions to a different processor, or to
synchronise all components of a grid.

Carpet is also an arrangement of thorns for Cactus, implementing a
driver and associated I/O routines for both ASCII and binary I/O.
It should be possible to substitute Carpet for the standard Cactus
driver PUGH without changes to the application thorns, and thus use
Carpet as a unigrid driver.  Making use of the FMR capabilities of
Carpet usually requires some rearranging of the application, comparable
in general to the changes necessary for a uniprocessor application to
run on multiple processors.

The driver section of Carpet contains the logic to manage storage for
the grid functions, to traverse the grid hierarchy for all scheduled
routines, and to automatically apply the necessary inter-grid operators
for prolongation (interpolation of the fine grid boundaries) and
restriction (injecting the fine grid information back into the coarse
grid).

The ASCII I/O routines use the quasi-standard gnuplot
\cite{Carpet__gnuplot-info} format.  The binary I/O routines use the
FlexIO library \cite{Carpet__FlexIO} written by John Shalf.  It allows
efficient and platform-independent I/O.  The FlexIO format is based on
HDF \cite{Carpet__HDF} and is also supported by several visualisation
packages.

Carpet is copyrighted by Erik Schnetter, and is available under the
LGPL licence from a CVS \cite{Carpet__CVS} repository.

\subsection{WaveToy}

Cactus comes with a sample application called \emph{WaveToy}, which
solves the scalar wave equation with various initial data and boundary
conditions.  As an example, I have extended WaveToy so that it uses
Carpet's FMR capabilities.  WaveToy serves both as a test case for
Carpet and as an example of how to convert an application to using FMR.

The equation solved by WaveToy is the well known scalar wave equation,
discretised using the Leapfrog method with three time levels, yielding
second order accuracy in space and time.  A typical set of initial data
is a plane wave, and a typical boundary condition is periodicity.
Those allow long term simulations as well as easy and meaningful
comparisons to the analytic solution.

\section{Compiling Cactus With Carpet}

Carpet has been written in C++, using templates and the STL (Standard
Template Library).  Both templates and the STL make writing and
debugging code a lot easier.  Without templates, I would have had to
put much effort into making Carpet support all of Cactus' data types.
Without the STL, I would have had to spend quite some time implementing
basic containers such as lists or sets.  I still had to implement a
custom vector type, because the STL's vector type is optimised for
large vectors only, and I needed three-dimensional vectors of integers.

The inner loops of Carpet are the inter-grid operators, that is, the
routines that copy, restrict, and prolongate between grids.  Thanks to
Cactus it was rather easy to write these in \textsc{Fortran 77}, which
makes them both fast and portable.

Carpet is an arrangement in Cactus.  It can theoretically be compiled
without any other external library, if you don't include the binary I/O
support, which requires FlexIO.  I do recommend using FlexIO, so you
should install the FlexIO library first.  Although FlexIO is already
part of Cactus in the thorn CactusExternal/FlexIO, this seems to be a
version that has FMR support disabled and is hence not usable.  You
will have to install a complete copy of FlexIO by hand.

\subsection{Hurdle 1: FlexIO}

DESCRIBE INSTALLING FLEXIO HERE.  MENTION INSTALLING HDF4 FIRST, THEN
HDF5, THEN FLEXIO.

\subsection{Hurdle 2: STL}

DESCRIBE HOW TO GET/INSTALL AN STL, OR HOW TO MAKE THE SYSTEM'S C++
COMPILER WORK WITH THE STL.
\subsection{Hurdle 3: Templates}

DESCRIBE HOW TO INSTANTIATE, AND NOT-INSTANTIATE, THE TEMPLATES AS
NEEDED.

\subsection{WaveToy}

CONFIGURING, THORNS, COMPILING.

To be continued\ldots

\section{Running The Example Applications}

SAMPLE FILES

CARPET'S OPTIONS

To be continued\ldots

Second order convergence requires second order interpolation in time,
which requires that at least three time levels are present.

\section{Fold Your Own FMR Application}

There are three steps to take from a simple unigrid uniprocessor toy
application to a full-blown FMR multiprocessor production application.
Those steps are almost independent, and I would like to explain them
and their implications in some detail below.

\subsection{Multiple Processors}

Probably the best known of these steps is the step from using one
processor to using several, also known as parallelisation.  Because
many people are already familiar with this step, I will describe it
first.

In a uniprocessor application, it is possible to access every grid
point in an arbitrary manner.  In order to allow multiple processors to
run efficiently in parallel, the grid is broken down into several
rectangular components, and each processor is assigned one of these
components.  The components will usually overlap by a few grid points,
so as to allow the processors to e.g.\ calculate spatial derivatives
(which require neighbouring grid points) without having to communicate
for every grid point.  From time to time it is then necessary to
synchronise the overlapping regions, which is the only time at which
communication happens.  This allows the application to run almost
unchanged, i.e.\ without invoking communication itself.  The
synchronisation routine is provided by the driver and not by the
application.

Of course a serial application will usually have to be changed to
support multiple processors.  In order to do so, all the operations
that the application performs have to be classified into one of two
categories:

One category contains the so-called \emph{local} operations.  These are
operations that are applied to each and every grid point individually,
and that do not depend on any other grid point except nearby
neighbours.  Each local operation will thus involve a loop over all
grid points, and in order to run on multiple processors, the
synchronisation routine has to be called after each such loop.  An
example of a local operation is calculating a spatial derivative.

The other category contains the so-called \emph{global} operations.
These operations do not depend on individual grid points, and thus do
not involve loops over grid points.  The result of a global operation
is the same on all processors; therefore global operations don't
involve communication and don't require synchronisation.  An example of
a global operation is checking how many time steps have been taken, and
deciding whether the simulation should be terminated.

Typically most operations can be classified or rewritten to be either
local or global.  But often there are operations that fit neither
category, and these parts of an application are the hardest to
parallelise.  Applying the boundary conditions, to give another
example, might at first seem to be neither local nor global.  But with
a slight (yet completely correct) stretch of the term ``applied to all
grid points'', boundary conditions can be classified as local; they are
a local operation that just does nothing to most grid points.

To give one more example, calculating an error norm does not fit these
categories; it is neither local nor global.  It is not local, because
the result involves all grid points (and not only nearby neighbours),
and it is not global, because it does involve the grid points.  All
operations that do not fit into these two categories typically require
special handling, and often require hand-coded communication in the
application.  Luckily, calculating various norms is such a common case
that there are special routines for it already present, called
\emph{reduction operators}.
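To make the distinction between local and global operations concrete,
here is a minimal sketch in C.  It is not taken from Carpet or from any
example thorn; the function names, the array layout, and the
\texttt{sync\_grid\_function} call are hypothetical stand-ins for
whatever the application and the driver actually provide.

\begin{verbatim}
/* A *local* operation: a centred first derivative in the x direction.
   It is applied to each grid point individually and uses only nearby
   neighbours, so it loops over the points owned by this processor.   */
void calc_x_derivative (int nx, int ny, int nz, double dx,
                        double const *u, double *dxu)
{
  for (int k = 0; k < nz; ++k)
    for (int j = 0; j < ny; ++j)
      for (int i = 1; i < nx - 1; ++i) {
        int const ind = i + nx * (j + ny * k);
        dxu[ind] = (u[ind + 1] - u[ind - 1]) / (2.0 * dx);
      }
  /* After this loop the overlapping (ghost) points of dxu are stale;
     the driver's synchronisation routine has to be called, e.g.
     sync_grid_function (dxu);   (hypothetical name)                  */
}

/* A *global* operation: it does not touch any grid point, its result
   is the same on every processor, and therefore it needs neither
   communication nor synchronisation.                                 */
int simulation_should_terminate (int iteration, int max_iterations)
{
  return iteration >= max_iterations;
}
\end{verbatim}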
\subsection{Multiple Resolution Levels}

There are several reasons why an application might want to incorporate
more than one grid, overlapping and each with a different resolution.
The most commonly known reason is probably a convergence test, where
the very same problem is treated in different resolutions.  Differences
in the result are then likely caused by insufficient resolution on the
coarser (or on all) grids.  For a convergence test, the grids are
completely independent, and it does not matter whether the simulation
runs on all grids simultaneously or sequentially.  In order to treat
the grids sequentially, the application does not have to be changed at
all.

The reason of interest here is of course FMR.  For FMR, the order in
which the grids are treated is fixed.  As described above, there is
first a time step on the coarse grid, and then recursively several
smaller steps on the finer grids.  This order does require certain
changes in the application.  The sequence of operations that forms a
single time step has to be identified and isolated.  (Which is to say
that there has to be a routine that calculates a time step, that is, a
complete time step, and nothing else.)  It is then the task of the FMR
driver to call this routine for the correct grids in the correct order.

Other reasons for multiple resolution levels are e.g.\ multigrid
algorithms for elliptic equations, which I do not want to mention here,
or shadow hierarchies to determine truncation errors, which I also want
to skip here.  Shadow hierarchies are very similar to the convergence
levels described above.

Apart from the order in which the operations are performed on the
grids, there is one more complication for FMR.  The boundary values of
the finer grids have to be calculated from the coarser grids through
interpolation.  And because the time steps on the finer grids are
smaller, there is not always a corresponding value on the coarser grids
available.  This makes it necessary to interpolate in time between time
steps on the coarser grids.  The alternative would be to take smaller
steps on the coarser grids, and this would be very expensive.

These interpolations in time make it necessary that the driver knows
which grid function contains values corresponding to which time.  The
usual way to achieve this is to have several time levels per grid
function; three time levels allow for a second order interpolation in
time.  Only grid functions with enough time levels can be interpolated,
i.e.\ boundary conditions can be calculated only for those.
Fortunately time levels are rather widespread in applications, so they
are not a new concept to introduce.  Unfortunately they are often
abused, so that values corresponding to the wrong time are stored in a
time level, usually with the excuse of saving storage.  This will in
general not work with FMR, because the driver then cannot interpolate
in time, leading to incorrect values on the boundaries of the finer
grids.
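As an illustration of why three time levels give second order accuracy
in time, consider the standard quadratic Lagrange interpolation formula
(given here only as an illustration, not as a description of the exact
operator that Carpet implements): with values $u^{n}$, $u^{n-1}$,
$u^{n-2}$ stored at the coarse grid times $t^{n}$, $t^{n-1}$,
$t^{n-2}$, the value at an intermediate fine grid time $t$ can be
approximated by
\begin{eqnarray}
u(t) & \approx &
  u^{n}\, \frac{(t - t^{n-1})(t - t^{n-2})}{(t^{n} - t^{n-1})(t^{n} - t^{n-2})}
+ u^{n-1}\, \frac{(t - t^{n})(t - t^{n-2})}{(t^{n-1} - t^{n})(t^{n-1} - t^{n-2})}
\nonumber \\
& &
{} + u^{n-2}\, \frac{(t - t^{n})(t - t^{n-1})}{(t^{n-2} - t^{n})(t^{n-2} - t^{n-1})}
\quad .
\end{eqnarray}
Such a formula is only meaningful if each time level really holds data
from its nominal time, which is why the abuse of time levels mentioned
above does not work with FMR.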
\subsection{Multiple Grid Components}

Sometimes it is convenient to have a simulation domain that is not a
rectangle.  It might instead be an L-shaped simulation domain, or a
domain that consists of two disconnected rectangular regions.  This
issue becomes more important with FMR, because there it is often
convenient to have several disconnected refined regions.  As long as
there are enough processors available, each processor can be assigned a
region or a part thereof, and no new concept need be introduced.  If,
however, there are fewer processors than regions, then a new problem
arises.

A common case for this problem might be a simulation containing just
two refined regions, running on a single processor.  The refined grid
then consists of two components.  The problem is that the two
components cannot be treated sequentially: Imagine the time evolution
routine working on (say) the first component.  At some point it will
call the synchronisation routine.  At that time there are no values
from the second component available, because the second component has
not been treated yet.  Therefore the synchronisation routine cannot
complete.  That means in turn that the time evolution routine cannot
complete its work on the first component, leading to a deadlock.  Work
on neither component can be completed before work on the other
component.

The solution is to break up the time evolution routine into several
smaller routines, each consisting of a single operation that is either
local or global.  (``Local'' and ``global'' have here exactly the same
meanings that were defined above for parallelisation.)  A local
operation works, by definition, on individual grid points.  Hence the
local routines have to be called once for every grid component.  A
global operation, by definition, does not depend on individual grid
points.  Hence it has to be called only once per processor, and not
once per component.  That means that the driver has to be told which
category each individual routine is in.

\subsection{Example}

Let me finish this section with a detailed example.  Suppose you want
to solve the equation
\begin{eqnarray}
   \frac{d}{dt} u & = & f(u) \quad ,
\end{eqnarray}
integrating with the midpoint rule, i.e.\ the simplest second-order
time integration scheme.  Given values at the previous time $u^{n-1}$,
one first calculates a first order solution using an Euler step,
leading to the intermediate result
\begin{eqnarray}
   v^n & = & u^{n-1} + dt\; f(u^{n-1}) \quad .
\end{eqnarray}
The second and final step is then calculated via
\begin{eqnarray}
   u^n & = & u^{n-1} + dt\; f\left(\frac{1}{2} \left[u^{n-1} + v^n\right]\right) \quad .
\end{eqnarray}

The corresponding pseudo code looks like this:
\begin{enumerate}
\item Calculate the Euler step, storing the result into $u^n$
\item Apply boundary conditions to $u^n$
\item Synchronise $u^n$
\item Calculate the average of $u^{n-1}$ and $u^n$, storing the result
      into $v^n$
\item Calculate the second step, storing the result again into $u^n$
\item Apply boundary conditions again to $u^n$
\item Synchronise $u^n$ again
\end{enumerate}

The above algorithm looks a bit different from a naive implementation
of the midpoint rule.  One difference is that both the first and the
second step store their result into $u^n$.  This is necessary because
it would be inconvenient to apply boundary conditions to the
intermediate value $v^n$.  Remember, in order to apply boundary
conditions on the finer grids, there have to be several time levels
present.  With the above scheme, only $u$ needs several time levels;
$v$ is used only as a temporary (and could conceivably be eliminated
completely).
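In code, one complete time step of this scheme might look like the
following sketch.  It is deliberately written as a single C routine to
show the arithmetic of the seven steps; it is not taken from WaveToy or
any other thorn, and the array names, the right hand side function
\texttt{f}, and the routines \texttt{apply\_boundary\_conditions} and
\texttt{synchronise} are hypothetical stand-ins for what the
application and the driver provide.

\begin{verbatim}
#include <stddef.h>

/* Hypothetical helpers provided by the application and the driver. */
double f (double u);                               /* right hand side */
void apply_boundary_conditions (double *u, size_t n);
void synchronise (double *u, size_t n);            /* driver sync     */

/* One complete midpoint-rule time step, following the seven steps
   listed above.  u_prev holds time level n-1, u receives time level
   n, and v is a temporary without time levels.                       */
void midpoint_step (double const *u_prev, double *u, double *v,
                    size_t n, double dt)
{
  /* 1. Euler step, result stored into u (time level n) */
  for (size_t i = 0; i < n; ++i)
    u[i] = u_prev[i] + dt * f (u_prev[i]);

  /* 2. and 3. boundary conditions and synchronisation for u */
  apply_boundary_conditions (u, n);
  synchronise (u, n);

  /* 4. average of time levels n-1 and n, stored into v */
  for (size_t i = 0; i < n; ++i)
    v[i] = 0.5 * (u_prev[i] + u[i]);

  /* 5. second step, result stored again into u */
  for (size_t i = 0; i < n; ++i)
    u[i] = u_prev[i] + dt * f (v[i]);

  /* 6. and 7. boundary conditions and synchronisation again */
  apply_boundary_conditions (u, n);
  synchronise (u, n);
}
\end{verbatim}

In a real FMR application, each of the numbered steps would be a
separately scheduled local or global routine, as discussed above, so
that the driver can call the local ones once per grid component.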
Note also that the first step goes all the way from time level $n-1$ to
time level $n$.  The midpoint rule can be rewritten (in fact, it is
usually written) so that the first step is only a half step, leading to
the time level $n - \frac{1}{2}$.  This is not possible for FMR,
because interpolating to the time $n - \frac{1}{2}$ is not possible,
and thus no boundary conditions could be applied after the first step.

The second thing to note is that the application of the boundary
condition and the synchronisation have been separated rather
artificially.  Normally synchronisation would be considered part of the
boundary condition.  In this case, however, applying the boundary
condition is a local operation, whereas synchronisation counts as a
global operation.  (It is not obvious that synchronisation should be
global, but as the synchronisation routine is a part of Carpet, it was
up to me to decide this.)  As explained above, local and global
operations have to be separated.  Separating the evolution steps and
the boundary condition routines is, on the other hand, just a
notational convenience.  There could well be a single routine
implementing both.

For Cactus, the order in which to call the individual parts of the time
evolution routines is described in the schedule declarations, i.e.\ in
the files called \texttt{schedule.ccl}.  By default a routine is
assumed to be local; global routines have to be tagged with
\texttt{OPTIONS: GLOBAL}.  The tag \texttt{SYNC: groupname} indicates
that the group \texttt{groupname} should be synchronised after the
scheduled routine has been called for all grid components.  This
obviously makes sense only for local routines.  Using the
\texttt{SYNC:} tag is preferred over calling the synchronisation
routine \texttt{CCTK\_SyncGroup} directly.

The example thorn WaveToy in Carpet's arrangement is a bit simpler than
what is described here, because it uses the Leapfrog scheme, which
consists of only a single step.  I would suggest looking at WaveToy as
an initial FMR example.  The thorn SpaceToy is implemented very much in
the way described here.  It evolves two variables, phi and psi, but it
is also coupled to the thorn HydroToy.  This coupling introduces some
additional complications.  The thorn HydroToy, on the other hand, uses
a predictor-corrector scheme, which is also a two-step scheme and thus
more complex than WaveToy.  All the coupling between SpaceToy and
HydroToy is contained in SpaceToy.  I would thus suggest looking at
HydroToy first.

I assume that converting an application to FMR is straightforward once
the handling of the time levels has been straightened out.

\section{Carpet Under The Hood}

To be continued\ldots

\section{Moving Boxes, Adaptive Mesh Refinement}

To be continued\ldots

%% \begin{thebibliography}{9}
%% \end{thebibliography}

\bibliographystyle{amsalpha}   % initials + year
\bibliography{carpet}

% Do not delete next line
% END CACTUS THORNGUIDE

\end{document}