\documentclass[12pt]{article}
\usepackage{amsmath} % need for subequations
\usepackage{graphicx} % need for figures
\usepackage{verbatim} % useful for program listings
\usepackage{color} % use if color is used in text
\usepackage{subfigure} % use for side-by-side figures
\usepackage{hyperref} % use for hypertext links, including those to external documents and URLs
\setlength{\baselineskip}{16.0pt} % 16 pt usual spacing between lines
\setlength{\parskip}{3pt plus 2pt}
\setlength{\parindent}{20pt}
\setlength{\oddsidemargin}{0.5cm}
\setlength{\evensidemargin}{0.5cm}
\setlength{\marginparsep}{0.75cm}
\setlength{\marginparwidth}{2.5cm}
\setlength{\marginparpush}{1.0cm}
\setlength{\textwidth}{150mm}
\begin{comment}
\pagestyle{empty}
\end{comment}
\begin{document}
\begin{center}
{\large Hybrid Partitioning in Zoltan} \\
Nick Aase, Karen Devine \\
Summer, 2011
\end{center}
\section{Introduction}
When used for partitioning, Zoltan has a wide range of algorithms
available to it. Traditionally these have fallen into two categories:
geometric-based partitioning and topology-based partitioning. Each
method has its own strengths and weaknesses, which ultimately come down
to the tradeoff between speed and quality, and the onus is on the user
to determine which is more desirable for the project at hand.

In our project we strove to develop a hybrid partitioning algorithm:
one that attempts to take advantage of the efficiency of geometric
methods as well as the precision of topological ones. The reasoning
behind this concept is that problem sets with large amounts of data may
be more easily digestible by topological methods if they are first
reduced into manageable pieces based on their geometry.

The two subjects chosen for this project were the Recursive
Coordinate Bisection (RCB) algorithm and Parallel Hypergraph
partitioning (PHG). RCB is an extremely fast method of partitioning,
but it can be clumsy at times when it ``cuts'' across a coordinate plane.
On the other hand, PHG has a good understanding of the relationships
between data, making its partitioning quite accurate, but it suffers
from having to spend a great deal of time finding those relationships.

For further information on implementing hybrid partitioning, please see
the developer's guide at
\url{http://www.cs.sandia.gov/Zoltan/dev_html/dev_hybrid.html}.
\section{Parallel hypergraphs and geometric input}
In order for Zoltan to support hybrid partitioning, it is necessary
to properly and frequently obtain, preserve, and communicate coordinate
data. The first step was to modify PHG to support coordinate
information. Hypergraph objects already carry a substantial
amount of data, but we had to add an array of floating-point
values to store the coordinates. Currently, when a hypergraph is built and
geometric information is available from the input, each vertex has
a corresponding subset within the array defining its coordinates;
that is, $\forall\, v_x \in H,\ \exists\, C_x = \{c_0, c_1, \ldots, c_{n-1}\}$,
where $v_x$ is an arbitrary vertex in the hypergraph $H$, $C_x$ is its
corresponding coordinate subset, and $n$ is the number of dimensions in
the system. In this way, Zoltan can treat each coordinate subset as an
element of that vertex.
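For illustration, the coordinate array can be pictured as a flat,
vertex-major block of $|V| \times n$ values. The C sketch below shows
only this layout; the type and field names are hypothetical and do not
reflect Zoltan's actual internals.
\begin{verbatim}
/* Minimal sketch of the flat coordinate array described above.
 * The struct and field names are illustrative only. */
#include <stddef.h>

typedef struct {
    int     nVtx;   /* number of local vertices         */
    int     nDim;   /* n, the dimensionality            */
    double *coords; /* nVtx * nDim values, vertex-major */
} CoordArray;

/* Return a pointer to C_x, the coordinate subset of vertex x. */
static double *vertex_coords(CoordArray *h, int x)
{
    return h->coords + (size_t)x * h->nDim;
}
\end{verbatim}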
\section{PHG, MPI and 2-dimensional representation}
PHG is interesting in that multiple processors can share partial data
that describes the properties of hyperedges and vertices. This sort of
system can be represented in a 2-dimensional distribution similar to
Table~\ref{tab:0/tc}. A populated field indicates that the processor on
the y-axis has data related to the vertex on the x-axis. In this example,
you can see that processors $P_0$ and $P_2$ share data describing
vertices $v_0$ and $v_2$.
\begin{table}[h]
\begin{center}
\begin{tabular}{|r|l|l|l|}
\hline
Processor & $v_0$ & $v_1$ & $v_2$ \\
\hline
$P_0$ & x & & x \\
\hline
$P_1$ & & x & \\
\hline
$P_2$ & x & & x \\
\hline
\end{tabular}
\caption{\label{tab:0/tc} Before communication}
\end{center}
\end{table}
Using Message Passing Interface (MPI) communicators, it is possible to
communicate with processors by column. We use an \texttt{MPI\_Allreduce}
call to collect the data from each processor in a column, which gathers
it into a usable form. Consider Table~\ref{tab:1/tc}.
\begin{table}[h]
\begin{center}
\begin{tabular}{|r|l|l|l|}
\hline
Processor & $v_0$ & $v_1$ & $v_2$ \\
\hline
$P_0$ & x & & \\
\hline
$P_1$ & & x & \\
\hline
$P_2$ & & & x \\
\hline
\end{tabular}
\caption{\label{tab:1/tc} After communication}
\end{center}
\end{table}
This same sort of operation is performed on weight data, so implementing
it for coordinate data was simply another step in setting up PHG to support
coordinate information from the input. Afterwards, the entirety of a vertex's
data is unique to a single processor, and the number of global vertices
equals $\sum_{i=0}^{p-1} n_i$, where $p$ is the number of processors and
$n_i$ is the number of local vertices on processor $i$.
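The sketch below illustrates, under assumed names, how such a
column-wise reduction can be expressed with one MPI communicator per
column; it is not Zoltan's actual code.
\begin{verbatim}
/* Sketch of a column-wise MPI_Allreduce, as described above.
 * All variable names are illustrative. */
#include <mpi.h>

void reduce_column_data(MPI_Comm comm, int myColumn, int myRank,
                        double *partial, double *combined,
                        int nVtxInColumn)
{
    MPI_Comm colComm;

    /* Group the processors that share a column of the 2-D layout. */
    MPI_Comm_split(comm, myColumn, myRank, &colComm);

    /* Sum each vertex's partial contributions; a processor with no
     * data for a vertex contributes zero. */
    MPI_Allreduce(partial, combined, nVtxInColumn,
                  MPI_DOUBLE, MPI_SUM, colComm);

    MPI_Comm_free(&colComm);
}
\end{verbatim}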
\section{Matching}
There are several matching methods already native to Zoltan and specific to
PHG, but we needed to create a new method in order to use RCB on the
hypergraph data. Before the actual matching occurs, several specialized
callbacks and parameters are registered. This step is crucial if RCB and PHG
are to interface properly with each other.
The next task is to actually call RCB. It was easy enough to send PHG
data to RCB, as we simply used the \texttt{Zoltan\_LB\_Partition} wrapper,
not unlike other standard load-balancing partitioners. However, getting
matchings \emph{back} from RCB to PHG was another matter entirely. Thanks to
Dr. Devine's work, we were able to effectively commandeer one of RCB's unused
return values: since all matching algorithms conform syntactically to the
aforementioned load-balancing wrapper, some arguments and/or return values
are never used, depending on what data the partitioner needs. In
the case of RCB, the return value \texttt{*export\_global\_ids}, which is
defined in its prototype, was never actually computed. Dr. Devine was able
to rewire RCB so that, when using hybrid partitioning, it returns the
IDs of the matchings we need for each hypergraph (referred to in
the matching procedure as \emph{candidates}).
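The sketch below shows what such a call looks like. The argument list
follows the public \texttt{Zoltan\_LB\_Partition} interface documented in
the Zoltan User's Guide; the surrounding function and its use of the
candidates are purely illustrative.
\begin{verbatim}
/* Illustrative sketch only: invoke RCB through the standard
 * load-balancing wrapper.  In hybrid mode, export_global_ids
 * comes back holding the candidate IDs for the matching step. */
#include "zoltan.h"

int rcb_match_sketch(struct Zoltan_Struct *zz)
{
    int changes, num_gid_entries, num_lid_entries;
    int num_import, num_export;
    ZOLTAN_ID_PTR import_global_ids, import_local_ids;
    ZOLTAN_ID_PTR export_global_ids, export_local_ids;
    int *import_procs, *import_to_part;
    int *export_procs, *export_to_part;

    Zoltan_Set_Param(zz, "LB_METHOD", "RCB");

    int ierr = Zoltan_LB_Partition(zz, &changes,
        &num_gid_entries, &num_lid_entries,
        &num_import, &import_global_ids, &import_local_ids,
        &import_procs, &import_to_part,
        &num_export, &export_global_ids, &export_local_ids,
        &export_procs, &export_to_part);

    /* ... read the candidate IDs out of export_global_ids ... */

    /* Release the Zoltan-allocated arrays. */
    Zoltan_LB_Free_Part(&import_global_ids, &import_local_ids,
                        &import_procs, &import_to_part);
    Zoltan_LB_Free_Part(&export_global_ids, &export_local_ids,
                        &export_procs, &export_to_part);
    return ierr;
}
\end{verbatim}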
This new matching procedure is similar to PHG's agglomerative matching,
whereby candidate vertices are selected to represent groups of similar
vertices. These candidates then make up the standard vertices in the
resultant coarse hypergraph. The major difference is that standard
agglomerative matching determines its candidates by the connectivity of
vertices to one another: the more heavily connected a subset of vertices
is, the more likely its members are to share the same candidate. Using RCB
means assuming that related vertices will be geometrically close:
recursive geometric cuts are more likely to naturally bisect
less-connected parts of the hypergraph, so the vertices that fall within
the same resulting subdomain will share the same candidate. Given RCB's
track record, this method should be significantly faster than
agglomerative matching.
\section{Reduction factor}
When using hybrid partitioning, the user passes a parameter in the input
file called \texttt{HYBRID\_REDUCTION\_FACTOR}, a number $f$ with
$0 < f \leq 1$ that gets passed into RCB. This parameter defines the
aggressiveness of the overall procedure: it simply determines the factor
by which the fine hypergraph is reduced, $|V_c| = f\,|V_f|$. For example,
for an original, fine hypergraph $H_f$ with $|V_f| = 1000$ vertices and a
reduction factor $f = 0.1$, the coarse hypergraph $H_c$ will have
$|V_c| = 100$ vertices.
This gives the user more control over the balance between quality
and efficiency.
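As a usage sketch, assuming an already-created and configured
\texttt{struct Zoltan\_Struct *zz}, the factor is set through the
standard Zoltan parameter interface:
\begin{verbatim}
/* Usage sketch: request hypergraph partitioning with a 10:1
 * RCB-based coarsening (1000 vertices -> roughly 100). */
Zoltan_Set_Param(zz, "LB_METHOD", "HYPERGRAPH");
Zoltan_Set_Param(zz, "HYBRID_REDUCTION_FACTOR", "0.1");
\end{verbatim}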
\section{Results}
We ran experiments primarily with 2 and 128 processors on the Odin cluster
at Sandia National Labs, though there were brief, undocumented forays with
16 and 32 processors as well. Each Odin node has two 2.2GHz AMD Opteron
processors and 4GB of RAM, and the nodes are connected by a Myrinet network
\cite{Catalyurek}. The partitioning methods used were RCB, PHG, and hybrid
partitioning with reduction factors of 0.01, 0.05, and 0.1. Each run went
through 10 iterations of the scenario. The runs with 128 processors were
given 5 different meshes to run on, whereas the 2-processor runs only ran
on the 4 smaller meshes, as the cluster was undergoing diagnostics at the
time of the experiments.
%NEED TIMES @ 128 PROCS
\begin{figure}[htp]
\centering
\includegraphics[width=\textwidth, height=80mm]{128_time.pdf}
\caption{Runtimes on 128 processors}\label{fig:Times_np_128}
\end{figure}
%NEED cutl @ 128 PROCS
\begin{figure}[htp]
\centering
\includegraphics[width=\textwidth, height=70mm]{128_cutl.pdf}
\caption{Cuts on 128 processors}\label{fig:Cuts_np_128}
\end{figure}
You can see from Figures~\ref{fig:Times_np_128} and~\ref{fig:Cuts_np_128}
that at 128 processors the hybrid methods
are mainly slower than PHG and less accurate than RCB: both results are
the inverse of what we had hoped. There was better news, though, in
looking at where the processes were spending their time:
%timer breakdowns for 128
\begin{figure}[htp]
\centering
\includegraphics[width=\textwidth, height=70mm]{128_breakdown_percent.pdf}
\caption{Timing by percentage on 128 processors (UL, Shockstem 3D; UR,
Shockstem 3D -- 108; LL, RPI; LR, Slac1.5)}\label{fig:Percent_np_128}
\end{figure}
The dramatic decrease in the matching time meant that RCB was, indeed,
helping on that front.
When we ran our simulations in serial (on 2 processors), however, we saw
some very different results:
%times, cutl
\begin{figure}[htp]
\centering
\includegraphics[width=\textwidth, height=80mm]{2_time.pdf}
\caption{Runtimes in serial on 2 processors}\label{fig:Times_np_2}
\end{figure}
%NEED cutl @ 2 PROCS
\begin{figure}[htp]
\centering
\includegraphics[width=\textwidth, height=70mm]{2_cutl.pdf}
\caption{Cuts in serial on 2 processors}\label{fig:Cuts_np_2}
\end{figure}
In general the hybrid times beat the PHG times, and the hybrid cuts beat
the RCB cuts.
%time breakdowns for 2
\begin{figure}[htp]
\centering
\includegraphics[width=\textwidth, height=70mm]{2_breakdown_percent.pdf}
\caption{Timing by percentage on 2 processors (UL, Shockstem 3D; UR,
Shockstem 3D -- 108; LL, RPI; LR, Slac1.5)}\label{fig:Percent_np_2}
\end{figure}
Looking at individual timers in this serial run, we can see that RCB has
still drastically reduced the matching time. In addition, the slowdown in
the coarse partitioning has been greatly reduced.
\section{Conclusion and discussion}
The parallel implementation of hybrid partitioning is obviously not
functioning as desired, but we believe that there is ultimately a great
deal of promise in this method. The results from our serial runs
are encouraging, and it would be worth the effort to continue forward.

Perhaps it would be helpful to check for any communication issues arising
between processors; the whole system could drag if a single processor
were left waiting for a message. Additionally, Dr. Catalyurek had
suggested using RCB-based coarsening only on the largest, most complex
hypergraphs, and then reverting to standard agglomerative matching for
coarser iterations.
At this moment, there could be four different ways to use Dr. Catalyurek's
method. The first, and perhaps simplest, would be to hardwire
the number of coarsening levels given to RCB. A second would be
to define a new parameter to allow the user to select the number of
RCB-based coarsenings. A third would be to write a short algorithm to
determine and use the optimal number of levels based on the input.
Finally, user input could be offered as an option, with a default drawn
from one of the other approaches.
\begin{thebibliography}{5}
\bibitem{Catalyurek}U.V. Catalyurek, E.G. Boman, K.D. Devine, D. Bozdag,
R.T. Heaphy, and L.A. Riesen. \emph{A Repartitioning Hypergraph Model
for Dynamic Load Balancing.} Sandia National Labs, 2009.
\end{thebibliography}
{\small \noindent August 2011.}
\end{document}