Introduction
Algorithm development is a critical component of problem solving using computers. A sequentialalgorithm is essentially a recipe or a sequence of basic steps for solving a given problem using a
serial computer. Similarly, a parallel algorithm is a recipe that tells us how to solve a given
problem using multiple processors. However, specifying a parallel algorithm involves more than
just specifying the steps. At the very least, a parallel algorithm has the added dimension of
concurrency and the algorithm designer must specify sets of steps that can be executed
simultaneously. This is essential for obtaining any performance benefit from the use of a parallel
computer. In practice, specifying a nontrivial parallel algorithm may include some or all of the
following:
Identifying portions of the work that can be performed concurrently.
Mapping the concurrent pieces of work onto multiple processes running in parallel.
Distributing the input, output, and intermediate data associated with the program.
Managing accesses to data shared by multiple processors.
Synchronizing the processors at various stages of the parallel program execution.
Typically, there are several choices for each of the above steps, but usually, relatively few
combinations of choices lead to a parallel algorithm that yields performance commensurate with
the computational and storage resources employed to solve the problem. Often, different
choices yield the best performance on different parallel architectures or under different parallel
programming paradigms.
In this chapter, we methodically discuss the process of designing and implementing parallel
algorithms. We shall assume that the onus of providing a complete description of a parallel
algorithm or program lies on the programmer or the algorithm designer. Tools and compilers
for automatic parallelization at the current state of the art seem to work well only for highly
structured programs or portions of programs. Therefore, we do not consider these in this
chapter or elsewhere in this book.
3.1 Preliminaries
Dividing a computation into smaller computations and assigning them to different processors for parallel execution are the two key steps in the design of parallel algorithms. In this section, we present some basic terminology and introduce these two key steps in parallel algorithm design using matrix-vector multiplication and database query processing as examples.
3.1.1 Decomposition, Tasks, and Dependency Graphs
The process of dividing a computation into smaller parts, some or all of which may potentially
be executed in parallel, is called decomposition. Tasks are programmer-defined units of computation into which the main computation is subdivided by means of decomposition. Simultaneous execution of multiple tasks is the key to reducing the time required to solve the entire problem. Tasks can be of arbitrary size, but once defined, they are regarded as indivisible units of computation. The tasks into which a problem is decomposed may not all be of the same size.
Note that all tasks in Figure 3.1 are independent and can be performed all together or in any
sequence. However, in general, some tasks may use data produced by other tasks and thus
may need to wait for these tasks to finish execution. An abstraction used to express such
dependencies among tasks and their relative order of execution is known as a taskdependency
graph. A task-dependency graph is a directed acyclic graph in which the nodes
represent tasks and the directed edges indicate the dependencies amongst them. The task
corresponding to a node can be executed when all tasks connected to this node by incoming
edges have completed. Note that task-dependency graphs can be disconnected and the edgeset
of a task-dependency graph can be empty. This is the case for matrix-vector multiplication,
where each task computes a subset of the entries of the product vector. To see a more
interesting task-dependency graph, consider the following database query processing example.
3.1.2 Granularity, Concurrency, and Task-Interaction
The number and size of tasks into which a problem is decomposed determines the granularity of the decomposition. A decomposition into a large number of small tasks is called fine-grained and a decomposition into a small number of large tasks is called coarse-grained. For example, the decomposition for matrix-vector multiplication shown in Figure 3.1 would usually be considered fine-grained because each of a large number of tasks performs a single dot-product. Figure 3.4 shows a coarse-grained decomposition of the same problem into four tasks, where each tasks computes n/4 of the entries of the output vector of length n.
A concept related to granularity is that of degree of concurrency. The maximum number of tasks that can be executed simultaneously in a parallel program at any given time is known as its maximum degree of concurrency. In most cases, the maximum degree of concurrency is less than the total number of tasks due to dependencies among the tasks. For example, the maximum degree of concurrency in the task-graphs of Figures 3.2 and 3.3 is four. In these task-graphs, maximum concurrency is available right at the beginning when tables for Model, Year, Color Green, and Color White can be computed simultaneously. In general, for task dependency graphs that are trees, the maximum degree of concurrency is always equal to the number of leaves in the tree. A more useful indicator of a parallel program's performance is the average degree of concurrency, which is the average number of tasks that can run concurrently over the entire duration of execution of the program. Both the maximum and the average degrees of concurrency usually increase as the granularity of tasks becomes smaller (finer). For example, the decomposition of matrix-vector multiplication shown in Figure 3.1 has a fairly small granularity and a large degree of concurrency. The decomposition for the same problem shown in Figure 3.4 has a larger granularity and a smaller degree of concurrency. The degree of concurrency also depends on the shape of the task-dependency graph and the same granularity, in general, does not guarantee the same degree of concurrency. For example, consider the two task graphs in Figure 3.5, which are abstractions of the task graphs of Figures 3.2 and 3.3, respectively (Problem 3.1). The number inside each node represents the amount of work required to complete the task corresponding to that node. The average degree of concurrency of the task graph in Figure 3.5(a) is 2.33 and that of the task graph in Figure 3.5(b) is 1.88 (Problem 3.1), although both task-dependency graphs are based on the same decomposition.
A feature of a task-dependency graph that determines the average degree of concurrency for a
given granularity is its critical path. In a task-dependency graph, let us refer to the nodes with
no incoming edges by start nodes and the nodes with no outgoing edges by finish nodes. The
longest directed path between any pair of start and finish nodes is known as the critical path.
The sum of the weights of nodes along this path is known as the critical path length, where
the weight of a node is the size or the amount of work associated with the corresponding task.
The ratio of the total amount of work to the critical-path length is the average degree of
concurrency. Therefore, a shorter critical path favors a higher degree of concurrency. For
example, the critical path length is 27 in the task-dependency graph shown in Figure 3.5(a) and
is 34 in the task-dependency graph shown in Figure 3.5(b). Since the total amount of work
required to solve the problems using the two decompositions is 63 and 64, respectively, the
average degree of concurrency of the two task-dependency graphs is 2.33 and 1.88,
respectively.
Although it may appear that the time required to solve a problem can be reduced simply by
increasing the granularity of decomposition and utilizing the resulting concurrency to perform
more and more tasks in parallel, this is not the case in most practical scenarios. Usually, there
is an inherent bound on how fine-grained a decomposition a problem permits. For instance,
there are n2 multiplications and additions in matrix-vector multiplication considered in Example
3.1 and the problem cannot be decomposed into more than O(n2) tasks even by using the most
fine-grained decomposition. Other than limited granularity and degree of concurrency, there is another important practical factor that limits our ability to obtain unbounded speedup (ratio of serial to parallel execution time) from parallelization. This factor is the interaction among tasks running on different physical processors. The tasks that a problem is decomposed into often share input, output, or intermediate data. The dependencies in a task-dependency graph usually result from the fact that the output of one task is the input for another. For example, in the database query example, tasks share intermediate data; the table generated by one task is often used by another task as input. Depending on the definition of the tasks and the parallel programming paradigm, there may be interactions among tasks that appear to be independent in a task dependency graph. For example, in the decomposition for matrix-vector multiplication, although all tasks are independent, they all need access to the entire input vector b. Since originally there is only one copy of the vector b, tasks may have to send and receive messages for all of them to access the entire vector in the distributed-memory paradigm. The pattern of interaction among tasks is captured by what is known as a task-interaction graph. The nodes in a task-interaction graph represent tasks and the edges connect tasks that interact with each other. The nodes and edges of a task-interaction graph can be assigned weights proportional to the amount of computation a task performs and the amount of interaction that occurs along an edge, if this information is known. The edges in a task interaction graph are usually undirected, but directed edges can be used to indicate the direction of flow of data, if it is unidirectional. The edge-set of a task-interaction graph is usually a superset of the edge-set of the task-dependency graph. In the database query example discussed earlier, the task-interaction graph is the same as the task-dependency graph. We now give an example of a more interesting task-interaction graph that results from the problem of sparse matrix-vector multiplication.
No comments:
Post a Comment