We have seen how we can add corresponding elements sequentially, performing the additions one at a time on each element of arrays A and B and placing the result in array C. Now suppose we want to put multiple processing units to work on this problem concurrently. There are a few different ways to accomplish this. Using the example from the previous section, let’s look at one possible way that we might map the computation onto processing units:
Here we see our original problem with eight elements in each array. Processing unit 0 would work on the first two elements, processing unit 1 on the next two, and so on. Theoretically, using four processing units executing concurrently, the work should get done in one-fourth the time it takes to run sequentially. In practice, this never quite happens, because of the overhead of starting the processing units and possible contention for shared resources, such as the arrays themselves.
For many larger problems with more sophisticated computations, however, this type of decomposition of the computation over the data elements is very common and can speed up your program. There is no standard terminology for this version of data decomposition; we like to refer to it as decomposing into equal-sized chunks. There are other possible types of decomposition and other mappings of processing units to elements (perhaps some have been running through your mind). We limit ourselves to this particular way of decomposing, because:
Next we will examine example code for three combinations of parallel software and hardware: