
Motivation


In this work, we assess the impact of
heterogeneity on the scheduling of independent tasks on master-slave
platforms. We assume a realistic one-port model, where the master can
communicate with at most one slave at any time step. We target both online
and offline scheduling problems, and we focus on simpler instances where
all tasks have the same size.
While most offline problems can be solved in polynomial time, online
problems are much more difficult. Even though the online problem can be
solved in polynomial time on fully homogeneous platforms, we show that
no optimal deterministic online algorithm exists for heterogeneous
platforms. Whether heterogeneity comes from computation speeds, from
communication bandwidths, or from both, we establish lower bounds on
the competitive ratio of any deterministic algorithm. We derive such
bounds for the most important objective functions: the minimization of
the makespan (or total execution time), the minimization of the maximum
response time (the difference between completion time and release
time), and the minimization of the sum of all response times.
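For reference, the three objective functions can all be computed from a schedule's release and completion times. The sketch below uses hypothetical per-task time lists, not data from this work:

```python
# Sketch: computing the three objective functions from a schedule.
# 'release' and 'completion' are hypothetical per-task time lists.

def objectives(release, completion):
    """Return (makespan, max response time, sum of response times)."""
    makespan = max(completion)
    responses = [c - r for r, c in zip(release, completion)]
    return makespan, max(responses), sum(responses)

# Example: three tasks released at t = 0, 1, 2 and
# completed at t = 2, 3, 5 under some schedule.
print(objectives([0, 1, 2], [2, 3, 5]))  # → (5, 3, 7)
```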


Results



When the platform is fully homogeneous (all communication links and all processors are identical), we design an algorithm that is optimal for the online problem under three different objective functions (makespan, maximum response time, total completion time).

When the communications are homogeneous (but the processors are heterogeneous), we design an optimal makespan-minimization algorithm for the offline problem with release dates. This algorithm generalizes, and provides a new proof of, a result of Simons.

When the computations are homogeneous (but the communication links are heterogeneous), we were not able to derive an optimal makespan-minimization algorithm for the offline problem with release dates, but we provide an efficient heuristic for this problem.

For heterogeneous platforms, we show that no optimal online algorithm exists. This holds for all three objective functions (makespan, maximum response time, total completion time). Altogether, we obtain nine lower bounds, which nicely assess the impact of heterogeneity on online scheduling.






Motivation


Scheduling problems are already difficult on traditional parallel
machines. They become extremely challenging on heterogeneous
clusters, even when embarrassingly parallel applications are
considered.
In this work we deal with the problem of scheduling
multiple applications, each made of a collection of independent and
identical tasks, on a heterogeneous master-worker platform. The
applications are submitted online, meaning that there is no a
priori (static) knowledge of the workload distribution at the
beginning of the execution. The objective is to minimize the
maximum stretch, i.e., the maximum over all applications of the ratio
between the time an application has actually spent in the system and
the time it would have spent if executed alone.
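The stretch objective can be made concrete with a small sketch. All names below are illustrative; for each application we assume we know its release time, completion time, and the time it would need if executed alone on the platform:

```python
# Sketch of the maximum-stretch objective (illustrative, not the
# algorithm of this work). Each application is a tuple of
# (release time, completion time, time needed if executed alone).

def max_stretch(apps):
    """Maximum ratio of time-in-system to time-alone over all apps."""
    return max((done - rel) / alone for rel, done, alone in apps)

# Two applications: the first spends 10 time units in the system but
# needs only 4 alone (stretch 2.5); the second spends 6 vs 6 (stretch 1).
print(max_stretch([(0, 10, 4), (2, 8, 6)]))  # → 2.5
```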


Results



On the theoretical side, we design an optimal algorithm for the offline version of the problem (when all release dates and application characteristics are known beforehand). We also introduce several heuristics for the general case of online applications.

On the practical side, we have conducted extensive simulations and MPI experiments, showing that we are able to handle very large problem instances in a few seconds. Moreover, the solutions that we compute clearly outperform classical heuristics from the literature, demonstrating the usefulness of our approach.







Motivation


Matrix product has been extensively studied on parallel
architectures. Cannon's algorithm, or the ScaLAPACK outer-product
algorithm, work well on homogeneous 2D grids. Extending these
algorithms to heterogeneous 2D grids has proven to be
surprisingly difficult.
The impact of data movements has been largely neglected: the
initial distribution of the matrices to the processors, and the
collection of the results, are usually ignored. This is because only
O(n^2) coefficients need to be distributed at the beginning, and
gathered at the end, as opposed to the O(n^3) computations to be
performed. However, for a given platform and problem instance
pair, a tuned and efficient algorithm may well perform better than
the asymptotically optimal generic version.
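The O(n^2)-versus-O(n^3) contrast is plain arithmetic (taking the standard 2n^3 flops for the product of two n-by-n matrices):

```python
# Ratio of data volume to computation for an n x n matrix product:
# O(n^2) coefficients moved versus O(n^3) operations performed.

def comm_to_comp(n):
    coeffs = 3 * n * n   # coefficients of A, B and C
    flops = 2 * n ** 3   # multiply-adds of the standard triple loop
    return coeffs / flops

# The ratio vanishes as n grows, yet communications can still
# dominate on a slow network for moderate n.
print(comm_to_comp(1000))  # → 0.0015
```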

Bandwidth sharing should be taken into account. When embedding a virtual grid into a parallel cluster, several communications will share the same physical link, thereby decreasing the effective communication speed.

In practice, most applications will not generate matrices on the fly but instead use input files from a fixed repository (a disk on a data server). Computations will be delegated to available resources in the target architecture, and results will be returned to the repository. This calls for a master-worker paradigm, or more precisely for a computational scheme where the master (the processor holding the input data) assigns computations to other resources, the workers. In the simplest version, the workers only compute and return results, without themselves distributing fractions of their load to other workers. In a more elaborate scenario, a computing tree is embedded into the platform.

Memory constraints play an important role: fixed-size buffers have to be assumed to obtain realistic algorithms.


Results



On the theoretical side, we have derived a new, tighter, bound on the minimal volume of communications needed to multiply two matrices. From this lower bound, we have defined an efficient memory layout, i.e., an algorithm to share the memory available on the workers among the three matrices.
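The exact memory layout derived in this work is not reproduced here, but a natural scheme of this kind (an illustrative assumption, not necessarily the split obtained above) keeps a mu-by-mu square of blocks of the result C resident on a worker while streaming mu blocks of A and mu blocks of B per step, so that mu^2 + 2*mu blocks must fit in the worker's memory:

```python
# Illustrative memory split (assumed scheme, not the exact layout of
# this work): mu^2 blocks of C plus mu blocks of A and mu of B resident,
# given room for m blocks in total.
from math import isqrt

def largest_mu(m):
    """Largest mu with mu*mu + 2*mu <= m."""
    mu = isqrt(m)  # starting guess near sqrt(m)
    while mu * mu + 2 * mu > m:
        mu -= 1
    return mu

# With room for 15 blocks, a 3x3 square of C plus 3+3 strips fits: 9+6=15.
print(largest_mu(15))  # → 3
```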

On the practical side, starting from our memory layout, we have designed an algorithm for homogeneous platforms whose performance is quite close to the communication volume lower bound. We have extended this algorithm to deal with heterogeneous platforms. We also adapt the approach for LU factorization.

Through MPI experiments, we have shown that our algorithm for heterogeneous platforms has far better performance than solutions based on the memory layout proposed by Toledo in 1999. Furthermore, this static heterogeneous algorithm achieves slightly better performance than dynamic algorithms using the same memory layout, while using fewer processors and exhibiting far better worst-case behavior. It is therefore a very good candidate for deploying applications on heterogeneous platforms.






Results


Work in progress...





