Thesis  
My Ph.D. thesis is about algorithmic aspects of heterogeneous and parallel computing, more precisely about static and dynamic approaches.
The thesis is available in PDF.
The slides of the defense are also available in PDF.

Heterogeneity's impact
Motivation  
In this work, we assess the impact of heterogeneity on the scheduling of independent tasks on master-slave platforms. We assume a realistic one-port model, in which the master can communicate with a single slave at any time-step. We target both on-line and off-line scheduling problems, and we focus on the simpler instances where all tasks have the same size.

While most off-line problems can be solved in polynomial time, on-line problems are much more difficult. Even though the on-line problem can be solved in polynomial time on fully homogeneous platforms, we show that no optimal deterministic on-line algorithm exists for heterogeneous platforms. Whether the heterogeneity comes from computation speeds, from communication bandwidths, or from both, we derive lower bounds on the competitive ratio of any deterministic algorithm. We establish such bounds for the most important objective functions: the minimization of the makespan (or total execution time), the minimization of the maximum response time (difference between completion time and release time), and the minimization of the sum of all response times.
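To make the three objective functions concrete, here is a minimal sketch computing them from a finished schedule. The function name and the sample release/completion times are illustrative assumptions, not taken from the thesis.

```python
# Hypothetical helper computing the three objective functions from a
# schedule; the sample values below are illustrative only.

def objectives(tasks):
    """tasks: list of (release_time, completion_time) pairs."""
    makespan = max(c for _, c in tasks)          # total execution time
    responses = [c - r for r, c in tasks]        # completion minus release
    return makespan, max(responses), sum(responses)

# Example: three unit tasks released at t = 0, 1, 2 and
# completed at t = 2, 3, 5.
mk, max_resp, sum_resp = objectives([(0, 2), (1, 3), (2, 5)])
print(mk, max_resp, sum_resp)  # makespan, max response, sum of responses
```

An on-line algorithm only learns a pair's release time when the task arrives, which is what makes competitive-ratio lower bounds meaningful here.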


Results  
  • When the platform is fully homogeneous (all communication links and all processors are identical), we design an algorithm that is optimal for the on-line problem under all three objective functions (makespan, maximum response time, total completion time).

  • When the communications are homogeneous (but the processors are heterogeneous), we design an optimal makespan minimization algorithm for the off-line problem with release dates. This algorithm generalizes, and provides a new proof of, a result of Simons.

  • When the computations are homogeneous (but the communication links are heterogeneous), we have not been able to derive an optimal makespan minimization algorithm for the off-line problem with release dates, but we provide an efficient heuristic for this problem.

  • For heterogeneous platforms, we show that there does not exist any optimal on-line algorithm. This holds true for the three objective functions (makespan, maximum response time, total completion time). Altogether, we obtain nine lower bounds which nicely assess the impact of heterogeneity on on-line scheduling.


Bag-of-tasks
Motivation  
Scheduling problems are already difficult on traditional parallel machines. They become extremely challenging on heterogeneous clusters, even when embarrassingly parallel applications are considered.

In this work we deal with the problem of scheduling multiple applications, each made of a collection of independent and identical tasks, on a heterogeneous master-worker platform. The applications are submitted on-line, which means that there is no a priori (static) knowledge of the workload distribution at the beginning of the execution. The objective is to minimize the maximum stretch, i.e., the maximum ratio between the time an application has actually spent in the system and the time it would have spent if executed alone.
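The max-stretch objective defined above can be sketched in a few lines. The function name and the sample timings are hypothetical, chosen only to illustrate the ratio.

```python
# Illustrative computation of the maximum stretch; the application
# data below is hypothetical, not taken from the thesis experiments.

def max_stretch(apps):
    """apps: list of (release, completion, time_if_alone) triples.

    Stretch of an application = (completion - release) / time_if_alone.
    """
    return max((c - r) / alone for r, c, alone in apps)

# Two applications: one released at 0, done at 10, needing 5 alone;
# one released at 4, done at 10, needing 3 alone.  Both have stretch 2.
print(max_stretch([(0, 10, 5), (4, 10, 3)]))  # → 2.0
```

Minimizing the maximum stretch favors fairness: a short application delayed by a long one incurs a large stretch even if its absolute delay is small.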


Results  
  • On the theoretical side, we design an optimal algorithm for the offline version of the problem (when all release dates and application characteristics are known beforehand). We also introduce several heuristics for the general case of online applications.

  • On the practical side, we have conducted extensive simulations and MPI experiments, showing that we are able to deal with very large problem instances in a few seconds. Moreover, the solution that we compute clearly outperforms classical heuristics from the literature, thereby fully demonstrating the usefulness of our approach.


Matrix Product
Motivation  
The matrix product has been extensively studied on parallel architectures. Cannon's algorithm, or the ScaLAPACK outer-product algorithm, works well on homogeneous 2D grids. Extending these algorithms to heterogeneous 2D grids has proven surprisingly difficult.
The impact of data movements has been largely neglected: the initial distribution of the matrices to the processors, and the collection of the results, are usually ignored. This is because only O(n^2) coefficients need to be distributed at the beginning, and gathered at the end, as opposed to the O(n^3) computations to be performed. However, for a given platform and problem instance pair, a tuned and efficient algorithm may well perform better than the asymptotically optimal generic version.
  • Bandwidth sharing should be taken into account. When embedding a virtual grid into a parallel cluster, several communications will share the same physical link, thereby decreasing communication speed.

  • In practice, most applications will not generate matrices on the fly but instead use input files from a fixed repository (disk on a data server). Computations will be delegated to available resources in the target architecture, and results will be returned to the repository. This calls for a master-worker paradigm, or more precisely for a computational scheme where the master (the processor holding the input data) assigns computations to other resources, the workers. In the simpler version, the workers only compute and return results, without themselves distributing fractions of their loads to other workers. In a more elaborate scenario, a computing tree is embedded into the platform.

  • Memory constraints play an important role. Fixed-size buffers will have to be assumed to provide realistic algorithms.
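To make the memory constraint concrete, here is a minimal sketch (not the thesis algorithm) of the kind of blocked computation a worker performs: it keeps a single q x q block of C resident and streams in one strip of A and one strip of B per step, so memory use stays fixed regardless of n. Function name, block size, and matrix values are illustrative assumptions.

```python
# Sketch: compute one q x q block of C = A * B while holding only
# that C block plus one strip of A and one strip of B in memory.
# All names and sizes here are illustrative, not from the thesis.

def multiply_block(A, B, i0, j0, q):
    """Compute the block C[i0:i0+q][j0:j0+q] by streaming strips."""
    n = len(A)
    Cblk = [[0.0] * q for _ in range(q)]             # resident C block
    for k in range(n):                               # one strip per step
        a_strip = [A[i0 + i][k] for i in range(q)]   # column k, rows i0..i0+q
        b_strip = [B[k][j0 + j] for j in range(q)]   # row k, cols j0..j0+q
        for i in range(q):
            for j in range(q):
                Cblk[i][j] += a_strip[i] * b_strip[j]
    return Cblk

# 2x2 example, the whole matrix treated as a single block.
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(multiply_block(A, B, 0, 0, 2))  # → [[19.0, 22.0], [43.0, 50.0]]
```

Each strip of 2q received elements is reused for q^2 multiply-add operations, which is the kind of data reuse that a good memory layout maximizes under a fixed buffer size.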


Results  
  • On the theoretical side, we have derived a new, tighter bound on the minimal volume of communications needed to multiply two matrices. From this lower bound, we have defined an efficient memory layout, i.e., an algorithm to share the memory available on the workers among the three matrices.

  • On the practical side, starting from our memory layout, we have designed an algorithm for homogeneous platforms whose performance is quite close to the communication volume lower bound. We have extended this algorithm to deal with heterogeneous platforms. We also adapt the approach for LU factorization.

  • Through MPI experiments, we have shown that our algorithm for heterogeneous platforms has far better performance than solutions using the memory layout proposed by Toledo in 1999. Furthermore, this static heterogeneous algorithm has slightly better performance than dynamic algorithms using the same memory layout, but uses fewer processors, and has a far better worst case. It is therefore a very good candidate for deploying applications on heterogeneous platforms.


Bicriteria
Motivation  


Results  
Work in progress...


Last modification : 2008-12-02 11:27:59