Guest post by Dan Garber and Elad Hazan: The Conditional Gradient Method, A Linearly Convergent Algorithm – Part II/II

The goal of this post is to develop a Conditional Gradient method that converges exponentially fast while basically solving only a linear minimization problem over the domain \mathcal{X} on each iteration.

To this end, we consider the following relaxation of the problem \min_{y\in\mathcal{X}\cap\mathbb{B}_{r_t}(x_t)}\nabla{}f(x_t)^{\top}y, which we term a Local Linear Oracle (LLO).

Definition 1: An LLO with parameter \rho for the convex domain \mathcal{X} is a procedure that, given a point x\in\mathcal{X}, a scalar r>0 and a linear objective c\in\mathbb{R}^n, returns a point y\in\mathcal{X} that satisfies:

1. \forall{z\in\mathcal{X}\cap\mathbb{B}_{r}(x)}:\, c^{\top}y \leq c^{\top}z

2. \Vert{x - y}\Vert_2 \leq \rho\cdot{}r

 

Clearly, by taking y_t to be the output of such an LLO and choosing \gamma_t appropriately, we would get exponentially fast convergence of the form \exp\left({-O\left({\frac{t}{Q\rho^2}}\right)}\right) (see Part I for more details).

In the following we show that an LLO can be constructed for any polytope such that each call to the LLO amounts to a single linear minimization step over the domain \mathcal{X}, and we specify the parameter \rho.

As an easy first case, let us consider the specific setting in which \mathcal{X} = \Delta_n, that is, the domain is just the n-dimensional simplex. An LLO in this case can be constructed by solving the following optimization problem,

    \begin{eqnarray*} &&\min_{y\in\Delta_n}y^{\top}c \\ && \textrm{s.t. } \Vert{x-y}\Vert_1 \leq \sqrt{n}r \end{eqnarray*}

where \Vert{x-y}\Vert_1 = \sum_{i=1}^n\vert{x(i)-y(i)}\vert.

Denote by y^* the optimal solution to the above problem. For any z\in\Delta_n\cap\mathbb{B}_r(x) it holds that \Vert{x-z}\Vert_1 \leq \sqrt{n}\Vert{x-z}\Vert_2 \leq \sqrt{n}r, so every such z is feasible for the above problem, and hence c^{\top}y^* \leq c^{\top}z for all z\in\Delta_n\cap\mathbb{B}_r(x). Moreover it holds that \Vert{y^*-x}\Vert_2 \leq \Vert{y^*-x}\Vert_1 \leq \sqrt{n}r. Thus solving the above \ell_1-constrained linear problem yields an LLO for the simplex with parameter \rho = \sqrt{n}.

Most importantly, the above \ell_1-constrained problem is solvable using only a single linear minimization step over \Delta_n plus additional computation that is polynomial in the number of non-zeros in the input point x. To see this, observe that y^* is just the outcome of taking weight \sqrt{n}r/2 from the non-zero entries of x that correspond to the largest (signed) entries in the vector c and moving it entirely to the single entry that corresponds to the smallest (signed) entry in c. This only requires checking, for each non-zero entry of x, the value of the corresponding entry in c, sorting these values, and reducing weights until a total weight of \sqrt{n}r/2 has been moved. Finding the smallest entry in c, although a trivial operation, is exactly a linear minimization problem over the simplex.
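
To make the procedure concrete, here is a minimal sketch in Python (the function name simplex_llo and the dense-vector representation are ours; a serious implementation would exploit the sparsity of x as discussed above):

```python
import numpy as np

def simplex_llo(x, c, r):
    """LLO for the simplex: solve  min c^T y  over y in Delta_n
    subject to  ||x - y||_1 <= sqrt(n) * r  by the greedy weight-moving
    procedure described above (x, c: numpy arrays)."""
    n = len(x)
    y = x.astype(float).copy()
    j = np.argmin(c)                # smallest entry of c: a linear min over Delta_n
    budget = np.sqrt(n) * r / 2.0   # moving mass delta changes the l1 norm by 2*delta
    # Move weight from the entries with the largest values of c to entry j.
    for i in np.argsort(-c):
        if budget <= 0:
            break
        if i == j:
            continue
        take = min(y[i], budget)
        y[i] -= take
        y[j] += take
        budget -= take
    return y
```

The returned point stays in \Delta_n and satisfies \Vert{x-y}\Vert_1 \leq \sqrt{n}r by construction, so both properties of Definition 1 hold with \rho = \sqrt{n}.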

What about the general case in which \mathcal{X} is some arbitrary polytope? We would like to generalize the construction for the simplex to arbitrary polytopes.

Given the input point to the LLO x\in\mathcal{X}, let us write x as a convex combination of vertices of the polytope, that is, x = \sum_i\lambda_iv_i where \lambda_i > 0, \sum_i\lambda_i=1 and each v_i is a vertex of \mathcal{X}. Suppose now that there exists a constant \mu = \mu(\mathcal{X}) such that any point z\in\mathcal{X} satisfying \Vert{x-z}\Vert_2 \leq r can also be written as a convex combination of vertices of the polytope, z=\sum_i\gamma_iv_i, such that \Vert{\lambda - \gamma}\Vert_1 \leq \mu\cdot{}r. Denote by m the number of vertices of \mathcal{X} and by \lbrace{v_i}\rbrace_{i=1}^m the set of these vertices. Consider the following optimization problem,

    \begin{eqnarray*} &&\min_{\gamma\in\Delta_m}\sum_{i=1}^m\gamma_iv_i^{\top}c \\ && \textrm{s.t. } \Vert{\lambda-\gamma}\Vert_1 \leq \mu\cdot{}r \end{eqnarray*}

Since the above problem is again a linear \ell_1-constrained optimization problem over the simplex, we know that it is solvable using only a single call to the linear oracle of \mathcal{X} plus time that is polynomial in the number of non-zeros in \lambda (which is at most t+1 after t iterations of the algorithm and in particular does not depend on m, which may be exponential in n). Let \gamma^* be an optimal solution to the above problem and let y^* = \sum_{i=1}^m\gamma^*_iv_i. Then clearly, by our definition of \mu, for all z\in\mathcal{X}\cap\mathbb{B}_r(x) it holds that c^{\top}y^* \leq c^{\top}z. Also it is not hard to show that \Vert{x-y^*}\Vert_2 \leq \mu{}Dr, where D denotes the diameter of \mathcal{X}. Thus, if indeed such a constant \mu exists, then we have a construction of an LLO with parameter \rho = \mu{}D that requires only a single call to a linear optimization oracle of \mathcal{X}.
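
Assuming access to a linear minimization oracle for \mathcal{X}, this construction can be sketched as follows (the names polytope_llo and lin_oracle are ours, and the explicit vertex list stands in for the sparse bookkeeping of the actual algorithm; by the same greedy argument as for the simplex, the optimal \gamma moves weight from the most expensive support vertices onto the vertex returned by the oracle):

```python
def polytope_llo(vertices, lam, c, r, mu, lin_oracle):
    """LLO for a polytope, given the iterate x = sum_i lam[i] * vertices[i]
    as an explicit convex combination (vertices: list of numpy vectors).
    Solves  min_{gamma in Delta_m} sum_i gamma_i v_i^T c
    subject to  ||lam - gamma||_1 <= mu * r.
    lin_oracle(c) is assumed to return a vertex of X minimizing c^T v."""
    w = lin_oracle(c)                      # the single linear minimization over X
    costs = [v @ c for v in vertices]
    vertices, lam = vertices + [w], list(lam) + [0.0]
    budget = mu * r / 2.0                  # moving mass delta changes ||lam - gamma||_1 by 2*delta
    # Shift weight from the most expensive support vertices onto w.
    for i in sorted(range(len(costs)), key=lambda k: -costs[k]):
        if budget <= 0:
            break
        take = min(lam[i], budget)
        lam[i] -= take
        lam[-1] += take
        budget -= take
    y = sum(g * v for g, v in zip(lam, vertices))
    return vertices, lam, y
```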

The following theorem states the convergence rate of the proposed method.

Theorem 1: Given a polytope \mathcal{X} with parameter \mu as defined above, there exists an algorithm that after t linear minimization steps over the domain \mathcal{X} produces a point x_t\in\mathcal{X} such that

    \begin{eqnarray*} f(x_t) - f(x^*) \leq \left({f(x_0)-f(x^*)}\right)e^{-\frac{1}{4Q\mu^2D^2}t} \end{eqnarray*}
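
In other words, to guarantee f(x_t) - f(x^*) \leq \epsilon it suffices to perform

    \begin{eqnarray*} t \geq 4Q\mu^2D^2\ln\left({\frac{f(x_0)-f(x^*)}{\epsilon}}\right) \end{eqnarray*}

linear minimization steps, that is, O\left({Q\mu^2D^2\log(1/\epsilon)}\right) calls to the linear oracle.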

The following theorem states that indeed a constant \mu as suggested above exists for any polyhedral set and gives its dependence on geometric properties of the polytope.

Theorem 2: Let \mathcal{X} be a polytope defined as \mathcal{X} = \lbrace{x\in\mathbb{R}^n \, | \, A_1x = b_1, \, A_2x \leq b_2}\rbrace such that A_2\in\mathbb{R}^{m_2\times{n}}. Assume that there exist parameters \psi >0,\xi >0 such that:

1. For any matrix M which consists of at most n linearly independent rows of A_2 it holds that \Vert{M}\Vert_2 \leq \psi (here \Vert{M}\Vert_2 denotes the spectral norm of M).

2. For every vertex v of \mathcal{X} and every row A_2(i) of A_2 it holds that either A_2(i)^{\top}v = b_2(i) or A_2(i)^{\top}v \leq b_2(i) - \xi (that is, a vertex either satisfies an inequality constraint with equality or is far from satisfying it with equality).

Then \mu \leq \frac{\sqrt{n}\psi}{\xi}.

We now turn to point out several examples of polyhedral sets for which tailored combinatorial algorithms for linear optimization exist and for which the bound on \mu given in Theorem 2 is reasonable.

 

The simplex

The n-dimensional simplex can be written in the form of linear constraints as \Delta_n = \lbrace{x \, | \, -\textbf{I}x \leq \vec{0}, \, \vec{1}\cdot{}x = 1}\rbrace. It's not hard to see that for the simplex it holds that \psi = \xi = 1, and thus by Theorem 2, \mu \leq \sqrt{n}. In particular we get that when applying the proposed algorithm to the problem \min_{x\in\Delta_n}\Vert{x}\Vert_2^2, an error \epsilon is achieved after O(n\log(1/\epsilon)) iterations, which, as stated here, is nearly optimal.
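
As a quick sanity check, the following snippet (ours, feasible only for small n) verifies \psi = \xi = 1 for the simplex by brute force:

```python
import itertools
import numpy as np

n = 5
A2 = -np.eye(n)            # the inequality constraints  -x <= 0
b2 = np.zeros(n)
vertices = np.eye(n)       # the vertices of Delta_n are e_1, ..., e_n

# psi: max spectral norm over subsets of rows of A2 (all rows of -I are
# linearly independent, so no independence check is needed here).
psi = max(np.linalg.norm(A2[list(S), :], 2)
          for k in range(1, n + 1)
          for S in itertools.combinations(range(n), k))

# xi: each vertex satisfies each inequality with equality or with slack >= xi.
slacks = b2 - vertices @ A2.T      # slack of vertex v on row i: b2[i] - A2(i)^T v
xi = slacks[slacks > 1e-12].min()

print(psi, xi)                     # prints 1.0 1.0, so Theorem 2 gives mu <= sqrt(n)
```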

 

The flow polytope

Given a directed acyclic graph with m edges, a source node marked s and a target node marked t, every path from s to t in the graph can be represented by its identifying vector, that is, a vector in \lbrace{0,1}\rbrace^m in which the entries that are set to 1 correspond to the edges of the path. The flow polytope of the graph is the convex hull of all such identifying vectors of the simple paths from s to t. This polytope is also exactly the set of all unit s-t flows in the graph if we assume that each edge has a unit flow capacity (a flow is represented here as a vector in \mathbb{R}^m in which each entry is the amount of flow through the corresponding edge). For this polytope also it is not hard to see that \psi = \xi = 1, and thus, since the polytope lives in \mathbb{R}^m, \mu \leq \sqrt{m}. Since the flow polytope is just the convex hull of s-t paths in the graph, minimizing a linear objective over it amounts to finding a minimum weight path given weights for the edges.
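
A linear minimization oracle for this polytope can thus be sketched as follows (a toy implementation with names and input format of our choosing; since the graph is acyclic, dynamic programming over a topological order handles negative edge weights as well):

```python
from collections import defaultdict

def min_weight_path(edges, weights, s, t, topo_order):
    """Find a minimum weight s-t path in a DAG and return its identifying
    0/1 vector (indexed like `edges`, a list of (u, v) pairs).
    Assumes t is reachable from s."""
    INF = float("inf")
    dist = defaultdict(lambda: INF)
    prev_edge = {}
    dist[s] = 0.0
    out = defaultdict(list)               # outgoing edge indices of each node
    for e, (u, v) in enumerate(edges):
        out[u].append(e)
    for u in topo_order:                  # relax edges in topological order
        if dist[u] == INF:
            continue
        for e in out[u]:
            v = edges[e][1]
            if dist[u] + weights[e] < dist[v]:
                dist[v] = dist[u] + weights[e]
                prev_edge[v] = e
    x = [0] * len(edges)                  # identifying vector of the path
    v = t
    while v != s:                         # walk back from t to s
        e = prev_edge[v]
        x[e] = 1
        v = edges[e][0]
    return x
```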

 

The doubly-stochastic matrices polytope

Doubly-stochastic matrices are square real-valued matrices with non-negative entries in which every row and every column sums to 1. Writing down the linear inequalities and equalities that define this polytope yields that here also \psi = \xi = 1; since the polytope lives in dimension n^2, Theorem 2 gives \mu \leq \sqrt{n^2} = n.

The Birkhoff-von Neumann theorem states that this polytope is the convex hull of exactly all n\times{n} permutation matrices. Since a permutation matrix corresponds to a perfect matching in the complete bipartite graph, linear minimization over this polytope corresponds to finding a minimum weight perfect matching in a bipartite graph.
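
Using SciPy's linear_sum_assignment (a Hungarian-type matching solver), such an oracle can be sketched in a few lines (the wrapper name is ours):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def birkhoff_linear_oracle(C):
    """Minimize <C, X> over doubly-stochastic X.  By the Birkhoff-von Neumann
    theorem the minimum is attained at a permutation matrix, i.e. at a minimum
    weight perfect matching of the complete bipartite graph with weights C."""
    rows, cols = linear_sum_assignment(C)  # optimal assignment (matching)
    P = np.zeros_like(C, dtype=float)
    P[rows, cols] = 1.0                    # the corresponding permutation matrix
    return P
```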

 

Matroid polytopes

A matroid is a pair (E,I) where E is a set of elements and I is a set of subsets of E, called the independent sets, which satisfy various interesting properties that resemble the concept of linear independence in vector spaces.

Matroids have been studied extensively in combinatorial optimization and a key example of a matroid is the graphical matroid, in which the set E is the set of edges of a given graph and the set I is the set of all subsets of E which are cycle-free. In particular, I in this case contains all the spanning trees of the graph. A subset S\in{I} can be represented by its identifying vector, which lies in \lbrace{0,1}\rbrace^{\vert{E}\vert}; this also gives rise to the matroid polytope, which is just the convex hull of all identifying vectors of sets in I. It can be shown that the matroid polytope can be defined by exponentially many linear inequalities (exponential in \vert{E}\vert), which makes optimization over it a non-trivial issue, since the ellipsoid method needs to be used, which is highly impractical. Moreover, a separation oracle for this polytope which runs in time polynomial in \vert{E}\vert exists; however, it is also fairly complicated. In contrast, linear optimization over matroid polytopes is very easy using a simple greedy procedure which runs in nearly linear time. It can be shown that for the matroid polytope it holds that \xi = 1 and \psi < n where n = \vert{E}\vert. Thus, by Theorem 2, \mu < n^{3/2}.
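
For the graphical matroid, for example, the greedy linear minimization oracle is essentially Kruskal's algorithm restricted to negative-weight edges; a sketch (names ours):

```python
class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False        # a and b already connected: edge would close a cycle
        self.parent[ra] = rb
        return True

def graphic_matroid_oracle(num_nodes, edges, c):
    """Linear minimization over the graphical matroid polytope: find a
    cycle-free edge set S minimizing the sum of c[e] over e in S, by greedily
    adding negative-weight edges, cheapest first, that keep the set a forest.
    Returns the 0/1 identifying vector, indexed like `edges`."""
    uf = UnionFind(num_nodes)
    x = [0] * len(edges)
    for e in sorted(range(len(edges)), key=lambda k: c[k]):
        if c[e] >= 0:
            break               # non-negative weights can only increase the objective
        u, v = edges[e]
        if uf.union(u, v):
            x[e] = 1
    return x
```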

An interesting application of the above method, presented here, which was also our initial motivation for studying this problem, is to the setting of online convex optimization (sometimes termed online learning). Combining the above algorithm with the analysis of an online Frank-Wolfe method presented in this paper by Hazan and Kale from ICML 2012 yields an algorithm for online convex optimization whose iteration complexity amounts to a single linear optimization step over the domain, instead of a projection computation, which can be much more involved. This algorithm has optimal regret in terms of the game length, which answers an open question posed by Kalai and Vempala in this paper from COLT 2003 and by Hazan and Kale at ICML 2012. Further applications of the method include Frank-Wolfe-like algorithms for stochastic and non-smooth optimization.

 

We end this post with some further research questions.

The method presented in this post clearly holds only for “nicely shaped” polytopes because of the dependency of the constant \mu on \psi, \xi. In particular, if we take \mathcal{X} to be the Euclidean unit ball, which can be defined by infinitely many linear inequalities, we have \xi = 0 and the analysis breaks down. So, an interesting open question is:

Question 3: Is there a CG method with a linear convergence rate for smooth and strongly-convex optimization over arbitrary convex sets? In particular, is the rate suggested in Question 2 (see Part I) attainable?
