Optimization for Big Data, Part 1

Enrique Bonilla
Datank.ai Blog · Mar 13, 2018

This article is based on the lectures given by Peter Richtárik in the Modern Optimization Methods for Big Data class at the University of Edinburgh in 2017. Thank you for such a great class.

Introduction

A fundamental task when building a model in Machine Learning is to determine an optimal set of values for the model’s parameters, so that it performs as well as possible. As we choose better values, we get a finer fit and better predictions. For this reason, it is common to turn to the area of mathematical optimization and apply the available methods to fit a certain model to our data.

However, classic optimization methods, such as Gradient Descent and Newton’s Method, struggle to fit a model in the presence of big data. This is because they need to compute functions that depend on all of the data; for example, a full evaluation of the Hessian matrix might not even fit in memory.
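As a rough back-of-the-envelope illustration of the memory problem (the dimension d = 100,000 is an arbitrary choice for the example), a dense Hessian has d × d entries, so its footprint grows quadratically with the number of parameters:

```python
# Memory needed to store a dense d x d Hessian with 8-byte (float64) entries.
# For a model with d = 100,000 parameters, the matrix alone takes 80 GB,
# far beyond the RAM of a typical machine.
d = 100_000
hessian_gigabytes = d * d * 8 / 1e9
print(hessian_gigabytes)  # 80.0
```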

Because of this, stochastic methods such as Stochastic Gradient Descent have been developed. These methods typically use a random subspace of the feasible space, or a random estimate of the objective function, to make computing an optimal value of an optimization problem cheaper.

This is the first of a two-part article. Here we will describe one of the most frequent optimization problems found in machine learning. Following this, we will give a brief explanation of classic non-stochastic iterative methods and how they are used to solve it. In the context of iterative methods, we will then introduce the reader to how stochastic methods work and why they are a suitable solution when dealing with big amounts of data. Finally, a very basic but effective stochastic method, Stochastic Gradient Descent, will be illustrated.

In the second part we will talk about two more complex methods: the NSync algorithm (Richtárik et al.) and the Dual Free Stochastic Dual Coordinate Ascent, or dfSDCA (Shalev-Shwartz et al.). Notably, these two methods can take advantage of the particularities of the optimization problem and, under certain circumstances, outperform classic stochastic methods such as Stochastic Gradient Descent (Richtárik et al.). This part of the article will also include a comparison between the three presented methods and will advise the reader on how to select a method for their particular problem.

Formulation of the problem

In Supervised Learning, the task of finding the best parameter values given data is commonly considered an optimization problem.

Given a dataset $X \in \mathbb{R}^{n \times d}$ with n observations and d variables, a target variable $Y \in \mathbb{R}^{n}$, a vector of parameters (weights) $w \in \mathbb{R}^{d}$, a loss function $loss: \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ and a function $f: \mathbb{R}^{d} \times \mathbb{R}^{d} \to \mathbb{R}$, we define the problem as:

$$w^{*} = \arg\min_{w \in \mathbb{R}^{d}} L(w), \qquad L(w) = \frac{1}{n} \sum_{i=1}^{n} loss(f(x_{i}, w),\, y_{i})$$

Here, f is our machine learning model: given a particular value of the weights, it outputs a prediction for an observation. Additionally, the loss function loss defines how each configuration of w is penalized according to f and its associated target.

This optimization problem can be interpreted as finding the value $w^{*}$ that minimizes the error between what we observed in the dependent variable Y of a training set and its corresponding prediction $f(X, w)$.

More formally, it can be understood as estimating the value $w^{*}$ that minimizes a Monte Carlo approximation

$$L(w) = \frac{1}{n} \sum_{i=1}^{n} loss(f(x_{i}, w),\, y_{i})$$

of the expected loss

$$\mathbb{E}_{(\tilde{X}, \tilde{Y}) \sim P}\big[\, loss(f(\tilde{X}, w),\, \tilde{Y}) \,\big],$$

where $P$ is the joint distribution of the co-variables selected for the model and the dependent variable. Note that $(\tilde{X}, \tilde{Y})$ are random variables, while (X, Y) are realizations of the random variables.

Some concrete examples where this formulation is used to find optimal weights are a linear regression $f(x_{i}, w) = x_{i}^{T} w$ using a quadratic loss $loss(f(x_{i}, w), y_{i}) = (f(x_{i}, w) - y_{i})^{2}$, and something much more complex such as finding good weight values for a Neural Network.
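To make the formulation concrete, here is a minimal Python sketch of the Monte Carlo approximation L(w) for the linear-regression model with quadratic loss (the data and weights below are toy values chosen for illustration):

```python
import numpy as np

def empirical_loss(X, Y, w):
    """Monte Carlo approximation L(w) = (1/n) * sum_i (f(x_i, w) - y_i)^2
    for the linear model f(x_i, w) = x_i . w with quadratic loss."""
    predictions = X @ w          # f(x_i, w) for every observation
    residuals = predictions - Y  # prediction error per observation
    return np.mean(residuals ** 2)

# Tiny example: n = 3 observations, d = 2 variables
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
Y = np.array([1.0, 2.0, 3.0])
w = np.array([1.0, 2.0])
print(empirical_loss(X, Y, w))  # 0.0 -- this w reproduces Y exactly
```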

Iterative methods

A variety of methods could be used to solve this problem. The kind of methods available and the quality of the solution depend on the complexity of f and L. For continuously differentiable and convex functions L in particular, iterative methods such as Gradient Descent are a reliable option for finding an optimal solution.

Iterative methods start with a feasible point $w_{0}$ and iteratively generate a sequence of points $w_{1}, w_{2}, \dots$ such that each new point is closer, according to some sense or metric, to an optimal solution $w^{*}$. Each point in the sequence is generated by the following rule:

$$w_{i+1} = w_{i} + \alpha_{i} h_{i},$$

where $h_{i}$ is the descent direction, $\alpha_{i}$ is the step size and $\alpha_{i} h_{i}$ is the step.

This method only produces approximate solutions to $w^{*}$. The step size and the descent direction can be determined in different ways: the descent direction, for example, can be calculated using the first or second derivative of the Monte Carlo approximation L with respect to w, evaluated at the current point, i.e. $h_{i} = -\nabla L(w_{i})$ (Gradient Descent) or $h_{i} = -[\nabla^{2} L(w_{i})]^{-1} \nabla L(w_{i})$ (Newton’s Method).

The stopping criterion of the algorithm commonly depends on how close the point generated at iteration i is to the optimal solution $w^{*}$; this can be measured, for example, by evaluating the gradient norm $\| \nabla L(w_{i}) \|$.
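A minimal Gradient Descent sketch for the quadratic-loss linear model, using a constant step size and the gradient norm as stopping criterion (the step size, tolerance and toy data are illustrative choices, not prescriptions):

```python
import numpy as np

def gradient_descent(X, Y, step_size=0.1, tol=1e-6, max_iters=10_000):
    """Minimize L(w) = (1/n) * ||X w - Y||^2 via w_{i+1} = w_i + alpha_i * h_i,
    with descent direction h_i = -grad L(w_i) and a constant step size."""
    n, d = X.shape
    w = np.zeros(d)  # feasible starting point w_0
    for _ in range(max_iters):
        grad = (2.0 / n) * X.T @ (X @ w - Y)  # gradient of the Monte Carlo loss
        if np.linalg.norm(grad) < tol:        # stopping criterion ||grad L(w_i)||
            break
        w = w - step_size * grad              # take the step alpha_i * h_i
    return w

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
Y = np.array([1.0, 2.0, 3.0])
w_star = gradient_descent(X, Y)
print(w_star)  # close to [1.0, 2.0]
```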

The problem with classic iterative methods is that when dealing with a big database, with either big n or big d, the descent direction can be very expensive to compute. For example, computing the gradient

$$\nabla L(w) = \frac{1}{n} \sum_{i=1}^{n} \nabla_{w}\, loss(f(x_{i}, w),\, y_{i})$$

can be very costly because either a lot of terms $\nabla_{w}\, loss(f(x_{i}, w), y_{i})$ need to be computed (big n) or a lot of partial derivatives $\partial L / \partial w_{j}$ need to be stored (big d). Thus, stochastic iterative methods are a sensible solution for optimizing a problem in this case.

Stochastic methods use a random subset S of size s of the n observations, or a random subset T of size t of the d variables (s ≪ n, t ≪ d), instead of calculating the descent direction with all the available data at once. As a consequence, the computation of the descent direction is significantly cheaper. Classic iterative methods are designed so that we get a good approximation of w* in just a few iterations. However, since computing the descent direction is expensive with big data, each iteration can take hours. On the other hand, stochastic iterative methods need more iterations to converge, but since each iteration is much less expensive, they can easily outperform classic methods if the random subsets and step size are adequately chosen.

Stochastic Gradient Descent

Stochastic Gradient Descent is the simplest and yet the most common randomized algorithm around. Instead of using the whole gradient of L as a descent direction, this method uses random estimates of it.

Given that $g_{i}$ is a random estimate of the gradient, rather than using the full dataset to compute $\nabla L(w_{i})$, this method generates $g_{i}$ with a random subset of observations $S \subseteq \{1, \dots, n\}$:

$$g_{i} = \frac{1}{|S|} \sum_{j \in S} \nabla_{w}\, loss(f(x_{j}, w_{i}),\, y_{j})$$

Since $\mathbb{E}[g_{i}] = \nabla L(w_{i})$, in the long run updating w with various small samples has, on average, the same effect as updating it with the full gradient all at once.

The big difference is that handling a few observations in each iteration can be computationally more efficient than handling all observations. The Stochastic Gradient Descent algorithm can be written as follows:
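A minimal Python sketch of mini-batch Stochastic Gradient Descent for the quadratic-loss linear model (the sample size s, step size and iteration count below are illustrative choices):

```python
import numpy as np

def sgd(X, Y, s=1, step_size=0.05, n_iters=5000, seed=0):
    """Mini-batch Stochastic Gradient Descent for
    L(w) = (1/n) * sum_i (x_i . w - y_i)^2: at each iteration a random
    subset S of s observations stands in for the full dataset."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)                                # starting point w_0
    for _ in range(n_iters):
        S = rng.choice(n, size=s, replace=False)   # random subset of observations
        X_S, Y_S = X[S], Y[S]
        g = (2.0 / s) * X_S.T @ (X_S @ w - Y_S)    # random estimate of grad L(w)
        w = w - step_size * g                      # stochastic gradient step
    return w

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
Y = np.array([1.0, 2.0, 3.0])
w_hat = sgd(X, Y, s=1)
print(w_hat)  # approaches [1.0, 2.0]
```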

Normally, the size of the sample s is set to 1; if s > 1 the algorithm is called s-nice or mini-batch Stochastic Gradient Descent. The advantage of taking s > 1 is that the work can be parallelized, so that different computers handle different samples, making it more efficient. Even though this algorithm seems very simple, it can be very effective: an illustration of its effectiveness is that it is frequently used to train neural networks.


Data Scientist at Datank. BSc in Applied Mathematics from ITAM. MSc in Data Science from the University of Edinburgh.