Optimization for Big Data, Part 2

Enrique Bonilla
Published in Datank.ai Blog · 7 min read · May 8, 2018


To read Part 1, see the following link.

NSync

The NSync algorithm (Richtárik et al.) is another stochastic iterative algorithm. In contrast to Stochastic Gradient Descent, this method does not sample observations to reduce the amount of computation; it samples variables. As a consequence, in each iteration of the algorithm, the n observations in our database are projected onto the random subspace generated by the sampled variables, and instead of updating all the weights at once, only the weights that correspond to these variables are updated.

This algorithm requires more assumptions on the function L(w) than Stochastic Gradient Descent does to find a solution w*. The method assumes that L is smooth and strongly convex, which means NSync can be applied to fewer loss functions L and, consequently, fewer machine learning models. However, this is compensated by the fact that, under adequate conditions, NSync finds w* faster.

Given a vector of step sizes v = (v_1, …, v_d), the algorithm's pseudocode is the following: at every iteration, sample a random subset S of the d variables and update only those coordinates, setting w_i ← w_i − (1/v_i) ∇_i L(w) for each i in S, while leaving every other weight unchanged.

It is important to observe that ∇_i L(w) is not the same object as the stochastic gradient used by Stochastic Gradient Descent: the first refers to the partial derivative of L with respect to the parameter w_i, while the second is a random estimate of the full gradient of L. There are also a few other crucial details to explain about this algorithm.
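To make the update concrete, here is a minimal NumPy sketch of an NSync-style iteration. The ridge-regression loss, the serial uniform sampling, and the particular choice of v below are illustrative assumptions, not part of the algorithm's definition.

```python
import numpy as np

def nsync(grad_coord, d, v, sample_subset, n_iter=2000):
    """Minimal NSync sketch: update only the sampled coordinates.

    grad_coord(w, i) -> partial derivative of L at w with respect to w_i
    v                -> vector of step-size parameters, one per coordinate
    sample_subset()  -> returns the random subset S of coordinates to update
    """
    w = np.zeros(d)
    for _ in range(n_iter):
        S = sample_subset()                          # arbitrary sampling of variables
        for i in S:
            w[i] -= (1.0 / v[i]) * grad_coord(w, i)  # touch only the sampled weights
    return w

# Illustrative example (assumed, not from the post): ridge regression on random data.
rng = np.random.default_rng(0)
n, d, lam = 200, 10, 0.1
X, y = rng.normal(size=(n, d)), rng.normal(size=n)

def grad_coord(w, i):
    # partial derivative of (1/n)*||Xw - y||^2 + lam*||w||^2 with respect to w_i
    return (2.0 / n) * X[:, i] @ (X @ w - y) + 2.0 * lam * w[i]

# Coordinate-wise Lipschitz constants: a safe choice of v when sampling one
# coordinate per iteration (larger subsets call for larger v).
v = (2.0 / n) * (X ** 2).sum(axis=0) + 2.0 * lam
w_hat = nsync(grad_coord, d, v, lambda: [rng.integers(d)])
```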

As opposed to Stochastic Gradient Descent, NSync allows arbitrary sampling. This means that the probability of sampling a subset S of variables, i.e. P(S), at each iteration can follow any probability distribution. For example, if you have four variables 1, 2, 3, 4, you could set the sampling distribution to P({1, 2}) = 0.4, P({3}) = 0.3, P({4}) = 0.2 and P({1, 3, 4}) = 0.1, or construct it with any other values you want.

A requirement for this distribution is that the probability of sampling the i-th variable, i.e. p_i = P(i ∈ S), must be greater than zero for every i. If this condition is not satisfied, there is a weight w_i that is never going to be sampled, and hence never going to be updated. As a consequence, the algorithm will never converge to an optimal solution. Also, note that the values assigned to the probabilities p_i determine the expected size of your sample, E[|S|] = p_1 + … + p_d. For example, if p_i = 1 for all i = 1, …, d, then in every iteration the algorithm will sample all variables.
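As a sanity check of these two requirements, the sketch below builds the (arbitrary, assumed) subset distribution from the example above, computes the marginal probabilities p_i, verifies they are all positive, and computes the expected sample size.

```python
import numpy as np

# The example distribution from above, with the variables indexed 0-3.
subsets = [(0, 1), (2,), (3,), (0, 2, 3)]
probs   = [0.4,    0.3,  0.2,  0.1]          # must sum to 1

d = 4
p = np.zeros(d)                              # p[i] = probability that variable i is sampled
for S, q in zip(subsets, probs):
    for i in S:
        p[i] += q

assert np.all(p > 0), "every variable needs a positive chance of being sampled"
expected_size = p.sum()                      # E[|S|] = p_1 + ... + p_d
print(p, expected_size)                      # [0.5 0.4 0.4 0.3], 1.6

# Drawing one subset per iteration of NSync:
rng = np.random.default_rng(0)
S = subsets[rng.choice(len(subsets), p=probs)]
```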

Even if, theoretically, the sampling distribution can be any distribution, some distributions are easier to use for generating samples. Uniform sampling, importance sampling, t-nice and importance t-nice are some examples constructed with easy-to-handle probability distributions (Richtárik et al.). Uniform sampling and importance sampling sample just one variable at each iteration. The difference between these two is that, whereas uniform sampling assigns p_i = 1/d to every weight, importance sampling assigns the value of p_i according to how useful it is to update the i-th weight, so that the algorithm converges faster to w*. As can be intuitively imagined, NSync with importance sampling tends to find a good solution faster than NSync with uniform sampling (Richtárik et al.).

T-nice sampling, much like mini-batching in Stochastic Gradient Descent, samples exactly t > 1 variables at every iteration of the algorithm. While the original version of t-nice assigns the same probability to all batches of size t, importance t-nice assigns different probabilities to each batch in order to improve the performance of the algorithm. When using either version of t-nice sampling, this algorithm can be parallelized.
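A small sketch of how these four sampling schemes could be drawn with NumPy; the "usefulness" scores standing in for "how useful it is to update the i-th weight" are an assumption made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, t = 10, 3
# Assumed stand-in for "how useful it is to update the i-th weight",
# e.g. coordinate-wise smoothness constants of the loss.
usefulness = rng.uniform(1.0, 5.0, size=d)
p_importance = usefulness / usefulness.sum()

i = rng.choice(d)                              # uniform: one variable, p_i = 1/d
i = rng.choice(d, p=p_importance)              # importance: one variable, biased
S = rng.choice(d, size=t, replace=False)       # t-nice: t variables, all batches equally likely
S = rng.choice(d, size=t, replace=False,       # importance t-nice: one simple way to bias
               p=p_importance)                 # batches towards the more useful variables
```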

Another important observation is that this algorithm does not use a single step size for updating every weight; it uses a different step size v_i for each weight w_i. The properties of the algorithm depend on your choice of the vector of step sizes v. In particular, for loss functions that can be written as a sum of Lipschitz-smooth functions of linear combinations of the weights, such as L(w) = (1/n) Σ_i φ_i(x_i^T w), the parameters v_1, …, v_d can be computed so that NSync converges linearly (Richtárik et al.).

Even if the conditions mentioned in the paragraph above seem very abstract, many machine learning models, with their usual losses, fulfill these criteria. Linear Regression, Regularized Linear Regression, and Logistic Regression are some examples. This makes NSync a reliable optimization algorithm for many machine learning models.
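For instance, for ℓ2-regularized logistic regression (one of the models just mentioned), coordinate-wise smoothness constants give one valid, commonly used choice of v when one variable is sampled per iteration; the random data below is only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 500, 20, 0.01
X = rng.normal(size=(n, d))
y = rng.choice([-1.0, 1.0], size=n)

# L(w) = (1/n) * sum_j log(1 + exp(-y_j * x_j^T w)) + (lam/2) * ||w||^2
# The logistic loss has second derivative at most 1/4, so the i-th diagonal
# entry of the Hessian is bounded by (1/(4n)) * sum_j X[j, i]^2 + lam.
v = (X ** 2).sum(axis=0) / (4.0 * n) + lam   # one step-size parameter per weight
```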

dfSDCA

The Dual-Free Stochastic Dual Coordinate Ascent, or dfSDCA (Shalev-Shwartz et al.), is another algorithm that works with random estimates of the gradient of L. Like NSync, this algorithm makes assumptions on the loss function. If the loss function is of the form L(w) = (1/n) Σ_i φ_i(w) + (λ/2)‖w‖², it assumes that each φ_i is Lipschitz-smooth and convex, not strongly convex as with NSync. The fact that only convexity is assumed, and not strong convexity, makes this method capable of optimizing a bigger family of loss functions than NSync. This algorithm also allows arbitrary sampling.

The pseudocode of dfSDCA is the following:
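Here is a minimal NumPy sketch of the dual-free SDCA iteration with uniform sampling, following the update described in Shalev-Shwartz et al.; the squared-loss gradient and the fixed step size eta used in the example are illustrative assumptions.

```python
import numpy as np

def dfsdca(grad_phi, n, d, lam, eta, n_iter=5000, seed=0):
    """Dual-free SDCA sketch for L(w) = (1/n) * sum_i phi_i(w) + (lam/2)*||w||^2.

    grad_phi(w, i) -> gradient of the i-th observation's loss phi_i at w.
    Keeps one pseudo-dual vector alpha_i per observation, with
    w = (1 / (lam * n)) * sum_i alpha_i maintained implicitly.
    """
    rng = np.random.default_rng(seed)
    alpha = np.zeros((n, d))
    w = alpha.sum(axis=0) / (lam * n)          # equals 0 at the start
    for _ in range(n_iter):
        i = rng.integers(n)                    # uniform sampling over observations
        residual = grad_phi(w, i) + alpha[i]   # random estimate used for both updates
        alpha[i] -= eta * lam * n * residual   # pseudo-dual update
        w -= eta * residual                    # primal update, keeps w consistent with alpha
    return w

# Illustrative example (assumed): squared loss, phi_i(w) = (x_i^T w - y_i)^2.
rng = np.random.default_rng(1)
n, d, lam = 200, 10, 0.1
X, y = rng.normal(size=(n, d)), rng.normal(size=n)
w_hat = dfsdca(lambda w, i: 2.0 * (X[i] @ w - y[i]) * X[i], n, d, lam, eta=1e-3)
```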

This method, like NSync, can be used with uniform, importance, t-nice, and importance t-nice sampling (Richtárik et al.). It can also be implemented with adaptive probabilities, i.e. in each iteration of the algorithm the probability of choosing a certain observation changes, so that observations that have not been sampled yet have a higher probability of being chosen (Richtárik et al.). The idea behind dfSDCA is to imitate the Stochastic Dual Coordinate Ascent algorithm, which uses information from the dual optimization problem to converge faster. Nevertheless, dfSDCA does not require knowing or using the dual problem.
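One simple (assumed) way to realize such adaptive probabilities is to keep a weight per observation and shrink it whenever that observation is sampled, so observations that have not been seen recently become more likely to be chosen.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
weights = np.ones(n)                  # unnormalized per-observation sampling weights

for _ in range(20):
    p = weights / weights.sum()
    i = rng.choice(n, p=p)            # adaptive sampling of an observation
    # ... perform the dfSDCA update for observation i here ...
    weights[i] *= 0.5                 # assumed rule: freshly sampled observations are
                                      # downweighted, so unseen ones become likelier
```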

Comparison of the three methods

One of the main advantages of dfSDCA and NSync over Stochastic Gradient Descent is that they allow arbitrary sampling. As a consequence, the structure of our data can be exploited to obtain faster convergence to w*. Additionally, when comparing dfSDCA with Stochastic Gradient Descent, since both work with random estimates of the gradient of L, there have been results showing that dfSDCA converges much faster than Stochastic Gradient Descent (Richtárik et al.). However, the advantage of Stochastic Gradient Descent is that it makes fewer assumptions on L, and hence it can be used on a wider variety of optimization problems.

The fundamental difference between dfSDCA, Stochastic Gradient Descent, and NSync is that the first two sample over the n observations, while NSync samples over the d variables. Simply put, this makes dfSDCA and Stochastic Gradient Descent work better when your database has many observations, while NSync is faster when you have many variables. However, while this is frequently true, there are several other factors to consider.

For example, in Richtárik et al. the authors compare dfSDCA and NSync with uniform sampling, showing that when n << d, NSync is faster than dfSDCA, and conversely, when d << n, dfSDCA is faster. However, the sizes of d and n are not the only factors that need to be considered. The sparsity of the n rows and the d columns of your database also has a considerable effect on which algorithm performs better. A good practice for choosing an algorithm is to compute the implementation cost of each candidate algorithm on your data, and implement the one with the lowest cost.
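A rough, assumed heuristic for that cost estimate is to measure the per-iteration work each method does on your own design matrix: observation-sampling methods (Stochastic Gradient Descent, dfSDCA) touch one row per step, while NSync touches one column per sampled variable.

```python
import numpy as np
from scipy.sparse import random as sparse_random

# Assumed toy dataset; in practice use your own design matrix.
X = sparse_random(10_000, 500, density=0.01, format="csr", random_state=0)

nnz_per_row = np.diff(X.indptr).mean()            # avg cost of one SGD / dfSDCA step
nnz_per_col = np.diff(X.tocsc().indptr).mean()    # avg cost of one NSync coordinate update

print(f"avg non-zeros per row:    {nnz_per_row:.1f}")
print(f"avg non-zeros per column: {nnz_per_col:.1f}")
# A rough heuristic: favour the method whose per-iteration cost, times the number
# of iterations you expect to need, is smallest for your data.
```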

To conclude this comparison: there is no single algorithm that is suitable for every occasion. Each algorithm has its advantages and disadvantages. The person in charge of selecting the optimization method should analyze the problem and choose the most convenient one. By doing so, one can optimize a machine learning model adequately and fast.

