Convergence and Stability of the Stochastic Proximal Point Algorithm with Momentum
Introduction
In this blog post, I will introduce the stochastic proximal point algorithm with momentum (SPPAM) based on this paper,1 which is forthcoming in Learning for Dynamics and Control (L4DC) 2022.
Background
We focus on unconstrained empirical risk minimization:
Stochastic gradient descent (SGD) has become the de facto method to solve
While computationally efficient, SGD in
Acceleration via Momentum.
To address the slow convergence, many accelerated variants have been proposed, including Polyak's momentum 4 and Nesterov's acceleration 5 methods.
These methods achieve faster (sometimes optimal) convergence rates while having virtually the same per-iteration computational cost as SGD.
In particular, SGD with momentum (SGDM) iterates as follows:
Yet, SGDM can be hard to tune: it adds another hyperparameter, the momentum
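For reference, the standard SGD and SGDM update rules can be sketched in a few lines (a minimal illustration of the textbook updates, not code from the paper):

```python
def sgd_step(x, grad, eta):
    # Plain SGD: move against the stochastic gradient with step size eta.
    return x - eta * grad

def sgdm_step(x, x_prev, grad, eta, beta):
    # SGD with Polyak (heavy-ball) momentum: add a fraction beta of the
    # previous displacement x - x_prev on top of the gradient step.
    return x - eta * grad + beta * (x - x_prev)
```

On a simple quadratic such as f(x) = x²/2 (whose gradient at x is just x), both updates converge for small step sizes, but the extra hyperparameter beta changes the dynamics and must be tuned jointly with eta.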
Stability via Proximal Updates.
To address the numerical instability, variants of SGD that utilize proximal updates have recently been proposed.
The proximal point algorithm (PPA) obtains the next iterate for minimizing
Thanks to this conditioning property, PPA exhibits different behavior compared to GD in the deterministic setting.
For a convex function
Due to this remarkable property, PPA was soon considered in the stochastic setting, under the names stochastic proximal iterations (SPI) 10 11 or implicit SGD (ISGD) 12 13. These works generally indicate that SGD and SPI/ISGD share the same asymptotic convergence behavior, but that in the non-asymptotic regime SPI/ISGD outperforms SGD thanks to the numerical stability afforded by the proximal updates.
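For some losses the proximal subproblem has a closed-form solution, which is what makes SPI/ISGD practical. As a hedged sketch (my own illustration, not the paper's code), for a single least-squares term f_i(x) = ½(aᵀx − b)² the implicit step x⁺ = argmin_x { f_i(x) + ‖x − x_t‖²/(2η) } works out to:

```python
import numpy as np

def spi_step(x, a, b, eta):
    # One stochastic proximal iteration (implicit SGD step) for the
    # least-squares loss f_i(x) = 0.5 * (a @ x - b)**2. Solving the
    # proximal subproblem in closed form gives
    #   x+ = x - eta * (a @ x - b) / (1 + eta * ||a||^2) * a,
    # so the residual shrinks by 1 / (1 + eta * ||a||^2) -- a factor
    # that stays below 1 for EVERY eta > 0, unlike explicit SGD.
    residual = a @ x - b
    return x - eta * residual / (1.0 + eta * (a @ a)) * a
```

Note that even an aggressively large step size only makes the contraction factor smaller, which is exactly the stability property discussed above.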
Our Approach: Stochastic PPA with Momentum (SPPAM).
The main question we wanted to answer in this work was as follows:
SPPAM iterates as follows:
Apart from the similarity between SPPAM in
We can get a sense of the behavior of SPPAM from this expression.
First, for large
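To make the idea tangible in the least-squares setting, here is a hedged sketch of an SPPAM-style step: the momentum extrapolation is folded into the proximal (implicit) update, so the implicit step is taken from the extrapolated point. The exact formulation and analysis are in the paper; this is only my illustration for a single loss term f_i(x) = ½(aᵀx − b)²:

```python
import numpy as np

def sppam_step(x, x_prev, a, b, eta, beta):
    # SPPAM-style update (illustrative): first form the momentum-
    # extrapolated point z, then take the closed-form proximal step
    # for f_i(x) = 0.5 * (a @ x - b)**2 starting from z.
    z = x + beta * (x - x_prev)
    residual = a @ z - b
    return z - eta * residual / (1.0 + eta * (a @ a)) * a
```

The proximal part keeps the update contractive even for large step sizes, while the momentum term accelerates progress, which is the combination the paper analyzes.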
The quadratic model case
For simplicity, we first consider the convex quadratic optimization problem under the deterministic setting.
Specifically, we consider the objective function:
Proposition 1 (GD)14
To minimize
Proposition 2 (PPA)1
To minimize
Proposition 3 (GDM)14
To minimize
Proposition 4 (PPAM)1
Let
if ; if and ; otherwise.

Given the above propositions, we can study the stability with respect to the step size
In particular, for GD, only a small range of step sizes
In the next section, we will see how this pattern translates to a general strongly convex function
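The stability gap between GD and PPA is easy to check concretely in one dimension: on f(x) = (λ/2)x², GD contracts by |1 − ηλ| per step while PPA contracts by 1/(1 + ηλ). This is a standard textbook computation, not a restatement of the propositions above:

```python
def gd_factor(eta, lam):
    # One-step contraction of gradient descent on f(x) = (lam/2) x^2:
    # x+ = (1 - eta * lam) * x, so GD diverges once eta > 2 / lam.
    return abs(1.0 - eta * lam)

def ppa_factor(eta, lam):
    # One-step contraction of the proximal point step on the same f:
    # x+ = x / (1 + eta * lam), strictly below 1 for every eta > 0.
    return 1.0 / (1.0 + eta * lam)
```

For example, with λ = 1 and η = 3, GD's factor is 2 (divergence) while PPA's factor is 0.25 (fast convergence).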
Main Results
In this section, we theoretically characterize the convergence and stability behavior of SPPAM.
We model the stochastic errors of PPA following the setup of this paper.13
We can then express
Note that
We further assume the following:
Assumption 1
Assumption 2 There exists fixed
Is SPPAM faster than SPPA?
We first study whether SPPAM enjoys faster convergence than SPPA. We start with the iteration-invariant bound, which expresses the expected error at
Theorem 5
For
Notice that all terms, except the last one, are divided by
Based on
where
Lemma 6
The maximum eigenvalue of
Notice the one-step contraction factor in
However, due to the additional terms, it is not immediately obvious when SPPAM enjoys faster convergence than SPPA. We thus characterize this condition more precisely in the following corollary:
Corollary 7
For
In words, for a fixed step size
In contrast to (stochastic) gradient method analyses in convex optimization, where acceleration is usually shown by improving the dependency on the condition number from
As shown in Theorem 5, our convergence analysis of SPPAM does not depend on
Is SPPAM more stable than SGDM?
Theorem 9
For
Here,
The above theorem states that the term in
Theorem 10
Let the following condition hold:
To provide more context for the condition in Theorem 10, we make an "unfair" comparison of
This paper
shows that Nesterov’s accelerated SGD converges to a neighborhood at a linear rate for strongly convex quadratic objective if
Now, consider the standard momentum value
Experiments
In this section, we perform numerical experiments to study the convergence behaviors of SPPAM, SPPA, SGDM, and SGD, using generalized linear models (GLM).
GLMs subsume a wide family of models, including linear, logistic, and Poisson regression. Different models connect the linear predictor
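The mean functions linking the linear predictor to the response in these three GLMs are standard, and can be sketched as follows (a textbook mapping, not code from the experiments):

```python
import numpy as np

def glm_mean(linear_pred, model):
    # Mean function mapping the linear predictor a @ x to E[y]
    # for a few common generalized linear models.
    if model == "linear":
        return linear_pred                          # identity link
    if model == "logistic":
        return 1.0 / (1.0 + np.exp(-linear_pred))   # sigmoid
    if model == "poisson":
        return np.exp(linear_pred)                  # exponential mean
    raise ValueError(f"unknown model: {model}")
```

The exponential mean in the Poisson case is what makes gradients blow up for even moderately large linear predictors, which is why numerical stability matters most in the Poisson experiments below.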

In the top row of the figure above, we present the results for linear regression with different condition numbers, with Gaussian noise level
This trend is much more pronounced in the Poisson regression case, presented in the bottom row.
Due to the exponential mean function
Conclusion
We propose the stochastic proximal point algorithm with momentum (SPPAM), which directly incorporates Polyak's momentum inside the proximal step. We show that SPPAM converges to a neighborhood at a faster rate than the stochastic proximal point algorithm (SPPA), and characterize the conditions that result in acceleration. Further, we prove linear convergence of SPPAM to a neighborhood, and provide conditions that lead to an exponential discount of the initial conditions, akin to SPPA. We confirm our theory with numerical simulations on linear and Poisson regression models: SPPAM converges for all step sizes for which SPPA converges, at a faster rate that matches or improves upon SGDM.
-
Junhyung Lyle Kim, Panos Toulis, and Anastasios Kyrillidis. Convergence and stability of the stochastic proximal point algorithm with momentum. arXiv preprint arXiv:2111.06171, 2021. ↩︎
-
Eric Moulines and Francis R. Bach. Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 451–459. Curran Associates, Inc., 2011. ↩︎
-
Robert M Gower, Nicolas Loizou, Xun Qian, Alibek Sailanbayev, Egor Shulgin, and Peter Richtarik. SGD: General Analysis and Improved Rates. In Proceedings of the 36th International Conference on Machine Learning, 2019. ↩︎
-
Boris T Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964. ↩︎
-
Yurii Nesterov et al. Lectures on convex optimization, volume 137. Springer, 2018. ↩︎
-
Chaoyue Liu and Mikhail Belkin. Accelerating SGD with momentum for over-parameterized learning. arXiv preprint arXiv:1810.13395, 2018. ↩︎
-
Rahul Kidambi, Praneeth Netrapalli, Prateek Jain, and Sham Kakade. On the insufficiency of existing momentum schemes for stochastic optimization. In 2018 Information Theory and Applications Workshop (ITA), pages 1–9. IEEE, 2018. ↩︎
-
Mahmoud Assran and Michael Rabbat. On the Convergence of Nesterov’s Accelerated Gradient Method in Stochastic Settings. In Proceedings of the 37th International Conference on Machine Learning, 2020. ↩︎
-
Osman Güler. On the convergence of the proximal point algorithm for convex minimization. SIAM Journal on Control and Optimization, 29(2):403–419, 1991. ↩︎
-
Ernest K Ryu and Stephen Boyd. Stochastic Proximal Iteration: A Non-Asymptotic Improvement Upon Stochastic Gradient Descent. Author website, 2017. ↩︎
-
Hilal Asi and John C Duchi. Stochastic (approximate) proximal point methods: Convergence, optimality, and adaptivity. SIAM Journal on Optimization, 29(3):2257–2290, 2019. ↩︎
-
Panos Toulis and Edoardo M Airoldi. Asymptotic and finite-sample properties of estimators based on stochastic gradients. The Annals of Statistics, 45(4):1694–1727, 2017. ↩︎
-
Panos Toulis, Thibaut Horel, and Edoardo M Airoldi. The proximal robbins–monro method. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 83(1):188–212, 2021. ↩︎
-
Gabriel Goh. Why momentum really works. Distill, 2(4):e6, 2017. ↩︎
-
Osman Güler. New proximal point algorithms for convex minimization. SIAM Journal on Optimization, 2(4):649–664, 1992. ↩︎