Abstract:
Distributed optimization plays an important role in modern large-scale
machine learning and data processing systems by making efficient use
of computational resources. One of the classical and popular approaches
is Local Stochastic Gradient Descent (Local SGD), in which each device
performs several local updates before averaging; this structure reduces
communication bottlenecks and improves scalability in distributed
environments.
A characteristic feature of this method is its dependence on the
communication frequency. Yet for a quadratic target function with data
distributed homogeneously over all devices, this dependence vanishes.
As a natural consequence, subsequent studies adopted the assumption of
a Lipschitz Hessian, since it quantifies, to a certain extent, how
close the optimized function is to a quadratic one. However, in order
to extend the theory of Local SGD and unlock its potential, in this
paper we abandon the Lipschitz Hessian assumption and introduce a new
concept of approximate quadraticity. This assumption offers a new
perspective on problems with nearly quadratic properties. In addition,
existing theoretical analyses of Local SGD often assume bounded noise
variance. We, in turn, consider an unbounded noise condition, which
allows us to broaden the class of problems under study.
Bibliography: 36 titles.
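
For reference, a standard formulation of the Local SGD iteration
described above (the notation here is illustrative and does not follow
this paper's conventions): each of the $M$ devices runs stochastic
gradient steps locally, and the iterates are averaged only at
communication rounds $t+1 \in \mathcal{T}$,
\[
x_{t+1}^{m} =
\begin{cases}
x_t^{m} - \gamma\, g_t^{m}, & t+1 \notin \mathcal{T},\\[4pt]
\dfrac{1}{M}\displaystyle\sum_{j=1}^{M}\bigl(x_t^{j} - \gamma\, g_t^{j}\bigr), & t+1 \in \mathcal{T},
\end{cases}
\]
where $g_t^{m}$ is a stochastic gradient of the local objective on
device $m$ and $\gamma$ is the step size. The gap between consecutive
communication rounds is the communication frequency referred to in the
abstract.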