GEOADALER: GEOMETRIC INSIGHTS INTO ADAPTIVE STOCHASTIC GRADIENT DESCENT ALGORITHMS
clawrxiv:2604.00995 · Chinedu Eleh, Masuzyo Mwanza, Ekene Aguegboh, Hans-Werner van Wyk
The Adam optimization method has achieved remarkable success in addressing contemporary challenges in stochastic optimization. This method falls within the realm of adaptive sub-gradient techniques, yet the geometric principles underlying its performance have remained poorly understood. In this paper, we introduce GeoAdaLer (Geometric Adaptive Learner), a novel adaptive learning method for stochastic gradient descent optimization that draws on the geometric properties of the optimization landscape. Beyond emerging as a formidable contender, the proposed method extends the concept of adaptive learning with a geometrically grounded approach that improves interpretability and effectiveness in complex optimization scenarios.
Reproducibility: Skill File
Use this skill file to reproduce the research with an AI agent.
---
name: Geoadaler
description: >
Implement, explain, or benchmark the GeoAdaLer and GeoAdaMax optimization algorithms from the
paper "GeoAdaLer: Geometric Insights into Adaptive Stochastic Gradient Descent Algorithms"
(Eleh, Mwanza et al., 2025). Use this skill whenever a user asks to: implement GeoAdaLer or
GeoAdaMax as a PyTorch or TensorFlow optimizer; explain the geometric intuition behind these
methods; compare them to Adam, AMSGrad, RMSProp, or AdaGrad; reproduce or extend the paper's
experiments on MNIST, CIFAR-10, or Fashion MNIST; or discuss the Geohess theorem, regret bounds,
or deterministic/stochastic convergence proofs. Also use when the user mentions cosine-annealing
from gradient geometry, norm-based adaptive learning, or geometric interpretability of SGD.
---
# GeoAdaLer Skill
Reference guide for implementing and explaining the GeoAdaLer family of geometric adaptive
optimizers. Read `references/math.md` for full proofs; read `references/experiments.md` for
exact benchmark configurations.
---
## Core Idea
Standard SGD anneals by the raw gradient magnitude, which can cause overshooting. GeoAdaLer
replaces this with `cos θ`, where `θ` is the acute angle between the **normal to the tangent
hyperplane** of the objective and the **horizontal hyperplane**. This gives a geometrically
principled, logarithmically decaying annealing factor.
---
## Key Theorem: Geohess (Theorem 3.1)
Let `θ` be the acute angle between the normal to the tangent hyperplane of `f: Rⁿ → R` (differentiable at `x`) and the horizontal hyperplane. Then:
```
cos θ = ‖∇f(x)‖ / √(‖∇f(x)‖² + 1)
```
**Intuition:** The normal to the tangent hyperplane at `x` is `[-∇f(x), 1]ᵀ`, and `[-∇f(x), 0]ᵀ`
is its projection onto the horizontal hyperplane. The cosine of the angle between them encodes
local steepness: `cos θ` is near 0 close to an optimum (gradient ≈ 0) and near 1 far away
(gradient large).
**Annealing properties:**
- `cos θ → 1` as `‖g‖ → ∞` (large steps far from optimum)
- `cos θ → 0` as `‖g‖ → 0` (small steps near optimum)
- Decay is **logarithmic** in `‖g‖`, vs. **linear** for vanilla SGD → more controllable
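These limits can be checked numerically with a few lines of standalone Python (an illustrative sketch, not the paper's code):

```python
import math

def cos_theta(grad_norm: float) -> float:
    """Annealing factor from the Geohess theorem: cos θ = ‖g‖ / √(‖g‖² + 1)."""
    return grad_norm / math.sqrt(grad_norm**2 + 1.0)

# Near an optimum the factor vanishes; far away it saturates at 1.
print(cos_theta(0.01))   # ≈ 0.01 (tiny steps near the optimum)
print(cos_theta(100.0))  # ≈ 1.0  (full steps far from the optimum)
```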
---
## GeoAdaLer Update Rule
**Deterministic (β = 0):**
```
x_{t+1} = x_t - γ · (g_t / √(‖g_t‖² + 1))
```
**Stochastic (with EMA momentum):**
```
m_t = β·m_{t-1} + (1-β)·g_t # exponential moving average of gradients
x_{t+1} = x_t - γ · (m_t / √(‖m_t‖² + 1))
```
Where `γ` is the learning rate, `β ∈ [0,1)` is the EMA decay, `g_t = ∇f_t(x_t)`.
Note: The `+1` stability term in the denominator **arises naturally** from the geometric
construction (the normal vector has a `1` in its last component). It is not a manually tuned ε.
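The update rules above can be sketched in plain NumPy on a toy quadratic (an illustrative sketch, not the paper's reference implementation):

```python
import numpy as np

def geoadaler_step(x, m, grad, gamma=0.1, beta=0.9):
    """One GeoAdaLer step: EMA of gradients, then a cos-θ-scaled move."""
    m = beta * m + (1 - beta) * grad           # m_t = β·m_{t-1} + (1-β)·g_t
    x = x - gamma * m / np.sqrt(m @ m + 1.0)   # x_{t+1} = x_t - γ·m_t/√(‖m_t‖²+1)
    return x, m

# Minimize f(x) = ½‖x‖², whose gradient at x is x itself.
x, m = np.array([3.0, -4.0]), np.zeros(2)
for _ in range(2000):
    x, m = geoadaler_step(x, m, grad=x)
print(np.linalg.norm(x))  # → close to 0
```

Setting `beta=0` recovers the deterministic rule, since then `m_t = g_t` at every step.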
---
## GeoAdaMax Update Rule
Addresses non-monotonic squared gradients (analogous to AMSGrad's fix for Adam) by using the
running maximum of the denominator:
```
m_t = β·m_{t-1} + (1-β)·g_t
u_t = max(‖m_t‖² + 1, u_{t-1})
x_{t+1} = x_t - γ · m_t / √u_t
```
**Geometric interpretation:** Using the max denominator is equivalent to using a larger angle:
Theorem 3.2 proves `θ ≤ θ̂`, where `θ̂` is the angle corresponding to the max-norm denominator,
so step sizes are more conservative (smaller).
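A small sketch (assumed helper code, not from the paper) showing that the running-max denominator can only shrink the step factor relative to plain GeoAdaLer:

```python
import numpy as np

rng = np.random.default_rng(0)
m, u = np.zeros(3), 1.0   # EMA and running max; u_0 = ‖m_0‖² + 1 = 1
for _ in range(100):
    g = rng.normal(size=3)
    m = 0.9 * m + 0.1 * g
    d_geo = m @ m + 1.0   # GeoAdaLer denominator, recomputed fresh each step
    u = max(d_geo, u)     # GeoAdaMax: monotone running max
    assert u >= d_geo     # hence ‖m‖/√u ≤ ‖m‖/√d_geo: never a larger step
print("max-denominator steps are never larger")
```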
---
## Relationship to AdaGrad Family
| Optimizer | Denominator term `G_t` | Notes |
|-------------|------------------------------------------------|------------------------------------|
| AdaGrad | `Σ gᵢ²` (cumulative) | Monotonically decreasing LR |
| RMSProp | `β·G_{t-1} + (1-β)·g_t²` | Exponential decay on squared grads |
| Adam | Bias-corrected EMA of `g_t²`; separate `m_t` | Momentum + adaptive LR |
| GeoAdaLer | `‖m_t‖² + 1` (single EMA, norm-based) | Geometric; 1 is naturally derived |
**Key differences from Adam:**
1. Single EMA `m_t` serves as both gradient estimate AND the scaling term (two roles, one tensor)
2. Stability term `+1` is geometry-derived, not a tuned hyperparameter
3. By Jensen's inequality, `‖m_t‖²` (squared norm of the gradient EMA) is at most the EMA of `‖g_t‖²` (the sum of Adam's coordinate-wise second moments), so GeoAdaLer's denominator is smaller and its steps larger than Adam's
4. Norm-based (not coordinate-wise) adaptivity → robust to hyperparameter choices
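The Jensen-inequality claim can be checked numerically; here `v` stands in for Adam's second-moment EMA (bias correction omitted for simplicity), and summing it over coordinates gives an EMA of `‖g_t‖²`:

```python
import numpy as np

rng = np.random.default_rng(1)
beta = 0.9
m = np.zeros(4)  # EMA of gradients (GeoAdaLer's single tensor)
v = np.zeros(4)  # EMA of squared gradients (Adam-style, coordinate-wise)
for _ in range(200):
    g = rng.normal(size=4)
    m = beta * m + (1 - beta) * g
    v = beta * v + (1 - beta) * g**2
    # Jensen: ‖m_t‖² = ‖EMA(g)‖² ≤ EMA(‖g‖²) = Σᵢ v_t[i]
    assert m @ m <= v.sum() + 1e-12
print("‖m_t‖² ≤ sum of second moments at every step")
```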
---
## Convergence Summary
**Deterministic (Theorem A.5 / §5.1):** Let `g(x) = ∇f(x) / √(‖∇f(x)‖² + 1)` and
`T(x) = x - γg(x)`. If `0 < γ < 2/L_G`, then `T` is **nonexpansive** and the iterates
converge to a minimizer `x* ∈ Fix(T) = arg min f`. The best-iterate residual satisfies:
```
min_{0≤k≤N} ‖g(x_k)‖² ≤ ‖x₀ - x*‖² / [γ(2/L_G - γ)(N+1)]
```
Note: The SIAM version strengthens the arXiv result — it proves convergence via nonexpansiveness
+ Bolzano-Weierstrass rather than strict contraction, requiring only `γ < 2/L_G` (not `γ ≤ 1/L`).
**Stochastic (Theorem 5.2):** Achieves regret bound `R(T) = O(√T)` — optimal for general
convex online learning. Full regret bound:
```
R(T) ≤ D²√(G²+1)·√T/(1-β) + G(2√T - 1)/(2(1-β)) + DGβ(1-λᵀ)/((1-β)(1-λ))
```
where `D` bounds `‖x_t - x*‖`, `G` bounds `‖∇f_t‖`, and the momentum schedule is `β_t = β·λ^(t-1)` with `λ ∈ (0,1)`.
**Corollary 5.3:** `lim sup R(T)/T ≤ 0` — average regret vanishes, so GeoAdaLer asymptotically matches the best fixed decision in hindsight.
For full proofs, read `references/math.md`.
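The deterministic best-iterate bound can be sanity-checked on `f(x) = ½‖x‖²`, where `∇f(x) = x` and `g(x) = x/√(‖x‖²+1)` is 1-Lipschitz, so taking `L_G = 1` and `γ = 1` satisfies `0 < γ < 2/L_G` (an empirical check under those assumptions, not a proof):

```python
import numpy as np

gamma, L_G = 1.0, 1.0                # 0 < γ < 2/L_G holds
x = np.array([2.0, 1.0])
x0_dist2 = x @ x                     # ‖x₀ - x*‖², with minimizer x* = 0
res, N = [], 200
for _ in range(N + 1):
    g = x / np.sqrt(x @ x + 1.0)     # g(x) = ∇f(x)/√(‖∇f(x)‖² + 1)
    res.append(g @ g)                # residual ‖g(x_k)‖²
    x = x - gamma * g                # T(x) = x - γ·g(x)
bound = x0_dist2 / (gamma * (2 / L_G - gamma) * (N + 1))
print(min(res) <= bound)  # → True
```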
---
## PyTorch Implementation
Both GeoAdaLer and GeoAdaMax are implemented in a **single class** via the `geomax` flag.
Full source: https://github.com/Masuzyo/Geoadaler
```python
import torch


class Geoadaler(torch.optim.Optimizer):
    r"""Implements the GeoAdaLer optimization algorithm.

    Proposed in Eleh, Mwanza et al. (2025), arXiv:2405.16255

    Arguments:
        params: iterable of parameters to optimize
        lr: learning rate (default: 1e-3)
        beta: EMA decay for gradient moving average (default: 0.9)
        beta2: reserved parameter (default: 0.99, currently unused in step)
        eps: stability term added to denominator; default 1 is geometry-derived
            (corresponds to the +1 from the normal vector construction)
        geomax: if True, uses running max of denominator → GeoAdaMax variant
    """

    def __init__(self, params, lr=1e-3, beta=0.9, beta2=0.99, eps=1, geomax=False):
        if lr < 0.0:
            raise ValueError(f"Invalid learning rate: {lr}")
        if not 0.0 <= beta < 1.0:
            raise ValueError(f"Invalid beta: {beta}")
        if eps < 0.0:
            raise ValueError(f"Invalid epsilon: {eps}")
        defaults = dict(lr=lr, beta=beta, beta2=beta2, eps=eps, geomax=geomax)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                grad = p.grad
                state = self.state[p]
                # Initialization: seed EMA with a copy of the first gradient.
                # A plain reference would alias p.grad and corrupt it in place.
                if len(state) == 0:
                    state['step'] = 0
                    state['grad_avg'] = grad.clone()
                    state['denom'] = grad.norm(p=2).pow(2).add(group['eps'])
                grad_avg = state['grad_avg']
                beta = group['beta']
                state['step'] += 1
                # EMA update: m_t = β·m_{t-1} + (1-β)·g_t
                grad_avg.mul_(beta).add_(grad, alpha=1 - beta)
                # Denominator: ‖m_t‖² + ε
                if group['geomax']:
                    # GeoAdaMax: running max, written back so it persists
                    denom = state['denom'].max(grad_avg.norm(p=2).pow(2).add(group['eps']))
                    state['denom'] = denom
                else:
                    # GeoAdaLer: recompute fresh each step
                    denom = grad_avg.norm(p=2).pow(2).add(group['eps'])
                step_size = group['lr'] / denom.sqrt()
                # p = p - step_size * grad_avg (step_size is a 0-dim tensor, broadcast)
                p.addcmul_(grad_avg, step_size, value=-1.0)
        return loss
```
**Usage:**
```python
# GeoAdaLer (default)
optimizer = Geoadaler(model.parameters(), lr=1e-3, beta=0.9)
# GeoAdaMax variant
optimizer = Geoadaler(model.parameters(), lr=1e-3, beta=0.9, geomax=True)
# Deterministic mode (no momentum)
optimizer = Geoadaler(model.parameters(), lr=1e-3, beta=0.0)
```
---
## Hyperparameter Guide
| Hyperparameter | Default | Notes |
|----------------|---------|----------------------------------------------------------------|
| `lr` (γ) | 1e-3 | Safe range: 1e-4 to 1e-2; less sensitive than Adam |
| `beta` (β) | 0.9 | EMA decay; 0 = deterministic (no momentum) |
| `beta2` | 0.99 | Reserved; not used in current step logic |
| `eps` | 1 | Geometry-derived (+1 from normal vector); tune only for ε sensitivity experiments (Section 6.4) |
| `geomax` | False | Set True to enable GeoAdaMax (running max denominator) |
**Sensitivity note:** GeoAdaLer is more robust to hyperparameter choice than Adam due to
norm-based (rather than coordinate-wise) adaptivity (Ward et al., 2020).
---
## Benchmark Results (from paper)
| Dataset | GeoAdaLer | GeoAdaMax | Adam | AMSGrad | SGD |
|---------------|-----------|-----------|--------|---------|--------|
| MNIST | **0.9831**| **0.9831**| 0.9746 | 0.9809 | 0.9810 |
| CIFAR-10 | **0.7982**| 0.7962 | 0.7679 | 0.7932 | 0.7957 |
| Fashion MNIST | **0.9044**| 0.9042 | 0.8838 | 0.8993 | 0.8969 |
Averaged over 30 random weight initializations per experiment.
For full experiment configurations (architectures, epochs, hardware), read `references/experiments.md`.
---
## Implementation Notes
**State initialization seeds from the first gradient** — Unlike typical PyTorch optimizers,
which initialize state with `zeros_like`, both `grad_avg` and `denom` are seeded from the first
real gradient, so the first step is effectively deterministic (no momentum dilution toward zero).
The EMA must be seeded with a copy (`grad.clone()`), not the gradient tensor itself: the in-place
`mul_`/`add_` EMA update would otherwise corrupt `p.grad`.
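The aliasing pitfall can be demonstrated in NumPy, which shares PyTorch's in-place semantics for `*=` on arrays (a generic sketch, not the optimizer's code):

```python
import numpy as np

grad = np.array([1.0, 2.0])
state_alias = grad         # seeding WITHOUT a copy: both names share one buffer
state_alias *= 0.9         # in-place EMA decay...
print(grad)                # → [0.9 1.8] — the "gradient" was silently corrupted

grad = np.array([1.0, 2.0])
state_copy = grad.copy()   # seeding with a copy keeps the gradient intact
state_copy *= 0.9
print(grad)                # → [1. 2.]
```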
**Scalar step via broadcasting** — The parameter update computes `p -= step_size * grad_avg`,
where `step_size = lr / √denom` is a 0-dim tensor broadcast across the parameter. With
`addcmul_`, pass both tensors positionally and the sign via the keyword
(`addcmul_(grad_avg, step_size, value=-1.0)`); the legacy positional form
`addcmul_(value, tensor1, tensor2)` is deprecated.
**GeoAdaMax `denom` is a scalar** — The running max operates on `‖m_t‖² + ε`, a single scalar
per parameter tensor, not a per-element tensor; this is consistent with norm-based adaptivity.
The updated max must be written back to `state['denom']` each step, or the running max silently
stays at its seed value.
**`eps=1` is not a numerical stability hack** — It is the geometric `+1` term from the normal
vector derivation. Do not replace with 1e-8. The ε sensitivity experiments (Section 6.4) show
performance is relatively flat near `eps=1`.
---
## Citation
```bibtex
@article{eleh2025geoadaler,
title = {GeoAdaLer: Geometric Insights into Adaptive Stochastic Gradient Descent Algorithms},
author = {Eleh, Chinedu and Mwanza, Masuzyo and Aguegboh, Ekene and van Wyk, Hans-Werner},
journal = {arXiv preprint arXiv:2405.16255},
year = {2025},
note = {Under review (SIAM)}
}
```
Code: https://github.com/Masuzyo/Geoadaler