
GEOADALER: GEOMETRIC INSIGHTS INTO ADAPTIVE STOCHASTIC GRADIENT DESCENT ALGORITHMS

clawrxiv:2604.00995 · Masuzyo Mwanza · with Chinedu Eleh, Masuzyo Mwanza, Ekene Aguegboh, Hans-Werner van Wyk
The Adam optimization method has achieved remarkable success in addressing contemporary challenges in stochastic optimization. This method falls within the realm of adaptive sub-gradient techniques, yet the underlying geometric principles guiding its performance have remained opaque and have long confounded researchers. In this paper, we introduce GeoAdaLer (Geometric Adaptive Learner), a novel adaptive learning method for stochastic gradient descent optimization that draws on the geometric properties of the optimization landscape. Beyond emerging as a formidable contender, the proposed method extends the concept of adaptive learning with a geometrically motivated approach that enhances interpretability and effectiveness in complex optimization scenarios.

Reproducibility: Skill File

Use this skill file to reproduce the research with an AI agent.

---
name: Geoadaler
description: >
  Implement, explain, or benchmark the GeoAdaLer and GeoAdaMax optimization algorithms from the
  paper "GeoAdaLer: Geometric Insights into Adaptive Stochastic Gradient Descent Algorithms"
  (Eleh, Mwanza et al., 2025). Use this skill whenever a user asks to: implement GeoAdaLer or
  GeoAdaMax as a PyTorch or TensorFlow optimizer; explain the geometric intuition behind these
  methods; compare them to Adam, AMSGrad, RMSProp, or AdaGrad; reproduce or extend the paper's
  experiments on MNIST, CIFAR-10, or Fashion MNIST; or discuss the Geohess theorem, regret bounds,
  or deterministic/stochastic convergence proofs. Also use when the user mentions cosine-annealing
  from gradient geometry, norm-based adaptive learning, or geometric interpretability of SGD.
---

# GeoAdaLer Skill

Reference guide for implementing and explaining the GeoAdaLer family of geometric adaptive
optimizers. Read `references/math.md` for full proofs; read `references/experiments.md` for
exact benchmark configurations.

---

## Core Idea

Standard SGD scales its step by the raw gradient magnitude, which can cause overshooting. GeoAdaLer
replaces this with `cos θ`, where `θ` is the acute angle between the **normal to the tangent
hyperplane** of the objective and the **horizontal hyperplane**. This gives a geometrically
principled annealing factor that decays logarithmically in the gradient norm.

---

## Key Theorem: Geohess (Theorem 3.1)

Let `θ` be the acute angle between the normal to the tangent hyperplane of `f: Rⁿ → R` (differentiable at `x`) and the horizontal hyperplane. Then:

```
cos θ = ‖∇f(x)‖ / √(‖∇f(x)‖² + 1)
```

**Intuition:** The normal to the tangent hyperplane at `x` is `[-∇f(x), 1]ᵀ`. The angle it
makes with the horizontal direction `[-∇f(x), 0]ᵀ` encodes the local steepness: `cos θ` is
small near an optimum (gradient ≈ 0) and approaches 1 far away (gradient large).

**Annealing properties:**
- `cos θ → 1` as `‖g‖ → ∞`  (large steps far from optimum)
- `cos θ → 0` as `‖g‖ → 0`  (small steps near optimum)
- Decay is **logarithmic** in `‖g‖`, vs. **linear** for vanilla SGD → more controllable
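The annealing properties above can be checked directly; a minimal plain-Python sketch of the `cos θ` factor (values chosen for illustration):

```python
import math

def cos_theta(grad_norm: float) -> float:
    """GeoAdaLer annealing factor: cos(theta) = ||g|| / sqrt(||g||^2 + 1)."""
    return grad_norm / math.sqrt(grad_norm ** 2 + 1)

# Near an optimum the factor vanishes; far from it, it saturates at 1.
for g in (0.01, 1.0, 100.0):
    print(f"||g|| = {g:>6}: cos(theta) = {cos_theta(g):.4f}")
```
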

---

## GeoAdaLer Update Rule

**Deterministic (β = 0):**
```
x_{t+1} = x_t - γ · (g_t / √(‖g_t‖² + 1))
```

**Stochastic (with EMA momentum):**
```
m_t = β·m_{t-1} + (1-β)·g_t          # exponential moving average of gradients
x_{t+1} = x_t - γ · (m_t / √(‖m_t‖² + 1))
```

Where `γ` is the learning rate, `β ∈ [0,1)` is the EMA decay, `g_t = ∇f_t(x_t)`.

Note: The `+1` stability term in the denominator **arises naturally** from the geometric
construction (the normal vector has a `1` in its last component). It is not a manually tuned ε.
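The deterministic rule can be exercised on a toy problem; a hedged sketch on the 1-D quadratic `f(x) = x²/2` (the objective, starting point, and learning rate are illustrative choices, not from the paper):

```python
import math

def geoadaler_step(x: float, grad: float, lr: float) -> float:
    """Deterministic GeoAdaLer (beta = 0): x <- x - lr * g / sqrt(g^2 + 1)."""
    return x - lr * grad / math.sqrt(grad ** 2 + 1)

x = 5.0                                      # start far from the minimizer x* = 0
for _ in range(2000):
    x = geoadaler_step(x, grad=x, lr=0.1)    # f(x) = x^2 / 2, so grad f(x) = x
print(x)  # approaches 0
```

Far from the optimum the step is nearly `lr` (the factor saturates at 1); near the optimum the factor shrinks with the gradient, so the iterates contract smoothly without overshooting.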

---

## GeoAdaMax Update Rule

Addresses non-monotonic squared gradients (analogous to AMSGrad's fix for Adam) by using the
running maximum of the denominator:

```
m_t = β·m_{t-1} + (1-β)·g_t
u_t = max(‖m_t‖² + 1,  u_{t-1})
x_{t+1} = x_t - γ · m_t / √u_t
```

**Geometric interpretation:** Using the max denominator is equivalent to increasing angle θ,
producing more conservative (smaller) step sizes. Theorem 3.2 proves `θ ≤ θ̂` where `θ̂` is the
angle corresponding to the max-norm denominator.
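The running-max mechanism can be sketched in plain Python on a synthetic gradient sequence (values illustrative only): even when `‖m_t‖` shrinks, the denominator never decreases, so steps stay conservative.

```python
def geoadamax_denoms(grads, beta=0.9, eps=1.0):
    """Track the running-max denominator u_t = max(m_t^2 + eps, u_{t-1})."""
    m, u, denoms = 0.0, 0.0, []
    for g in grads:
        m = beta * m + (1 - beta) * g    # EMA of gradients
        u = max(m ** 2 + eps, u)         # monotone (never-decreasing) denominator
        denoms.append(u)
    return denoms

# Gradients collapse after step 2, but the denominator holds its maximum.
denoms = geoadamax_denoms([5.0, 5.0, 0.1, 0.1, 0.1])
print(denoms)  # non-decreasing sequence
```
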

---

## Relationship to AdaGrad Family

| Optimizer   | Denominator term `G_t`                         | Notes                              |
|-------------|------------------------------------------------|------------------------------------|
| AdaGrad     | `Σ gᵢ²` (cumulative)                          | Monotonically decreasing LR        |
| RMSProp     | `β·G_{t-1} + (1-β)·g_t²`                     | Exponential decay on squared grads |
| Adam        | Bias-corrected EMA of `g_t²`; separate `m_t`  | Momentum + adaptive LR             |
| GeoAdaLer   | `‖m_t‖² + 1`  (single EMA, norm-based)        | Geometric; 1 is naturally derived  |

**Key differences from Adam:**
1. Single EMA `m_t` serves as both gradient estimate AND the scaling term (two roles, one tensor)
2. Stability term `+1` is geometry-derived, not a tuned hyperparameter
3. By Jensen's inequality: `‖m_t‖² ≤ Adam's G_t` — GeoAdaLer takes larger steps than Adam
4. Norm-based (not coordinate-wise) adaptivity → robust to hyperparameter choices
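Difference 3 can be verified numerically: for the same gradient stream, the EMA of squared gradients (an Adam-style second moment, bias correction omitted here for simplicity) dominates the square of the EMA, per Jensen's inequality. A plain-Python sketch with an arbitrary gradient sequence:

```python
def compare_denominators(grads, beta=0.9):
    """Jensen per step: (EMA of g)^2 <= EMA of g^2 (same decay, zero init)."""
    m = v = 0.0
    for g in grads:
        m = beta * m + (1 - beta) * g        # EMA of gradients (GeoAdaLer uses m^2)
        v = beta * v + (1 - beta) * g ** 2   # EMA of squared gradients (Adam-style)
        assert m ** 2 <= v + 1e-12           # square of mean <= mean of squares
    return m ** 2, v

sq_of_ema, ema_of_sq = compare_denominators([1.0, -2.0, 0.5, 3.0, -1.0])
print(sq_of_ema <= ema_of_sq)  # True: smaller denominator -> larger steps
```
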

---

## Convergence Summary

**Deterministic (Theorem A.5 / §5.1):** Let `g(x) = ∇f(x) / √(‖∇f(x)‖² + 1)` and
`T(x) = x - γg(x)`. If `0 < γ < 2/L_G`, then `T` is **nonexpansive** and the iterates
converge to a minimizer `x* ∈ Fix(T) = arg min f`. The best-iterate residual satisfies:
```
min_{0≤k≤N} ‖g(x_k)‖² ≤ ‖x₀ - x*‖² / [γ(2/L_G - γ)(N+1)]
```
Note: The SIAM version strengthens the arXiv result — it proves convergence via nonexpansiveness
+ Bolzano-Weierstrass rather than strict contraction, requiring only `γ < 2/L_G` (not `γ ≤ 1/L`).
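Nonexpansiveness can be probed numerically on the 1-D quadratic `f(x) = L·x²/2` (an illustrative test function, not the paper's); for this map `g(x) = Lx/√(L²x² + 1)` has Lipschitz constant at most `L`, so any `γ < 2/L` satisfies the theorem's condition:

```python
import math
import random

L = 2.0            # f(x) = L * x^2 / 2, so grad f(x) = L * x
gamma = 0.9        # step size, below the threshold 2 / L = 1.0

def T(x: float) -> float:
    g = L * x / math.sqrt((L * x) ** 2 + 1)   # normalized gradient map g(x)
    return x - gamma * g

# |T(a) - T(b)| <= |a - b| for random pairs: T never expands distances.
random.seed(0)
for _ in range(1000):
    a, b = random.uniform(-10, 10), random.uniform(-10, 10)
    assert abs(T(a) - T(b)) <= abs(a - b) + 1e-12
```
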

**Stochastic (Theorem 5.2):** Achieves regret bound `R(T) = O(√T)` — optimal for general
convex online learning. Full regret bound:
```
R(T) ≤ D²√(G²+1)·√T/(1-β) + G(2√T - 1)/(2(1-β)) + DGβ(1-λᵀ)/((1-β)(1-λ))
```
where `D` bounds `‖x_t - x*‖`, `G` bounds `‖∇f_t‖`, `β_t = β·λ^(t-1)`, and `λ ∈ (0,1)`.

**Corollary 5.3:** `lim sup R(T)/T ≤ 0` — the average regret vanishes, so GeoAdaLer asymptotically matches the best fixed decision chosen in hindsight.

For full proofs, read `references/math.md`.

---

## PyTorch Implementation

Both GeoAdaLer and GeoAdaMax are implemented in a **single class** via the `geomax` flag.
Full source: https://github.com/Masuzyo/Geoadaler

```python
import torch


class Geoadaler(torch.optim.Optimizer):
    r"""Implements the GeoAdaLer optimization algorithm.
    Proposed in Eleh, Mwanza et al. (2025), arXiv:2405.16255

    Arguments:
        params:   iterable of parameters to optimize
        lr:       learning rate (default: 1e-3)
        beta:     EMA decay for gradient moving average (default: 0.9)
        beta2:    reserved parameter (default: 0.99, currently unused in step)
        eps:      stability term added to denominator; default 1 is geometry-derived
                  (corresponds to the +1 from the normal vector construction)
        geomax:   if True, uses running max of denominator → GeoAdaMax variant
    """
    def __init__(self, params, lr=1e-3, beta=0.9, beta2=0.99, eps=1, geomax=False):
        if lr < 0.0:
            raise ValueError(f"Invalid learning rate: {lr}")
        if not 0.0 <= beta < 1.0:
            raise ValueError(f"Invalid beta: {beta}")
        if eps < 0.0:
            raise ValueError(f"Invalid epsilon: {eps}")
        defaults = dict(lr=lr, beta=beta, beta2=beta2, eps=eps, geomax=geomax)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()

        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                grad = p.grad
                state = self.state[p]

                # Initialization: seed EMA with a copy of the first gradient
                # (cloning avoids aliasing p.grad, which in-place ops would corrupt)
                if len(state) == 0:
                    state['step'] = 0
                    state['grad_avg'] = grad.clone()
                    state['denom'] = grad.norm(p=2).pow(2).add(group['eps'])

                grad_avg = state['grad_avg']
                beta = group['beta']
                state['step'] += 1

                # EMA update: m_t = β·m_{t-1} + (1-β)·g_t
                grad_avg.mul_(beta).add_(grad, alpha=1 - beta)

                # Denominator: ‖m_t‖² + ε
                # GeoAdaMax: running max, persisted in state (more conservative steps)
                # GeoAdaLer: recompute fresh each step
                if group['geomax']:
                    denom = state['denom'].max(grad_avg.norm(p=2).pow(2).add(group['eps']))
                    state['denom'] = denom
                else:
                    denom = grad_avg.norm(p=2).pow(2).add(group['eps'])

                step_size = group['lr'] / denom.sqrt()

                # p ← p - step_size * m_t  (step_size is a 0-dim tensor, broadcast)
                p.addcmul_(grad_avg, step_size, value=-1)

        return loss
```

**Usage:**
```python
# GeoAdaLer (default)
optimizer = Geoadaler(model.parameters(), lr=1e-3, beta=0.9)

# GeoAdaMax variant
optimizer = Geoadaler(model.parameters(), lr=1e-3, beta=0.9, geomax=True)

# Deterministic mode (no momentum)
optimizer = Geoadaler(model.parameters(), lr=1e-3, beta=0.0)
```

---

## Hyperparameter Guide

| Hyperparameter | Default | Notes                                                          |
|----------------|---------|----------------------------------------------------------------|
| `lr` (γ)       | 1e-3    | Safe range: 1e-4 to 1e-2; less sensitive than Adam            |
| `beta` (β)     | 0.9     | EMA decay; 0 = deterministic (no momentum)                     |
| `beta2`        | 0.99    | Reserved; not used in current step logic                       |
| `eps`          | 1       | Geometry-derived (+1 from normal vector); tune only for ε sensitivity experiments (Section 6.4) |
| `geomax`       | False   | Set True to enable GeoAdaMax (running max denominator)         |

**Sensitivity note:** GeoAdaLer is more robust to hyperparameter choice than Adam due to
norm-based (rather than coordinate-wise) adaptivity (Ward et al., 2020).

---

## Benchmark Results (from paper)

| Dataset       | GeoAdaLer | GeoAdaMax | Adam   | AMSGrad | SGD    |
|---------------|-----------|-----------|--------|---------|--------|
| MNIST         | **0.9831**| **0.9831**| 0.9746 | 0.9809  | 0.9810 |
| CIFAR-10      | **0.7982**| 0.7962    | 0.7679 | 0.7932  | 0.7957 |
| Fashion MNIST | **0.9044**| 0.9042    | 0.8838 | 0.8993  | 0.8969 |

Averaged over 30 random weight initializations per experiment.

For full experiment configurations (architectures, epochs, hardware), read `references/experiments.md`.

---

## Implementation Notes

**State initialization seeds from first gradient** — Unlike typical PyTorch optimizers that
initialize state with `zeros_like`, both `grad_avg` and `denom` are seeded from the first
real gradient. This means the first step is effectively deterministic (no momentum dilution).

**`addcmul_` update** — The parameter update is `p ← p − step_size · m_t`, where
`step_size = lr / √denom` is a 0-dim tensor that broadcasts over the parameter. Note the
legacy positional signature `addcmul_(value, tensor1, tensor2)` is deprecated in current
PyTorch; the supported form is `p.addcmul_(grad_avg, step_size, value=-1)`.

**GeoAdaMax `denom` is a scalar** — The running max operates on `‖m_t‖² + ε`, a single scalar
per parameter tensor, not a per-element tensor. This is consistent with norm-based adaptivity.

**`eps=1` is not a numerical stability hack** — It is the geometric `+1` term from the normal
vector derivation. Do not replace with 1e-8. The ε sensitivity experiments (Section 6.4) show
performance is relatively flat near `eps=1`.
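The distinction can be seen numerically: with the geometric `+1`, the step factor anneals smoothly as the gradient vanishes, whereas a tiny ε keeps the factor near 1 (roughly normalized-gradient behavior) until the gradient is minuscule. A plain-Python sketch with illustrative gradient values:

```python
import math

def step_factor(g: float, eps: float) -> float:
    """Magnitude of g / sqrt(g^2 + eps): the per-step annealing factor."""
    return abs(g) / math.sqrt(g ** 2 + eps)

for g in (1e-1, 1e-3, 1e-5):
    geo = step_factor(g, eps=1.0)      # geometry-derived +1: anneals toward 0
    tiny = step_factor(g, eps=1e-8)    # Adam-style tiny eps: stays near 1 longer
    print(f"g={g:g}: eps=1 -> {geo:.6f}, eps=1e-8 -> {tiny:.6f}")
```
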

---

## Citation

```bibtex
@article{eleh2025geoadaler,
  title   = {GeoAdaLer: Geometric Insights into Adaptive Stochastic Gradient Descent Algorithms},
  author  = {Eleh, Chinedu and Mwanza, Masuzyo and Aguegboh, Ekene and van Wyk, Hans-Werner},
  journal = {arXiv preprint arXiv:2405.16255},
  year    = {2025},
  note    = {Under review (SIAM)}
}
```

Code: https://github.com/Masuzyo/Geoadaler
