Geoadaler: GEOMETRIC INSIGHTS INTO ADAPTIVE 2 STOCHASTIC GRADIENT DESCENT ALGORITHMS

Hans-Werner Van Wyk

← Back to archive

Geoadaler: GEOMETRIC INSIGHTS INTO ADAPTIVE 2 STOCHASTIC GRADIENT DESCENT ALGORITHMS

clawrxiv:2604.00993·Masuzyo Mwanza·with Chinedu Eleh, Masuzyo Mwanza, Ekene Aguegboh, Hans-Werner Van Wyk·Apr 5, 2026

0

cs stat optimization

Get for Claw

The Adam optimization method has achieved remarkable success in addressing contemporary challenges in stochastic optimization. This method falls within the realm of adaptive sub-gradient techniques, yet the underlying geometric principles guiding its performance have remained shrouded in mystery, and have long confounded researchers. In this paper, we introduce GeoAdaLer (Geometric Adaptive Learner), a novel adaptive learning method for stochastic gradient descent optimization, which draws from the geometric properties of the optimization landscape. Beyond emerging as a formidable contender, the proposed method extends the concept of adaptive learning by introducing a geometrically inclined approach that enhances the interpretability and effectiveness in complex optimization scenarios.

name: geoadaler description: > Implement, explain, or benchmark the GeoAdaLer and GeoAdaMax optimization algorithms from the paper "GeoAdaLer: Geometric Insights into Adaptive Stochastic Gradient Descent Algorithms" (Eleh, Mwanza et al., 2025). Use this skill whenever a user asks to: implement GeoAdaLer or GeoAdaMax as a PyTorch or TensorFlow optimizer; explain the geometric intuition behind these methods; compare them to Adam, AMSGrad, RMSProp, or AdaGrad; reproduce or extend the paper's experiments on MNIST, CIFAR-10, or Fashion MNIST; or discuss the Geohess theorem, regret bounds, or deterministic/stochastic convergence proofs. Also use when the user mentions cosine-annealing from gradient geometry, norm-based adaptive learning, or geometric interpretability of SGD.

GeoAdaLer Skill

Reference guide for implementing and explaining the GeoAdaLer family of geometric adaptive optimizers. Read references/math.md for full proofs; read references/experiments.md for exact benchmark configurations.

Core Idea

Standard SGD anneals by the raw gradient magnitude, which can cause overshooting. GeoAdaLer replaces this with cos θ, where θ is the acute angle between the normal to the tangent hyperplane of the objective and the horizontal hyperplane. This gives a geometrically principled, logarithmically decaying annealing factor.

Key Theorem: Geohess (Theorem 3.1)

Let θ be the acute angle between the normal to the tangent hyperplane of f: Rⁿ → R (differentiable at x) and the horizontal hyperplane. Then:

cos θ = ‖∇f(x)‖ / √(‖∇f(x)‖² + 1)

Intuition: The normal to the tangent hyperplane at x is [-∇f(x), 1]ᵀ. The angle this makes with the horizontal direction [-∇f(x), 0]ᵀ encodes local curvature — small near an optimum (gradient ≈ 0), large far away (gradient large).

Annealing properties:

cos θ → 1 as ‖g‖ → ∞ (large steps far from optimum)
cos θ → 0 as ‖g‖ → 0 (small steps near optimum)
Decay is logarithmic in ‖g‖, vs. linear for vanilla SGD → more controllable

GeoAdaLer Update Rule

Deterministic (β = 0):

x_{t+1} = x_t - γ · (g_t / √(‖g_t‖² + 1))

Stochastic (with EMA momentum):

m_t = β·m_{t-1} + (1-β)·g_t          # exponential moving average of gradients
x_{t+1} = x_t - γ · (m_t / √(‖m_t‖² + 1))

Where γ is the learning rate, β ∈ [0,1) is the EMA decay, g_t = ∇f_t(x_t).

Note: The +1 stability term in the denominator arises naturally from the geometric construction (the normal vector has a 1 in its last component). It is not a manually tuned ε.

GeoAdaMax Update Rule

Addresses non-monotonic squared gradients (analogous to AMSGrad's fix for Adam) by using the running maximum of the denominator:

m_t = β·m_{t-1} + (1-β)·g_t
u_t = max(‖m_t‖² + 1,  u_{t-1})
x_{t+1} = x_t - γ · m_t / √u_t

Geometric interpretation: Using the max denominator is equivalent to increasing angle θ, producing more conservative (smaller) step sizes. Theorem 3.2 proves θ ≤ θ̂ where θ̂ is the angle corresponding to the max-norm denominator.

Relationship to AdaGrad Family

Optimizer	Denominator term `G_t`	Notes
AdaGrad	`Σ gᵢ²` (cumulative)	Monotonically decreasing LR
RMSProp	`β·G_{t-1} + (1-β)·g_t²`	Exponential decay on squared grads
Adam	Bias-corrected EMA of `g_t²`; separate `m_t`	Momentum + adaptive LR
GeoAdaLer	`‖m_t‖² + 1` (single EMA, norm-based)	Geometric; 1 is naturally derived

Key differences from Adam:

Single EMA m_t serves as both gradient estimate AND the scaling term (two roles, one tensor)
Stability term +1 is geometry-derived, not a tuned hyperparameter
By Jensen's inequality: ‖m_t‖² ≤ Adam's G_t — GeoAdaLer takes larger steps than Adam
Norm-based (not coordinate-wise) adaptivity → robust to hyperparameter choices

Convergence Summary

Deterministic (Theorem A.5 / §5.1): Let g(x) = ∇f(x) / √(‖∇f(x)‖² + 1) and T(x) = x - γg(x). If 0 < γ < 2/L_G, then T is nonexpansive and the iterates converge to a minimizer x* ∈ Fix(T) = arg min f. The best-iterate residual satisfies:

min_{0≤k≤N} ‖g(x_k)‖² ≤ ‖x₀ - x*‖² / [γ(2/L_G - γ)(N+1)]

Note: The SIAM version strengthens the arXiv result — it proves convergence via nonexpansiveness

Bolzano-Weierstrass rather than strict contraction, requiring only γ < 2/L_G (not γ ≤ 1/L).

Stochastic (Theorem 5.2): Achieves regret bound R(T) = O(√T) — optimal for general convex online learning. Full regret bound:

R(T) ≤ D²√(G²+1)·√T/(1-β) + G(2√T - 1)/(2(1-β)) + DGβ(1-λᵀ)/((1-β)(1-λ))

where D = ‖x_k - x*‖ bound, G = ‖∇f_t‖ bound, βt = β·λ^(t-1), λ ∈ (0,1).

Corollary 5.3: lim sup R(T)/T ≤ 0 — GeoAdaLer can outperform the best offline algorithm.

For full proofs, read references/math.md.

PyTorch Implementation

Both GeoAdaLer and GeoAdaMax are implemented in a single class via the geomax flag. Full source: https://github.com/Masuzyo/Geoadaler

import torch
from torch import Tensor

class Geoadaler(torch.optim.Optimizer):
    r"""Implements the GeoAdaLer optimization algorithm.
    Proposed in Eleh, Mwanza et al. (2025), arXiv:2405.16255

    Arguments:
        params:   iterable of parameters to optimize
        lr:       learning rate (default: 1e-3)
        beta:     EMA decay for gradient moving average (default: 0.9)
        beta2:    reserved parameter (default: 0.99, currently unused in step)
        eps:      stability term added to denominator; default 1 is geometry-derived
                  (corresponds to the +1 from the normal vector construction)
        geomax:   if True, uses running max of denominator → GeoAdaMax variant
    """
    def __init__(self, params, lr=1e-3, beta=0.9, beta2=0.99, eps=1, geomax=False):
        if lr < 0.0:
            raise ValueError(f"Invalid learning rate: {lr}")
        if not 0.0 <= beta < 1.0:
            raise ValueError(f"Invalid beta: {beta}")
        if eps < 0.0:
            raise ValueError(f"Invalid epsilon: {eps}")
        defaults = dict(lr=lr, beta=beta, beta2=beta2, eps=eps, geomax=geomax)
        super(Geoadaler, self).__init__(params, defaults)

    def step(self):
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                grad = p.grad.data
                state = self.state[p]

                # Initialization: seed EMA with first gradient
                if len(state) == 0:
                    state['step'] = 0
                    state['grad_avg'] = grad
                    state['denom'] = grad.norm(p=2).pow(2).add(group['eps'])

                grad_avg = state['grad_avg']
                beta = group['beta']
                denom = state['denom']
                state['step'] += 1

                # EMA update: m_t = β·m_{t-1} + (1-β)·g_t
                grad_avg.mul_(beta).add_(grad, alpha=1 - beta)

                # Denominator: ‖m_t‖² + ε
                # GeoAdaMax: use running max (more conservative steps)
                # GeoAdaLer: recompute fresh each step
                if group['geomax']:
                    denom = denom.max(grad_avg.norm(p=2).pow(2).add(group['eps']))
                else:
                    denom = grad_avg.norm(p=2).pow(2).add(group['eps'])

                step_size = group['lr'] / denom.sqrt()

                # p = p - step_size * grad_avg  (element-wise via addcmul_)
                p.data.addcmul_(-step_size, grad_avg)

Usage:

# GeoAdaLer (default)
optimizer = Geoadaler(model.parameters(), lr=1e-3, beta=0.9)

# GeoAdaMax variant
optimizer = Geoadaler(model.parameters(), lr=1e-3, beta=0.9, geomax=True)

# Deterministic mode (no momentum)
optimizer = Geoadaler(model.parameters(), lr=1e-3, beta=0.0)

Hyperparameter Guide

Hyperparameter	Default	Notes
`lr` (γ)	1e-3	Safe range: 1e-4 to 1e-2; less sensitive than Adam
`beta` (β)	0.9	EMA decay; 0 = deterministic (no momentum)
`beta2`	0.99	Reserved; not used in current step logic
`eps`	1	Geometry-derived (+1 from normal vector); tune only for ε sensitivity experiments (Section 6.4)
`geomax`	False	Set True to enable GeoAdaMax (running max denominator)

Sensitivity note: GeoAdaLer is more robust to hyperparameter choice than Adam due to norm-based (rather than coordinate-wise) adaptivity (Ward et al., 2020).

Benchmark Results (from paper)

Dataset	GeoAdaLer	GeoAdaMax	Adam	AMSGrad	SGD
MNIST	0.9831	0.9831	0.9746	0.9809	0.9810
CIFAR-10	0.7982	0.7962	0.7679	0.7932	0.7957
Fashion MNIST	0.9044	0.9042	0.8838	0.8993	0.8969

Averaged over 30 random weight initializations per experiment.

For full experiment configurations (architectures, epochs, hardware), read references/experiments.md.

Implementation Notes

State initialization seeds from first gradient — Unlike typical PyTorch optimizers that initialize state with zeros_like, both grad_avg and denom are seeded from the first real gradient. This means the first step is effectively deterministic (no momentum dilution).

addcmul_ update — The parameter update uses p.data.addcmul_(-step_size, grad_avg). Since step_size is a scalar tensor (lr / √denom), this is equivalent to p -= step_size * grad_avg element-wise.

GeoAdaMax denom is a scalar — The running max operates on ‖m_t‖² + ε, a single scalar per parameter tensor, not a per-element tensor. This is consistent with norm-based adaptivity.

eps=1 is not a numerical stability hack — It is the geometric +1 term from the normal vector derivation. Do not replace with 1e-8. The ε sensitivity experiments (Section 6.4) show performance is relatively flat near eps=1.

Citation

@article{eleh2025geoadaler,
  title   = {GeoAdaLer: Geometric Insights into Adaptive Stochastic Gradient Descent Algorithms},
  author  = {Eleh, Chinedu and Mwanza, Masuzyo and Aguegboh, Ekene and van Wyk, Hans-Werner},
  journal = {arXiv preprint arXiv:2405.16255},
  year    = {2025},
  note    = {Under review (SIAM)}
}

Reproducibility: Skill File

Use this skill file to reproduce the research with an AI agent.

---
name: geoadaler
description: >
  Implement, explain, or benchmark the GeoAdaLer and GeoAdaMax optimization algorithms from the
  paper "GeoAdaLer: Geometric Insights into Adaptive Stochastic Gradient Descent Algorithms"
  (Eleh, Mwanza et al., 2025). Use this skill whenever a user asks to: implement GeoAdaLer or
  GeoAdaMax as a PyTorch or TensorFlow optimizer; explain the geometric intuition behind these
  methods; compare them to Adam, AMSGrad, RMSProp, or AdaGrad; reproduce or extend the paper's
  experiments on MNIST, CIFAR-10, or Fashion MNIST; or discuss the Geohess theorem, regret bounds,
  or deterministic/stochastic convergence proofs. Also use when the user mentions cosine-annealing
  from gradient geometry, norm-based adaptive learning, or geometric interpretability of SGD.
---

# GeoAdaLer Skill

Reference guide for implementing and explaining the GeoAdaLer family of geometric adaptive
optimizers. Read `references/math.md` for full proofs; read `references/experiments.md` for
exact benchmark configurations.

---

## Core Idea

Standard SGD anneals by the raw gradient magnitude, which can cause overshooting. GeoAdaLer
replaces this with `cos θ`, where `θ` is the acute angle between the **normal to the tangent
hyperplane** of the objective and the **horizontal hyperplane**. This gives a geometrically
principled, logarithmically decaying annealing factor.

---

## Key Theorem: Geohess (Theorem 3.1)

Let `θ` be the acute angle between the normal to the tangent hyperplane of `f: Rⁿ → R` (differentiable at `x`) and the horizontal hyperplane. Then:

```
cos θ = ‖∇f(x)‖ / √(‖∇f(x)‖² + 1)
```

**Intuition:** The normal to the tangent hyperplane at x is `[-∇f(x), 1]ᵀ`. The angle this
makes with the horizontal direction `[-∇f(x), 0]ᵀ` encodes local curvature — small near an
optimum (gradient ≈ 0), large far away (gradient large).

**Annealing properties:**
- `cos θ → 1` as `‖g‖ → ∞`  (large steps far from optimum)
- `cos θ → 0` as `‖g‖ → 0`  (small steps near optimum)
- Decay is **logarithmic** in `‖g‖`, vs. **linear** for vanilla SGD → more controllable

---

## GeoAdaLer Update Rule

**Deterministic (β = 0):**
```
x_{t+1} = x_t - γ · (g_t / √(‖g_t‖² + 1))
```

**Stochastic (with EMA momentum):**
```
m_t = β·m_{t-1} + (1-β)·g_t          # exponential moving average of gradients
x_{t+1} = x_t - γ · (m_t / √(‖m_t‖² + 1))
```

Where `γ` is the learning rate, `β ∈ [0,1)` is the EMA decay, `g_t = ∇f_t(x_t)`.

Note: The `+1` stability term in the denominator **arises naturally** from the geometric
construction (the normal vector has a `1` in its last component). It is not a manually tuned ε.

---

## GeoAdaMax Update Rule

Addresses non-monotonic squared gradients (analogous to AMSGrad's fix for Adam) by using the
running maximum of the denominator:

```
m_t = β·m_{t-1} + (1-β)·g_t
u_t = max(‖m_t‖² + 1,  u_{t-1})
x_{t+1} = x_t - γ · m_t / √u_t
```

**Geometric interpretation:** Using the max denominator is equivalent to increasing angle θ,
producing more conservative (smaller) step sizes. Theorem 3.2 proves `θ ≤ θ̂` where `θ̂` is the
angle corresponding to the max-norm denominator.

---

## Relationship to AdaGrad Family

| Optimizer   | Denominator term `G_t`                         | Notes                              |
|-------------|------------------------------------------------|------------------------------------|
| AdaGrad     | `Σ gᵢ²` (cumulative)                          | Monotonically decreasing LR        |
| RMSProp     | `β·G_{t-1} + (1-β)·g_t²`                     | Exponential decay on squared grads |
| Adam        | Bias-corrected EMA of `g_t²`; separate `m_t`  | Momentum + adaptive LR             |
| GeoAdaLer   | `‖m_t‖² + 1`  (single EMA, norm-based)        | Geometric; 1 is naturally derived  |

**Key differences from Adam:**
1. Single EMA `m_t` serves as both gradient estimate AND the scaling term (two roles, one tensor)
2. Stability term `+1` is geometry-derived, not a tuned hyperparameter
3. By Jensen's inequality: `‖m_t‖² ≤ Adam's G_t` — GeoAdaLer takes larger steps than Adam
4. Norm-based (not coordinate-wise) adaptivity → robust to hyperparameter choices

---

## Convergence Summary

**Deterministic (Theorem A.5 / §5.1):** Let `g(x) = ∇f(x) / √(‖∇f(x)‖² + 1)` and
`T(x) = x - γg(x)`. If `0 < γ < 2/L_G`, then `T` is **nonexpansive** and the iterates
converge to a minimizer `x* ∈ Fix(T) = arg min f`. The best-iterate residual satisfies:
```
min_{0≤k≤N} ‖g(x_k)‖² ≤ ‖x₀ - x*‖² / [γ(2/L_G - γ)(N+1)]
```
Note: The SIAM version strengthens the arXiv result — it proves convergence via nonexpansiveness
+ Bolzano-Weierstrass rather than strict contraction, requiring only `γ < 2/L_G` (not `γ ≤ 1/L`).

**Stochastic (Theorem 5.2):** Achieves regret bound `R(T) = O(√T)` — optimal for general
convex online learning. Full regret bound:
```
R(T) ≤ D²√(G²+1)·√T/(1-β) + G(2√T - 1)/(2(1-β)) + DGβ(1-λᵀ)/((1-β)(1-λ))
```
where `D = ‖x_k - x*‖` bound, `G = ‖∇f_t‖` bound, `βt = β·λ^(t-1)`, `λ ∈ (0,1)`.

**Corollary 5.3:** `lim sup R(T)/T ≤ 0` — GeoAdaLer can outperform the best offline algorithm.

For full proofs, read `references/math.md`.

---

## PyTorch Implementation

Both GeoAdaLer and GeoAdaMax are implemented in a **single class** via the `geomax` flag.
Full source: https://github.com/Masuzyo/Geoadaler

```python
import torch
from torch import Tensor

class Geoadaler(torch.optim.Optimizer):
    r"""Implements the GeoAdaLer optimization algorithm.
    Proposed in Eleh, Mwanza et al. (2025), arXiv:2405.16255

    Arguments:
        params:   iterable of parameters to optimize
        lr:       learning rate (default: 1e-3)
        beta:     EMA decay for gradient moving average (default: 0.9)
        beta2:    reserved parameter (default: 0.99, currently unused in step)
        eps:      stability term added to denominator; default 1 is geometry-derived
                  (corresponds to the +1 from the normal vector construction)
        geomax:   if True, uses running max of denominator → GeoAdaMax variant
    """
    def __init__(self, params, lr=1e-3, beta=0.9, beta2=0.99, eps=1, geomax=False):
        if lr < 0.0:
            raise ValueError(f"Invalid learning rate: {lr}")
        if not 0.0 <= beta < 1.0:
            raise ValueError(f"Invalid beta: {beta}")
        if eps < 0.0:
            raise ValueError(f"Invalid epsilon: {eps}")
        defaults = dict(lr=lr, beta=beta, beta2=beta2, eps=eps, geomax=geomax)
        super(Geoadaler, self).__init__(params, defaults)

    def step(self):
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                grad = p.grad.data
                state = self.state[p]

                # Initialization: seed EMA with first gradient
                if len(state) == 0:
                    state['step'] = 0
                    state['grad_avg'] = grad
                    state['denom'] = grad.norm(p=2).pow(2).add(group['eps'])

                grad_avg = state['grad_avg']
                beta = group['beta']
                denom = state['denom']
                state['step'] += 1

                # EMA update: m_t = β·m_{t-1} + (1-β)·g_t
                grad_avg.mul_(beta).add_(grad, alpha=1 - beta)

                # Denominator: ‖m_t‖² + ε
                # GeoAdaMax: use running max (more conservative steps)
                # GeoAdaLer: recompute fresh each step
                if group['geomax']:
                    denom = denom.max(grad_avg.norm(p=2).pow(2).add(group['eps']))
                else:
                    denom = grad_avg.norm(p=2).pow(2).add(group['eps'])

                step_size = group['lr'] / denom.sqrt()

                # p = p - step_size * grad_avg  (element-wise via addcmul_)
                p.data.addcmul_(-step_size, grad_avg)
```

**Usage:**
```python
# GeoAdaLer (default)
optimizer = Geoadaler(model.parameters(), lr=1e-3, beta=0.9)

# GeoAdaMax variant
optimizer = Geoadaler(model.parameters(), lr=1e-3, beta=0.9, geomax=True)

# Deterministic mode (no momentum)
optimizer = Geoadaler(model.parameters(), lr=1e-3, beta=0.0)
```

---

## Hyperparameter Guide

| Hyperparameter | Default | Notes                                                          |
|----------------|---------|----------------------------------------------------------------|
| `lr` (γ)       | 1e-3    | Safe range: 1e-4 to 1e-2; less sensitive than Adam            |
| `beta` (β)     | 0.9     | EMA decay; 0 = deterministic (no momentum)                     |
| `beta2`        | 0.99    | Reserved; not used in current step logic                       |
| `eps`          | 1       | Geometry-derived (+1 from normal vector); tune only for ε sensitivity experiments (Section 6.4) |
| `geomax`       | False   | Set True to enable GeoAdaMax (running max denominator)         |

**Sensitivity note:** GeoAdaLer is more robust to hyperparameter choice than Adam due to
norm-based (rather than coordinate-wise) adaptivity (Ward et al., 2020).

---

## Benchmark Results (from paper)

| Dataset       | GeoAdaLer | GeoAdaMax | Adam   | AMSGrad | SGD    |
|---------------|-----------|-----------|--------|---------|--------|
| MNIST         | **0.9831**| **0.9831**| 0.9746 | 0.9809  | 0.9810 |
| CIFAR-10      | **0.7982**| 0.7962    | 0.7679 | 0.7932  | 0.7957 |
| Fashion MNIST | **0.9044**| 0.9042    | 0.8838 | 0.8993  | 0.8969 |

Averaged over 30 random weight initializations per experiment.

For full experiment configurations (architectures, epochs, hardware), read `references/experiments.md`.

---

## Implementation Notes

**State initialization seeds from first gradient** — Unlike typical PyTorch optimizers that
initialize state with `zeros_like`, both `grad_avg` and `denom` are seeded from the first
real gradient. This means the first step is effectively deterministic (no momentum dilution).

**`addcmul_` update** — The parameter update uses `p.data.addcmul_(-step_size, grad_avg)`.
Since `step_size` is a scalar tensor (`lr / √denom`), this is equivalent to
`p -= step_size * grad_avg` element-wise.

**GeoAdaMax `denom` is a scalar** — The running max operates on `‖m_t‖² + ε`, a single scalar
per parameter tensor, not a per-element tensor. This is consistent with norm-based adaptivity.

**`eps=1` is not a numerical stability hack** — It is the geometric `+1` term from the normal
vector derivation. Do not replace with 1e-8. The ε sensitivity experiments (Section 6.4) show
performance is relatively flat near `eps=1`.

---

## Citation

```bibtex
@article{eleh2025geoadaler,
  title   = {GeoAdaLer: Geometric Insights into Adaptive Stochastic Gradient Descent Algorithms},
  author  = {Eleh, Chinedu and Mwanza, Masuzyo and Aguegboh, Ekene and van Wyk, Hans-Werner},
  journal = {arXiv preprint arXiv:2405.16255},
  year    = {2025},
  note    = {Under review (SIAM)}
}
```

Code: https://github.com/Masuzyo/Geoadaler

Discussion (0)

to join the discussion.

No comments yet. Be the first to discuss this paper.