Limits & Continuity · foundational · 40 min read

Sequences, Limits & Convergence

The rigorous foundation of calculus — making 'arbitrarily close' precise with the ε-N definition, and the convergence theorems that guarantee your algorithms actually converge

Abstract. A sequence in ℝ is a function a: ℕ → ℝ. We say (aₙ) converges to L if for every ε > 0, there exists N ∈ ℕ such that n ≥ N implies |aₙ - L| < ε. This definition — the ε-N definition — is where calculus becomes rigorous. The Monotone Convergence Theorem guarantees that every bounded monotone sequence converges, the Squeeze Theorem lets us establish limits by trapping a sequence between two convergent bounds, and the Algebra of Limits lets us decompose complex limits into simpler pieces. The Bolzano-Weierstrass Theorem — every bounded sequence has a convergent subsequence — is the compactness result that underpins the existence of minimizers in optimization. Cauchy sequences provide a criterion for convergence that does not require knowing the limit in advance, and the completeness of ℝ (every Cauchy sequence converges) is the structural property that makes calculus work. For machine learning, gradient descent θₜ₊₁ = θₜ - η∇f(θₜ) defines a sequence whose convergence is the entire point; the rate of convergence — sublinear O(1/n) for SGD, linear O(rⁿ) for GD on strongly convex functions, quadratic for Newton's method — determines practical algorithm selection. The Robbins-Monro conditions on learning rate schedules (Σηₜ = ∞, Σηₜ² < ∞) are statements about series convergence that we formalize here.

Where this leads → formalML

  • formalML Gradient descent defines a sequence θₜ in ℝᵈ whose convergence analysis — rates, conditions, step-size requirements — rests directly on the sequence convergence theory developed here.
  • formalML The convergence of empirical risk R̂ₙ(h) → R(h) as n → ∞ is a sequence convergence result. Uniform convergence of empirical processes (Glivenko-Cantelli) extends this to function classes.
  • formalML MCMC sampling produces a sequence of states whose convergence to the target distribution is characterized by mixing time — how many steps until the sequence is 'close enough.'
  • formalML Concentration inequalities quantify convergence rates for sequences of random variables — how fast empirical averages converge to expectations.
  • formalML Modes of convergence in probability (a.s., in probability, in distribution, in Lᵖ) generalize the sequence convergence concepts introduced here to random variable sequences.

Overview & Motivation

You’ve watched a training loss curve drop toward zero and level off. You’ve seen gradient descent “converge.” You’ve tuned a learning rate schedule until the optimizer “settles down.” But what do any of these phrases actually mean?

When we say a sequence of iterates θ₁, θ₂, θ₃, … converges, we’re making a precise mathematical claim: no matter how tight a tolerance you specify — a billionth, a trillionth, any positive number at all — there is some point in the sequence past which every single term stays within that tolerance of the limit. This is the ε-N definition, and it is where calculus becomes rigorous.

In this topic, we build the complete machinery of sequence convergence from scratch. We start with what a sequence is, move to the ε-N definition that makes “arbitrarily close” precise, prove the core convergence theorems that let us establish limits without computing them directly, and develop the Cauchy criterion that lets us detect convergence without even knowing the limit. Every theorem here has a direct application to machine learning — gradient descent convergence, optimizer guarantees, and the learning rate conditions that make stochastic methods work.

Sequences in ℝ

We begin with the most basic object in analysis: a sequence.

📐 Definition 1 (Sequence)

A sequence in ℝ is a function a: ℕ → ℝ. We write (aₙ)_{n=1}^∞ or simply (aₙ) for the sequence, where aₙ = a(n) denotes the n-th term.

The notation is important: (aₙ) with parentheses denotes the entire sequence as an object, while aₙ without parentheses denotes a single term. We’ll be careful about this distinction throughout.

Here are four sequences that appear constantly in machine learning:

  • aₙ = 1/n — the simplest decaying sequence. Learning rate schedules like ηₜ = c/t produce sublinear convergence rates for SGD.
  • aₙ = (1 + 1/n)ⁿ — converges to Euler’s number e ≈ 2.718. The natural exponential shows up in softmax, cross-entropy loss, and the exponential family of distributions.
  • aₙ = 0.9ⁿ — geometric (exponential) decay. This is the momentum coefficient in Adam (β₁ = 0.9), the discount factor in reinforcement learning, and the convergence rate of gradient descent on strongly convex functions.
  • aₙ = (-1)ⁿ/n — oscillates but converges to 0. SGD with noisy gradients exhibits exactly this kind of oscillating convergence.

Learning rate schedule: ηₜ = c/√t gives O(1/√t) SGD convergence

📐 Definition 2 (Bounded Sequence)

A sequence (aₙ) is bounded if there exists M > 0 such that |aₙ| ≤ M for all n ∈ ℕ. Equivalently, the range {aₙ : n ∈ ℕ} is a bounded subset of ℝ.

📐 Definition 3 (Monotone Sequence)

A sequence (aₙ) is increasing (or non-decreasing) if aₙ ≤ aₙ₊₁ for all n. It is decreasing (or non-increasing) if aₙ ≥ aₙ₊₁ for all n. A sequence that is either increasing or decreasing is called monotone.

Among our examples: (1/n) is decreasing, ((1 + 1/n)ⁿ) is increasing (a non-trivial fact we’ll use later), (0.9ⁿ) is decreasing, and ((-1)ⁿ/n) is neither increasing nor decreasing — it oscillates. Monotonicity is a powerful structural property: as we’ll prove shortly, any bounded monotone sequence must converge.
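These structural properties are easy to check numerically. A quick sketch with NumPy — a finite prefix stands in for the infinite sequence, so this is evidence, not proof:

```python
import numpy as np

n = np.arange(1, 1001)
sequences = {
    "1/n":         1 / n,
    "(1+1/n)^n":   (1 + 1 / n) ** n,
    "0.9^n":       0.9 ** n,
    "(-1)^n / n":  (-1.0) ** n / n,
}

for name, a in sequences.items():
    d = np.diff(a)
    monotone = bool(np.all(d >= 0)) or bool(np.all(d <= 0))
    bound = float(np.max(np.abs(a)))  # an explicit M that works on this prefix
    print(f"{name:12s}  monotone on prefix: {monotone!s:5s}  |a_n| <= {bound:.3f}")
```

All four come out bounded, and exactly the first three come out monotone, matching the discussion above.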

The ε-N Definition of Convergence

We now make the informal idea of “getting arbitrarily close” fully rigorous. This is the definition that separates calculus from hand-waving.

Building the intuition

Consider the sequence aₙ = 1/n. Intuitively, the terms 1, 1/2, 1/3, 1/4, … are “approaching 0.” But what does “approaching” mean precisely?

Here’s the key idea: pick any tolerance — say ε = 0.1. Then past n = 10, every term satisfies |aₙ - 0| = 1/n < 0.1. Pick a tighter tolerance, ε = 0.01. Then past n = 100, every term satisfies |aₙ - 0| < 0.01. No matter how small a positive ε you choose, there is always some index N past which all terms lie within ε of 0.
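This tolerance-to-index bookkeeping is mechanical enough to script. A small sketch: given ε, pick the smallest N with N > 1/ε, then spot-check that a stretch of terms past N really stays inside the band (the loop checks finitely many terms, so it illustrates rather than proves the claim):

```python
import math

def smallest_N(eps: float) -> int:
    """Smallest N with 1/n < eps for all n >= N, i.e. any N > 1/eps."""
    return math.floor(1 / eps) + 1

for eps in [0.1, 0.01, 0.001]:
    N = smallest_N(eps)
    # spot-check: the next thousand terms all lie within eps of the limit 0
    assert all(abs(1 / n - 0) < eps for n in range(N, N + 1000))
    print(f"eps = {eps:g}  ->  N = {N}")
# → N = 11, then N = 101, then N = 1001
```

Smaller ε forces larger N, exactly as the definition's quantifier order suggests.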

The formal definition

📐 Definition 4 (Convergence of a Sequence)

A sequence (aₙ) converges to L ∈ ℝ if for every ε > 0, there exists N ∈ ℕ such that

n ≥ N ⟹ |aₙ - L| < ε.

We write lim_{n→∞} aₙ = L, or (aₙ) → L, or aₙ → L as n → ∞.

A sequence that does not converge is called divergent.

Read it carefully: the quantifier structure is ∀ε > 0, ∃N ∈ ℕ: n ≥ N ⟹ |aₙ - L| < ε. The order matters: ε is chosen first (by an adversary, if you like), and then we must produce an N that works for that ε. Different values of ε may require different values of N — and smaller ε generally requires larger N.

The ε-N definition visualized for three values of ε


A complete proof from the definition

📝 Example 1 (Proof that lim 1/n = 0)

Claim: lim_{n→∞} 1/n = 0.

Proof. Let ε > 0 be given. We need to find N ∈ ℕ such that n ≥ N implies |1/n - 0| < ε.

Since |1/n - 0| = 1/n, the condition 1/n < ε is equivalent to n > 1/ε. By the Archimedean property of ℝ, there exists N ∈ ℕ with N > 1/ε.

Then for all n ≥ N:

|aₙ - 0| = 1/n ≤ 1/N < ε.

Since ε > 0 was arbitrary, we have lim_{n→∞} 1/n = 0. ∎

Notice the structure: let ε > 0 be given, choose N in terms of ε, verify the bound. Every ε-N proof follows this template. The creative part is choosing N — often this requires algebraic manipulation of the target inequality |aₙ - L| < ε to isolate n.

Two fundamental properties of convergent sequences

🔷 Proposition 1 (Uniqueness of Limits)

If (aₙ) converges, the limit is unique. That is, if aₙ → L and aₙ → M, then L = M.

Proof.

Suppose aₙ → L and aₙ → M. Let ε > 0 be given. Since aₙ → L, there exists N₁ such that n ≥ N₁ implies |aₙ - L| < ε/2. Since aₙ → M, there exists N₂ such that n ≥ N₂ implies |aₙ - M| < ε/2.

Let N = max(N₁, N₂). For any n ≥ N, by the triangle inequality:

|L - M| = |L - aₙ + aₙ - M| ≤ |aₙ - L| + |aₙ - M| < ε/2 + ε/2 = ε.

Since |L - M| < ε for every ε > 0, and |L - M| ≥ 0 is a fixed non-negative number, we must have |L - M| = 0, i.e., L = M.

This proof uses a technique you’ll see throughout analysis: to show a non-negative quantity equals zero, show it’s less than every positive number. The triangle inequality |x + y| ≤ |x| + |y| is the workhorse — it lets us split a distance into two pieces we can control separately.

🔷 Proposition 2 (Convergent Implies Bounded)

Every convergent sequence is bounded.

Proof.

Suppose aₙ → L. Apply the definition with ε = 1: there exists N such that n ≥ N implies |aₙ - L| < 1, which gives |aₙ| < |L| + 1 for all n ≥ N.

Set M = max(|a₁|, |a₂|, …, |a_{N-1}|, |L| + 1). Then |aₙ| ≤ M for all n ∈ ℕ.

The converse is false — the sequence (-1)ⁿ is bounded (by M = 1) but not convergent. Boundedness is necessary for convergence, but not sufficient. We need additional structure (like monotonicity) to guarantee convergence.

Convergence Theorems

With the definition established, we now prove three theorems that are the main tools for establishing that a sequence converges. In practice, very few limits are proved directly from the ε-N definition — instead, we use these theorems.

🔷 Theorem 1 (Monotone Convergence Theorem)

Every bounded, monotone sequence in ℝ converges.

More precisely: if (aₙ) is increasing and bounded above, then lim_{n→∞} aₙ = sup{aₙ : n ∈ ℕ}. If (aₙ) is decreasing and bounded below, then lim_{n→∞} aₙ = inf{aₙ : n ∈ ℕ}.

Proof.

We prove the increasing case; the decreasing case follows by considering (-aₙ).

Let (aₙ) be increasing and bounded above. The set S = {aₙ : n ∈ ℕ} is non-empty and bounded above, so by the completeness axiom (least upper bound property of ℝ), s = sup S exists.

We claim aₙ → s. Let ε > 0. Since s is the least upper bound of S, the number s - ε is not an upper bound. Therefore, there exists N ∈ ℕ such that a_N > s - ε.

Since (aₙ) is increasing, for all n ≥ N:

s - ε < a_N ≤ aₙ ≤ s < s + ε.

The first inequality uses our choice of N, the second uses n ≥ N with monotonicity, the third uses s = sup S. Together: |aₙ - s| < ε for all n ≥ N.

This proof reveals something deep: the Monotone Convergence Theorem is really a theorem about the completeness of ℝ. In the rational numbers ℚ, the sequence 1, 1.4, 1.41, 1.414, … (decimal approximations to √2) is increasing and bounded above, but it does not converge in ℚ — the limit √2 is irrational. The real numbers were constructed to fill these gaps.

Why this matters for ML: Most gradient descent convergence proofs follow this pattern. If you can show that the loss sequence f(θ₁), f(θ₂), … is decreasing (which requires a sufficiently small step size) and bounded below (e.g., f ≥ 0), the Monotone Convergence Theorem guarantees that the loss converges to some value. Whether that value is the global minimum is a separate question.
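Here is that pattern in miniature — a sketch on the toy loss f(θ) = θ² (our choice, not a proof for general losses): with a small enough step size the loss sequence is decreasing and bounded below by 0, so the Monotone Convergence Theorem applies.

```python
import numpy as np

f = lambda theta: theta ** 2          # loss, bounded below by 0
grad = lambda theta: 2 * theta

theta, eta = 5.0, 0.1                 # step size small enough for descent
losses = []
for _ in range(200):
    losses.append(f(theta))
    theta -= eta * grad(theta)

losses = np.array(losses)
assert np.all(np.diff(losses) <= 0)   # decreasing ...
assert np.all(losses >= 0)            # ... and bounded below,
print(f"so the loss converges; here it settles near {losses[-1]:.1e}")
```

The two assertions are exactly the two hypotheses of the theorem; the conclusion (convergence) is then guaranteed, even before we identify the limit.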

🔷 Theorem 2 (Squeeze Theorem)

Let (aₙ), (bₙ), and (cₙ) be sequences satisfying aₙ ≤ bₙ ≤ cₙ for all n (or for all n past some index). If lim_{n→∞} aₙ = lim_{n→∞} cₙ = L, then lim_{n→∞} bₙ = L.

Proof.

Let ε > 0. Since aₙ → L, there exists N₁ such that n ≥ N₁ implies |aₙ - L| < ε, i.e., L - ε < aₙ. Since cₙ → L, there exists N₂ such that n ≥ N₂ implies |cₙ - L| < ε, i.e., cₙ < L + ε.

Let N = max(N₁, N₂). For n ≥ N:

L - ε < aₙ ≤ bₙ ≤ cₙ < L + ε,

so |bₙ - L| < ε.

The Squeeze Theorem is how we prove convergence rates: if we can trap the error |aₙ - L| between zero and a sequence that decays at a known rate, the Squeeze Theorem tells us |aₙ - L| decays at least that fast.
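A numeric illustration of that rate argument (checked on a finite prefix): the oscillating sequence bₙ = (-1)ⁿ/n is trapped between aₙ = -1/n and cₙ = 1/n, so |bₙ - 0| inherits the O(1/n) decay of the bounds.

```python
import numpy as np

n = np.arange(1, 501)
b = (-1.0) ** n / n            # oscillating sequence with limit 0
a, c = -1 / n, 1 / n           # both bounds converge to 0

assert np.all((a <= b) & (b <= c))     # squeeze hypothesis a_n <= b_n <= c_n
assert np.all(np.abs(b - 0) <= 1 / n)  # error decays at least like 1/n
print("squeeze hypothesis and O(1/n) error bound hold on the prefix")
```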

🔷 Theorem 3 (Algebra of Limits)

If aₙ → A and bₙ → B, then:

  1. (aₙ + bₙ) → A + B
  2. (aₙ · bₙ) → A · B
  3. (aₙ/bₙ) → A/B, provided B ≠ 0
  4. (c · aₙ) → c · A for any constant c ∈ ℝ

Proof.

We prove part (2), the product rule, which is the most instructive case.

Since aₙ → A, the sequence (aₙ) is bounded (Proposition 2): there exists M > 0 with |aₙ| ≤ M for all n.

Let ε > 0. Choose N₁ such that n ≥ N₁ implies |aₙ - A| < ε/(2|B| + 1). Choose N₂ such that n ≥ N₂ implies |bₙ - B| < ε/(2M).

For n ≥ max(N₁, N₂):

|aₙbₙ - AB| = |aₙbₙ - aₙB + aₙB - AB| ≤ |aₙ||bₙ - B| + |B||aₙ - A|

< M · ε/(2M) + |B| · ε/(2|B| + 1) < ε/2 + ε/2 = ε.

The key trick: we added and subtracted aₙB to create two pieces — one controlled by |bₙ - B| (using the bound on aₙ) and one controlled by |aₙ - A| (using the known value of B).

💡 Remark 1

These three theorems are exactly the tools that gradient descent convergence proofs use. Monotone Convergence handles the “loss is decreasing and bounded below” argument. Squeeze establishes convergence rates by bounding the error between known decay rates. Algebra of Limits lets us decompose complex update rules into simpler pieces whose limits we already know.

Convergence theorems visualized

Subsequences & Bolzano-Weierstrass

Sometimes a sequence doesn’t converge, but parts of it do. This observation leads to one of the most important theorems in analysis.

📐 Definition 5 (Subsequence)

A subsequence of (aₙ) is a sequence (a_{nₖ})_{k=1}^∞ where n₁ < n₂ < n₃ < ⋯ is a strictly increasing sequence of natural numbers.

For example, the even-indexed terms a₂, a₄, a₆, … form a subsequence. So do the terms at prime indices a₂, a₃, a₅, a₇, a₁₁, …. The key constraint is that the indices nₖ must be strictly increasing — we must move forward through the original sequence.

🔷 Proposition 3 (Subsequences Inherit Limits)

If (aₙ) → L, then every subsequence (a_{nₖ}) → L.

Proof.

Let ε > 0. Since aₙ → L, there exists N such that n ≥ N implies |aₙ - L| < ε.

Since n₁ < n₂ < n₃ < ⋯ is strictly increasing with nₖ ∈ ℕ, a simple induction shows nₖ ≥ k for all k. Therefore, for k ≥ N, we have nₖ ≥ k ≥ N, so |a_{nₖ} - L| < ε.

The contrapositive gives a powerful divergence test: if two subsequences converge to different limits, the original sequence diverges. For (-1)ⁿ, the even terms (a_{2k}) = (1, 1, 1, …) → 1 and the odd terms (a_{2k+1}) = (-1, -1, -1, …) → -1. Since the subsequential limits disagree, (-1)ⁿ diverges.

🔷 Theorem 4 (Bolzano-Weierstrass Theorem)

Every bounded sequence in ℝ has a convergent subsequence.

Proof.

Let (aₙ) be bounded, so aₙ ∈ [c, d] for some interval [c, d]. We construct a convergent subsequence by repeated bisection.

Step 1. Bisect [c, d] into [c, m] and [m, d] where m = (c + d)/2. At least one half contains infinitely many terms of (aₙ) (if both halves contained only finitely many, the total would be finite, contradicting the fact that the sequence has infinitely many terms). Choose the half [c₁, d₁] containing infinitely many terms. Pick n₁ such that a_{n₁} ∈ [c₁, d₁].

Step 2. Bisect [c₁, d₁] into two halves. Again, at least one half contains infinitely many terms of (aₙ). Choose that half [c₂, d₂] and pick n₂ > n₁ such that a_{n₂} ∈ [c₂, d₂].

Continuing inductively: At step k, we have an interval [cₖ, dₖ] of length (d - c)/2ᵏ containing infinitely many terms, and an index nₖ with a_{nₖ} ∈ [cₖ, dₖ].

Convergence. The nested intervals [cₖ, dₖ] satisfy c₁ ≤ c₂ ≤ ⋯ (increasing left endpoints) and d₁ ≥ d₂ ≥ ⋯ (decreasing right endpoints), and dₖ - cₖ = (d - c)/2ᵏ → 0. By the Monotone Convergence Theorem, both (cₖ) and (dₖ) converge, and they must converge to the same limit L (since their difference tends to 0).

Since cₖ ≤ a_{nₖ} ≤ dₖ for all k, the Squeeze Theorem gives a_{nₖ} → L.
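The bisection in this proof is completely constructive, so it can be run. A small sketch on the bounded, divergent sequence (-1)ⁿ (finite prefix; the helper is ours): each pass halves the interval, keeps a half that still contains later terms, and picks one term from it — here the procedure latches onto the constant subsequence at -1.

```python
import numpy as np

def bw_subsequence(a, steps=30):
    """Extract indices of a convergent subsequence by repeated bisection."""
    c, d = float(np.min(a)), float(np.max(a))
    idx, last = [], -1
    for _ in range(steps):
        m = (c + d) / 2
        left  = [i for i in range(last + 1, len(a)) if c <= a[i] <= m]
        right = [i for i in range(last + 1, len(a)) if m <= a[i] <= d]
        if left and len(left) >= len(right):   # keep a half with terms remaining
            d, last = m, left[0]
        elif right:
            c, last = m, right[0]
        else:
            break                              # finite prefix exhausted
        idx.append(last)
    return idx

a = (-1.0) ** np.arange(1, 201)   # bounded by 1, divergent
idx = bw_subsequence(a)
sub = a[idx]
print(idx[:5], sub[:5])           # strictly increasing indices; terms all -1
```

On an infinite sequence the "infinitely many terms" argument guarantees the chosen half is never empty; on a finite prefix the loop simply stops when it runs out, which is why this is a demonstration rather than the proof itself.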

Bolzano-Weierstrass: extracting a convergent subsequence by bisection

💡 Remark 2

Bolzano-Weierstrass is the one-dimensional version of the compactness argument that guarantees the existence of minimizers in optimization. Here’s the connection: if f: K → ℝ is continuous and K ⊂ ℝᵈ is closed and bounded (compact), then f attains its minimum on K. The proof works by taking a sequence (xₙ) in K with f(xₙ) → inf_K f, extracting a convergent subsequence by Bolzano-Weierstrass (generalized to ℝᵈ), and using continuity to show the subsequential limit is a minimizer.

In machine learning, this is why we often restrict the parameter space (weight decay, norm constraints, compact domains) — compactness guarantees that minimizers exist.

Cauchy Sequences & Completeness

The ε-N definition requires us to know the limit L in advance. But in practice, we often want to prove convergence without knowing the limit — for instance, we want to show that gradient descent converges to some critical point without identifying which one. The Cauchy criterion solves this problem.

📐 Definition 6 (Cauchy Sequence)

A sequence (aₙ) is a Cauchy sequence if for every ε > 0, there exists N ∈ ℕ such that

m, n ≥ N ⟹ |aₘ - aₙ| < ε.

In words: the terms of the sequence get arbitrarily close to each other, not just to some external limit.

Compare with the convergence definition: convergence says terms get close to L; Cauchy says terms get close to each other. The Cauchy condition is intrinsic — it refers only to the sequence itself.

🔷 Proposition 4 (Convergent Implies Cauchy)

Every convergent sequence is Cauchy.

Proof.

Suppose aₙ → L. Let ε > 0. There exists N such that n ≥ N implies |aₙ - L| < ε/2. For m, n ≥ N:

|aₘ - aₙ| = |aₘ - L + L - aₙ| ≤ |aₘ - L| + |aₙ - L| < ε/2 + ε/2 = ε.

The deep question is the converse: is every Cauchy sequence convergent? In ℚ, the answer is no — the decimal approximations to √2 form a Cauchy sequence in ℚ that does not converge in ℚ. But in ℝ, the answer is yes, and this is the whole point of the real number system.

🔷 Theorem 5 (Completeness of ℝ)

Every Cauchy sequence in ℝ converges. Equivalently, ℝ is a complete metric space.

Proof.

Let (aₙ) be Cauchy in ℝ.

Step 1: (aₙ) is bounded. Apply the Cauchy condition with ε = 1: there exists N such that m, n ≥ N implies |aₘ - aₙ| < 1. In particular, |aₙ| < |a_N| + 1 for all n ≥ N. Setting M = max(|a₁|, …, |a_{N-1}|, |a_N| + 1), we get |aₙ| ≤ M for all n.

Step 2: Extract a convergent subsequence. Since (aₙ) is bounded, Bolzano-Weierstrass gives a subsequence (a_{nₖ}) with a_{nₖ} → L for some L ∈ ℝ.

Step 3: The full sequence converges to L. Let ε > 0. Since (aₙ) is Cauchy, there exists N₁ such that m, n ≥ N₁ implies |aₘ - aₙ| < ε/2. Since a_{nₖ} → L, there exists K such that k ≥ K implies |a_{nₖ} - L| < ε/2.

Choose k₀ ≥ K with n_{k₀} ≥ N₁ (possible since nₖ → ∞). For any n ≥ N₁:

|aₙ - L| ≤ |aₙ - a_{n_{k₀}}| + |a_{n_{k₀}} - L| < ε/2 + ε/2 = ε.

The first term is controlled by the Cauchy condition (both n and n_{k₀} are ≥ N₁), and the second by the subsequential convergence.

Cauchy sequences: convergent (Σ1/k², with max |aₘ − aₙ| ≈ 0.032 for m, n ≥ 20) vs non-convergent (harmonic series, with max |aₘ − aₙ| ≈ 1.066 for m, n ≥ 20)

💡 Remark 3

The Cauchy criterion is the tool of choice when you don’t know the limit. In optimization, we often prove convergence by showing that successive iterates satisfy ‖θₜ₊₁ - θₜ‖ → 0 — this is close to (though not identical to) the Cauchy condition. The completeness of ℝ (or ℝᵈ) then guarantees that the iterates converge to some point, even though we may not have a closed-form expression for it.

This is also why completeness matters for function spaces in machine learning. When we optimize over a function class ℱ, the class needs to be complete (or at least have compact sublevel sets) to guarantee that minimizing sequences converge to actual minimizers within ℱ, not to some function outside it.

Rates of Convergence

Knowing that a sequence converges is important, but often not enough. In practice, we need to know how fast it converges — the difference between O(1/n) and O(rⁿ) convergence is the difference between an algorithm that takes hours and one that takes seconds.

📐 Definition 7 (Rates of Convergence)

Let (aₙ) → L with errors eₙ = |aₙ - L|. The rate of convergence is classified as:

  • Sublinear: eₙ = O(1/nᵖ) for some p > 0 (errors decay polynomially).
  • Linear: eₙ₊₁ ≤ r · eₙ for some 0 < r < 1 (errors decay by a constant factor each step).
  • Superlinear: eₙ₊₁/eₙ → 0 (the ratio of successive errors tends to zero).
  • Quadratic: eₙ₊₁ ≤ C eₙ² for some C > 0 (errors are squared each step — extremely fast).

These rates form a strict hierarchy: quadratic ⟹ superlinear ⟹ linear ⟹ sublinear (in terms of speed). The practical implications are dramatic:

📝 Example 2 (SGD on Convex Functions — Sublinear Rate)

Stochastic gradient descent on a convex function with step size ηₜ = c/√t achieves 𝔼[f(θ̄_T)] - f* = O(1/√T). This is sublinear: halving the error requires quadrupling the number of iterations. For SGD to reach ε-accuracy, we need T = O(1/ε²) steps.

📝 Example 3 (GD on Strongly Convex Functions — Linear Rate)

Gradient descent on a μ-strongly convex, L-smooth function with step size η = 2/(μ + L) achieves f(θₜ) - f* ≤ rᵗ(f(θ₀) - f*) where r = (L - μ)/(L + μ) < 1. This is linear (also called “geometric” or “exponential”) convergence: each step reduces the error by the constant factor r. The condition number κ = L/μ determines how fast — r = (κ - 1)/(κ + 1), so well-conditioned problems (κ small) converge faster.
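This rate is visible numerically. A minimal sketch on the quadratic f(x) = ½xᵀAx with A = diag(μ, L) (our choice of test problem; the minimizer is x* = 0): with η = 2/(μ + L), each gradient step multiplies the two coordinates by (1 - ημ) and (1 - ηL), i.e. by ±(L - μ)/(L + μ), so the observed error ratio matches the predicted r.

```python
import numpy as np

mu, L = 1.0, 10.0                      # strong convexity / smoothness constants
A = np.diag([mu, L])                   # f(x) = 0.5 x^T A x, grad f(x) = A x
eta = 2 / (mu + L)
r = (L - mu) / (L + mu)                # predicted linear rate, here 9/11

x = np.array([1.0, 1.0])
errs = []
for _ in range(60):
    errs.append(np.linalg.norm(x))     # distance to the minimizer x* = 0
    x = x - eta * (A @ x)

ratios = np.array(errs[1:]) / np.array(errs[:-1])
print(f"predicted r = {r:.4f}, observed error ratio = {ratios[-1]:.4f}")
```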

📝 Example 4 (Newton's Method — Quadratic Rate)

Newton’s method for solving f′(x) = 0 near a root achieves |xₜ₊₁ - x*| ≤ C|xₜ - x*|² once the iterates are sufficiently close to x*. This is quadratic: if |xₜ - x*| = 10⁻³, then |xₜ₊₁ - x*| ≈ 10⁻⁶, and |xₜ₊₂ - x*| ≈ 10⁻¹². The number of correct digits doubles each step.
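The digit-doubling is easy to watch. A sketch using Newton root-finding on g(x) = x² - 2 (a stand-in example of ours; the root is √2, and the same mechanics apply to solving f′(x) = 0):

```python
import math

g  = lambda x: x * x - 2       # solve g(x) = 0; the root is sqrt(2)
dg = lambda x: 2 * x

x, root = 1.5, math.sqrt(2)
errors = []
for _ in range(4):
    errors.append(abs(x - root))
    x = x - g(x) / dg(x)       # Newton step

for e0, e1 in zip(errors, errors[1:]):
    # quadratic convergence: e_{t+1} ≈ C e_t^2, here with C ≈ 1/(2 sqrt(2))
    print(f"{e0:.3e} -> {e1:.3e}   e1/e0^2 = {e1 / e0**2:.3f}")
```

The printed errors drop roughly as 10⁻² → 10⁻³ → 10⁻⁶ → 10⁻¹², while the ratio e₁/e₀² stays near a constant — the signature of a quadratic rate.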

Convergence rates on a log scale

Connections to ML

Everything we’ve developed — the ε-N definition, the convergence theorems, the Cauchy criterion, convergence rates — is the mathematical engine behind optimization and learning theory in machine learning.

Gradient descent as a convergent sequence

Consider minimizing f(x) = x². Gradient descent with step size η produces the update xₜ₊₁ = xₜ - η · 2xₜ = (1 - 2η)xₜ. This is a geometric sequence:

xₜ = (1 - 2η)ᵗ · x₀.

For 0 < η < 1, we have |1 - 2η| < 1, so xₜ → 0 by the geometric argument — the errors decay linearly with rate r = |1 - 2η|. For η = 1 the iterates oscillate between ±x₀ without converging, and for η > 1 the sequence diverges.


Gradient descent on f(x) = x² for three learning rates

This is why the learning rate matters: too small and convergence is glacially slow (r close to 1), too large and the sequence diverges, and there’s an optimal η that minimizes r. For f(x) = x², the optimal is η = 1/2, giving r = 0 — convergence in one step.

Learning rate schedules & Robbins-Monro

For stochastic gradient descent, a fixed learning rate doesn’t work — the noise prevents convergence to a point. Instead, we need a decaying schedule η₁, η₂, η₃, … satisfying the Robbins-Monro conditions:

Σ_{t=1}^∞ ηₜ = ∞   and   Σ_{t=1}^∞ ηₜ² < ∞.

The first condition (Σηₜ = ∞) ensures the steps are large enough to reach the optimum from any starting point. The second (Σηₜ² < ∞) ensures the accumulated noise variance is finite, so the iterates settle down. Schedules satisfying both: ηₜ = c/t, and more generally ηₜ = c/tᵖ for any 1/2 < p ≤ 1. (Note that ηₜ = c/√t satisfies the first condition but not the second, since Σηₜ² is then the divergent harmonic series.)
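Partial sums make the two conditions concrete. A sketch (finite sums up to T can only suggest divergence or convergence, not prove it): for ηₜ = 1/t the first sum grows like log T while the sum of squares stalls near π²/6; for ηₜ = 1/√t the squared schedule is the harmonic series, so the second condition fails.

```python
import numpy as np

T = 10 ** 6
t = np.arange(1, T + 1, dtype=float)

schedules = {
    "1/t":       1 / t,        # satisfies both Robbins-Monro conditions
    "1/t^0.6":   t ** -0.6,    # satisfies both (any 1/2 < p <= 1 does)
    "1/sqrt(t)": t ** -0.5,    # fails the second: sum of squares is harmonic
}

for name, eta in schedules.items():
    print(f"{name:9s}  sum eta_t up to T = {eta.sum():9.1f}   "
          f"sum eta_t^2 up to T = {(eta ** 2).sum():7.3f}")
```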

These are series convergence conditions — statements about infinite sums. We’ll formalize series in Series Convergence & Tests (coming soon), but the key point is that learning rate theory rests on the sequence and series machinery we’ve built here.

Learning rate schedules with Robbins-Monro conditions

MCMC mixing

Markov chain Monte Carlo (MCMC) sampling produces a sequence of states X₁, X₂, X₃, … that converges in distribution to a target distribution π. The mixing time — how many steps until the chain is “ε-close” to π — is a convergence rate question. Fast-mixing chains have geometric (linear) convergence, while slowly-mixing chains may need O(n²) or worse steps. The formal framework of sequence convergence underlies the entire theory. See formalML for the full treatment.

Empirical risk convergence

Given n training examples, the empirical risk R̂ₙ(h) = (1/n) Σᵢ₌₁ⁿ ℓ(h(xᵢ), yᵢ) defines a sequence indexed by n. The law of large numbers tells us R̂ₙ(h) → R(h) (the true risk) as n → ∞ — this is a sequence convergence result. The uniform version — sup_{h∈ℋ} |R̂ₙ(h) - R(h)| → 0 — is the Glivenko-Cantelli theorem, which is the foundation of PAC learning. See formalML for how this generalizes to function classes and VC dimension.
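A quick simulation of this convergence (a toy setup of ours, not from the text: the constant predictor h(x) = 0 under squared loss with targets y ~ N(1, 1), so the true risk is E[y²] = 2):

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_risk(n):
    """R-hat_n for h(x) = 0 under squared loss on n samples y ~ N(1, 1)."""
    y = rng.normal(loc=1.0, scale=1.0, size=n)
    return np.mean((0.0 - y) ** 2)

true_risk = 2.0                        # R(h) = E[y^2] = Var(y) + E[y]^2 = 1 + 1
for n in [10, 1_000, 100_000]:
    gap = abs(empirical_risk(n) - true_risk)
    print(f"n = {n:6d}   |R-hat_n - R| = {gap:.4f}")  # typically shrinks (LLN)
```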

Computational Notes

Here are the key computational patterns for working with sequences numerically.

Generating and testing sequences with NumPy:

import numpy as np

# Generate sequence terms
n = np.arange(1, 201)
a_n = 1 / n                          # decaying
b_n = (1 + 1/n)**n                   # approaching e
c_n = np.cumprod(np.full(200, 0.9))  # geometric decay

Numerical Cauchy criterion check:

def check_cauchy(seq, N, window=50):
    """Check max |a_m - a_n| for m, n >= N."""
    tail = seq[N:N+window]
    max_gap = np.max(np.abs(tail[:, None] - tail[None, :]))
    return max_gap

# Cauchy: partial sums of 1/k^2
partial_sums = np.cumsum(1 / n**2)
print(f"Cauchy gap at N=100: {check_cauchy(partial_sums, 100):.6f}")

# Not Cauchy: harmonic series
harmonic = np.cumsum(1 / n)
print(f"Harmonic gap at N=100: {check_cauchy(harmonic, 100):.4f}")

Estimating convergence rate from iterates:

def estimate_rate(errors):
    """Estimate convergence rate from error sequence."""
    ratios = errors[1:] / errors[:-1]
    avg_ratio = np.mean(ratios[-10:])  # use tail for stability
    if avg_ratio < 0.01:
        return 'superlinear'
    elif avg_ratio < 0.95:
        return f'linear (r ≈ {avg_ratio:.3f})'
    else:
        return 'sublinear'

# GD on f(x) = x^2 with eta = 0.3
x = 4.0
errors = []
for _ in range(50):
    x = x - 0.3 * 2 * x  # GD update
    errors.append(abs(x))

print(estimate_rate(np.array(errors)))
# → "linear (r ≈ 0.400)"

Extracting iteration sequences from SciPy:

from scipy.optimize import minimize

iterates = []
def callback(xk):
    iterates.append(xk.copy())

result = minimize(lambda x: x[0]**2 + x[1]**2, x0=[3.0, 4.0],
                  method='CG', callback=callback)
# iterates now contains the full optimization path

Connections & Further Reading

Where this leads within formalCalculus

  • Epsilon-Delta & Continuity (coming soon) extends the ε-N framework to function limits and continuity. The ε-δ definition is the function-level analogue of what we did here for sequences.
  • The Riemann Integral & FTC (coming soon) defines the integral as a limit of Riemann sums — each sum is a term in a sequence, and convergence of that sequence is the integral.
  • Series Convergence & Tests (coming soon) treats partial sums of infinite series as sequences. The Robbins-Monro conditions introduced above are formalized as series convergence criteria.
  • Completeness & Compactness (coming soon) generalizes the Bolzano-Weierstrass theorem from ℝ to ℝⁿ and introduces the Heine-Borel theorem — the compactness machinery that guarantees the existence of minimizers on closed bounded sets.

Where this leads in machine learning

The convergence theory developed here is the direct foundation for:

  • formalML Gradient Descent — convergence rates, step size conditions, and strong convexity.
  • formalML PAC Learning — uniform convergence of empirical processes.
  • formalML Random Walks & MCMC — mixing time as a convergence rate.
  • formalML Concentration Inequalities — quantitative convergence bounds.
  • formalML Measure-Theoretic Probability — modes of convergence (a.s., in probability, in distribution).

References

  • Abbott (2015). Understanding Analysis. Chapters 2–3. The gold standard for rigorous-but-accessible treatment of sequences and limits.
  • Rudin (1976). Principles of Mathematical Analysis. Chapter 3. The definitive reference, terse but complete.
  • Tao (2016). Analysis I. Chapters 5–6. Builds the real numbers from Cauchy sequences — ideal for seeing the foundations from scratch.
  • Boyd & Vandenberghe (2004). Convex Optimization. Sections 9.2–9.3. Gradient descent convergence rates as direct applications.
  • Robbins & Monro (1951). “A Stochastic Approximation Method.” The original learning rate conditions for SGD convergence.