Gradient descent on fingers
Example 3. Plateau and near-zero gradient
In this example we look at what happens to gradient descent when the gradient formally exists but its value becomes extremely small.
To isolate the effect, we deliberately move away from real data and use an artificial loss function with a flat region (a plateau) near the minimum.
Goal: understand why training can look "stuck" (the parameter barely moves) even though updates are still being applied on every step.
Let’s take the loss function:
$$L(w) = (w - 3)^4$$
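Gradient descent updates $w$ by the usual rule (here $\eta$ is the learning rate), which for this loss becomes:

$$w_{t+1} = w_t - \eta \, L'(w_t) = w_t - 4\eta\,(w_t - 3)^3$$

Because the derivative is cubic in the distance to the minimum, it collapses much faster than the distance itself: at $|w - 3| = 0.01$ the gradient magnitude is only $4 \cdot (0.01)^3 = 4 \cdot 10^{-6}$, so each step moves $w$ by a vanishingly small amount.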
Example code:
<?php
$w = 0.0; // Parameter we optimize.
$lr = 0.05; // Learning rate (step size).
echo "epoch\tw\t\tgradient\tloss\n";
for ($epoch = 1; $epoch <= 25; $epoch++) {
    // Loss function with a very flat region near the minimum.
    $loss = ($w - 3) ** 4;
    // Derivative: d/dw (w - 3)^4 = 4 (w - 3)^3.
    $gradient = 4 * ($w - 3) ** 3;
    // Print the current state so the table header above gets its rows.
    printf("%d\t%.4f\t%.6f\t%.6f\n", $epoch, $w, $gradient, $loss);
    // Gradient descent update.
    $w -= $lr * $gradient;
}
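To see why the plateau is special, it helps to compare the quartic loss against an ordinary quadratic one at the same distance from the minimum. The snippet below is a small illustrative sketch (the quadratic $(w - 3)^2$ is introduced here purely for comparison, it is not part of the example above):

```php
<?php
// Gradient magnitude near the minimum w* = 3 for two losses:
//   quartic   L(w) = (w - 3)^4  ->  L'(w) = 4 (w - 3)^3
//   quadratic L(w) = (w - 3)^2  ->  L'(w) = 2 (w - 3)
foreach ([1.0, 0.1, 0.01] as $d) {
    $w = 3 - $d; // a point at distance $d from the minimum
    $gQuartic   = 4 * ($w - 3) ** 3;
    $gQuadratic = 2 * ($w - 3);
    printf("distance %.2f: quartic grad %.6f, quadratic grad %.6f\n",
        $d, $gQuartic, $gQuadratic);
}
```

At distance 0.01 the quartic gradient is about $4 \cdot 10^{-6}$, thousands of times smaller than the quadratic one ($-0.02$): the quadratic loss keeps pulling the parameter in at a healthy rate, while on the quartic plateau the updates shrink to a crawl.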