Error, loss functions, and why they are needed

Case 1. MSE and the cost of a big miss

Imagine a service that estimates apartment prices. Nothing too fancy: we input the square footage and get a predicted price. This is a classic regression task, and MSE is almost the default choice of loss function. But this is exactly where we can clearly see the price we pay for that choice.

We can implement MSE in just a few lines of code, without any libraries, and then ruin the whole picture with a single data point. Suppose our dataset now contains a strange apartment: it could be a data error, a unique property, or simply a very bad prediction.
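The example in this section loads `mse()` from a `code.php` file whose contents are not shown. A minimal sketch of what that function might look like, assuming plain PHP lists of numbers as input (the body and the error handling are assumptions, not the original file):

```php
<?php

// Assumed implementation of mse(); the original code.php is not shown.
// Mean squared error: the average of (actual - predicted)^2 over all points.
function mse(array $y, array $yHat): float
{
    $n = count($y);

    // Assumption: reject empty or mismatched inputs instead of returning a misleading number.
    if ($n === 0 || $n !== count($yHat)) {
        throw new InvalidArgumentException('Arrays must be non-empty and of equal length.');
    }

    $sum = 0.0;
    for ($i = 0; $i < $n; $i++) {
        $diff = $y[$i] - $yHat[$i];
        $sum += $diff * $diff; // squaring makes large misses dominate the sum
    }

    return $sum / $n;
}
```

Each error is squared before averaging, so a miss that is ten times larger contributes a hundred times more to the result.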

Example of use

<?php

require_once __DIR__ . '/code.php';

// Scenario 1. Data without an outlier
// The model predicts prices reasonably well, the errors are moderate.
$y = [100, 120, 130, 115, 125];
$yHat = [102, 118, 128, 117, 123];

// Compute MSE for "normal" data without strong anomalies.
echo 'Normal MSE: ' . mse($y, $yHat) . PHP_EOL;

// Scenario 2. Add an outlier (an anomalous point).
// The dataset now contains one very unusual apartment: either a data error or a genuinely atypical property.
$y[] = 300;
$yHat[] = 130;

// Compute MSE again. A single outlier drastically increases the average error, showing how sensitive MSE is to large mistakes.
echo 'MSE with outlier: ' . mse($y, $yHat) . PHP_EOL;

Result:

Normal MSE: 4
MSE with outlier: 4820

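On the normal data the MSE is $4$: every prediction is off by exactly 2, so each squared error is $2^2 = 4$.
After adding the outlier, the MSE jumps to $4820$.
This happens because of a single term in the sum, the one for the outlier: $(300 - 130)^2 = 170^2 = 28900$. The total becomes $5 \cdot 4 + 28900 = 28920$, and $28920 / 6 = 4820$.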
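To see how lopsided the sum is, it can help to print each point's squared error next to its share of the total. The sketch below reuses the data from the example; the helper name `squaredErrors()` is hypothetical and introduced only for illustration:

```php
<?php

// Hypothetical helper: returns the squared error of every point.
function squaredErrors(array $y, array $yHat): array
{
    return array_map(
        fn ($actual, $predicted) => ($actual - $predicted) ** 2,
        $y,
        $yHat
    );
}

// Same data as in the example, including the outlier.
$y    = [100, 120, 130, 115, 125, 300];
$yHat = [102, 118, 128, 117, 123, 130];

$squared = squaredErrors($y, $yHat);
$total   = array_sum($squared);

// Print each term and the percentage of the total it is responsible for.
foreach ($squared as $i => $value) {
    printf(
        "Point %d: squared error = %d, share of total = %.2f%%\n",
        $i + 1,
        $value,
        100 * $value / $total
    );
}

echo 'MSE: ' . ($total / count($squared)) . PHP_EOL;
```

The five ordinary points contribute 20 to the sum in total, while the outlier alone contributes 28900, which is more than 99.9% of it. The MSE therefore mostly describes that one apartment.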