Linear regression as a basic model

Case 4. Estimating the likely customer check

In this example we use Ridge regression from Rubix ML. The model is trained on a small dataset, predicts log(check), and the result is converted back into a money amount with exp().

<?php

use Rubix\ML\Datasets\Labeled;
use Rubix\ML\Datasets\Unlabeled;
use Rubix\ML\Regressors\Ridge;

// Features: visits, time_on_site_seconds, pageviews, discount_percent
$samples = [
    [3, 420, 5, 0],
    [10, 1800, 20, 10],
    [1, 120, 2, 0],
    [7, 900, 12, 5],
];

// Target: log(check_amount)
$labels = [
    log(3500),
    log(12000),
    log(1800),
    log(7200),
];

$dataset = Labeled::build($samples, $labels);

// Ridge regression (L2 regularization)
// With a tiny alpha (1e-6) it behaves almost like ordinary least squares.
$model = new Ridge(1e-6);
$model->train($dataset);

// Customer features for prediction
// [visits, time_on_site_seconds, pageviews, discount_percent]
$customer = [5, 600, 8, 5];

$unlabeled = new Unlabeled([$customer]);
$logPrice = $model->predict($unlabeled);
$predictedPrice = exp($logPrice[0]);

$weights = $model->coefficients();
$bias = $model->bias();

echo 'Predicted check: ' . round($predictedPrice, 2) . PHP_EOL;
echo 'Predicted log(check): ' . round($logPrice[0], 6) . PHP_EOL . PHP_EOL;

echo 'Coefficients (feature weights):' . PHP_EOL;
echo '0 - visits, 1 - time_on_site_seconds, 2 - pageviews, 3 - discount_percent' . PHP_EOL;
print_r($weights);

echo PHP_EOL . 'Bias (intercept): ' . $bias . PHP_EOL;

Result: Memory: 0.994 Mb, Time running: 0.012 sec.
Predicted check: 3282.17
Predicted log(check): 8.096259

Coefficients (feature weights):
0 - visits, 1 - time_on_site_seconds, 2 - pageviews, 3 - discount_percent
Array
(
    [0] => 0.19882501289248
    [1] => -0.00031492511334363
    [2] => 0.12060119584203
    [3] => -0.15340525191277
)

Bias (intercept): 7.0933057013899
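
As a sanity check, the prediction can be reproduced by hand: in log-space the model is just a dot product of the features with the weights plus the bias, and exp() maps it back to money. A minimal sketch in plain PHP, with the weights and bias copied from the output above:

```php
<?php

// Coefficients and bias as printed by the trained model above
$weights = [0.19882501289248, -0.00031492511334363, 0.12060119584203, -0.15340525191277];
$bias = 7.0933057013899;

// The same customer: [visits, time_on_site_seconds, pageviews, discount_percent]
$customer = [5, 600, 8, 5];

// Linear model in log-space: log(check) = w · x + b
$logCheck = $bias;
foreach ($customer as $i => $feature) {
    $logCheck += $weights[$i] * $feature;
}

echo 'log(check) = ' . round($logCheck, 6) . PHP_EOL; // 8.096259
echo 'check = ' . round(exp($logCheck), 2) . PHP_EOL; // 3282.17
```

Both numbers match the model's output, which confirms that Ridge here is an ordinary linear model applied to log-transformed targets.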

Why predicting log(check) is a useful trick:

  • checks often have a heavy tail; log() reduces the impact of rare large purchases;
  • in log-space, the model often better matches multiplicative effects (e.g. "twice as many pageviews");
  • exp() guarantees a positive predicted check;
  • Ridge adds L2 regularization and helps reduce overfitting on small datasets.
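
The second and third points can be seen directly in code: log() turns a multiplicative relationship into an additive one, and exp() of any real number is strictly positive. A small self-contained illustration:

```php
<?php

// log() turns multiplication into addition:
// log(2 * x) = log(2) + log(x), so "twice as many pageviews"
// becomes a constant shift in log-space.
$x = 1800.0;
var_dump(abs(log(2 * $x) - (log(2) + log($x))) < 1e-12); // bool(true)

// exp() of any real-valued model output is strictly positive,
// so the predicted check can never drop below zero ...
$veryNegativeLogPrediction = -50.0;
var_dump(exp($veryNegativeLogPrediction) > 0); // bool(true)

// ... while a linear model fit on raw money amounts can
// happily predict a negative check for an unusual customer.
```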

In a real product you would expand the dataset and feature set (traffic source, category interest, purchase history), but even this baseline can be a good starting point.
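
To see what the L2 penalty in Ridge actually does, here is a one-dimensional sketch in plain PHP (no library needed). For a single feature with no intercept, the ridge solution has the closed form w = Σxy / (Σx² + α): as α grows, the denominator grows and the weight shrinks toward zero. The data below is made up for the illustration.

```php
<?php

// One-feature, no-intercept ridge regression in closed form:
// w = sum(x * y) / (sum(x^2) + alpha)
function ridgeWeight(array $x, array $y, float $alpha): float
{
    $xy = 0.0;
    $xx = 0.0;
    foreach ($x as $i => $xi) {
        $xy += $xi * $y[$i];
        $xx += $xi * $xi;
    }

    return $xy / ($xx + $alpha);
}

$x = [1.0, 2.0, 3.0, 4.0];
$y = [2.1, 3.9, 6.2, 7.8]; // roughly y = 2x

echo ridgeWeight($x, $y, 1e-6) . PHP_EOL;   // ~1.99: almost plain least squares
echo ridgeWeight($x, $y, 10.0) . PHP_EOL;   // noticeably shrunk toward zero
echo ridgeWeight($x, $y, 1000.0) . PHP_EOL; // ~0.06: almost fully suppressed
```

This is why the tiny alpha of 1e-6 in the listing behaves almost like ordinary least squares, while a larger alpha would trade some fit on the training data for smaller, more stable weights.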