Linear regression as a basic model

Case 4. Estimating the likely customer check

In this example we use Ridge regression from Rubix ML. The model is trained on a small dataset, predicts log(check), and the result is converted back into a money amount with exp().

<?php

use Rubix\ML\Datasets\Labeled;
use Rubix\ML\Datasets\Unlabeled;
use Rubix\ML\Regressors\Ridge;

// Features: visits, time_on_site_seconds, pageviews, discount_percent
$samples = [
    [3, 420, 5, 0],
    [10, 1800, 20, 10],
    [1, 120, 2, 0],
    [7, 900, 12, 5],
];

// Target: log(check_amount)
$labels = [
    log(3500),
    log(12000),
    log(1800),
    log(7200),
];

$dataset = Labeled::build($samples, $labels);

// Ridge regression (L2 regularization)
// With a tiny alpha (1e-6) it behaves almost like ordinary least squares.
$model = new Ridge(1e-6);
$model->train($dataset);

// Customer features for prediction
// [visits, time_on_site_seconds, pageviews, discount_percent]
$customer = [5, 600, 8, 5];

$unlabeled = new Unlabeled([$customer]);
$logPrice = $model->predict($unlabeled);
$predictedPrice = exp($logPrice[0]);

$weights = $model->coefficients();
$bias = $model->bias();

echo 'Predicted check: ' . round($predictedPrice, 2) . PHP_EOL;
echo 'Predicted log(check): ' . round($logPrice[0], 6) . PHP_EOL . PHP_EOL;

echo 'Coefficients (feature weights):' . PHP_EOL;
echo '0 - visits, 1 - time_on_site_seconds, 2 - pageviews, 3 - discount_percent' . PHP_EOL;
print_r($weights);

echo PHP_EOL . 'Bias (intercept): ' . $bias . PHP_EOL;

Result: Memory: 0.994 Mb, Time running: 0.012 sec.
Predicted check: 3282.17
Predicted log(check): 8.096259

Coefficients (feature weights):
0 - visits, 1 - time_on_site_seconds, 2 - pageviews, 3 - discount_percent
Array
(
    [0] => 0.19882501289248
    [1] => -0.00031492511334363
    [2] => 0.12060119584203
    [3] => -0.15340525191277
)

Bias (intercept): 7.0933057013899
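
As a sanity check, the prediction can be reproduced by hand: in log-space the model is just a dot product of the features with the weights plus the bias, and exp() maps it back to money. A minimal sketch in plain PHP, with the weights and bias copied from the output above:

```php
<?php

// Coefficients and bias as printed by the trained model above
$weights = [0.19882501289248, -0.00031492511334363, 0.12060119584203, -0.15340525191277];
$bias = 7.0933057013899;

// The same customer: [visits, time_on_site_seconds, pageviews, discount_percent]
$customer = [5, 600, 8, 5];

// Linear model in log-space: log(check) = w · x + b
$logCheck = $bias;
foreach ($customer as $i => $feature) {
    $logCheck += $weights[$i] * $feature;
}

echo 'log(check) = ' . round($logCheck, 6) . PHP_EOL; // 8.096259
echo 'check = ' . round(exp($logCheck), 2) . PHP_EOL; // 3282.17
```

Both numbers match the model's output, which confirms that Ridge here is an ordinary linear model applied to log-transformed targets.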

Why predicting log(check) is a useful trick:

  • checks often have a heavy tail; log() reduces the impact of rare large purchases;
  • in log-space, the model often better matches multiplicative effects (e.g. "twice as many pageviews");
  • exp() guarantees a positive predicted check;
  • Ridge adds L2 regularization and helps reduce overfitting on small datasets.
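
The second and third points can be seen directly in code: log() turns a multiplicative relationship into an additive one, and exp() of any real number is strictly positive. A small self-contained illustration:

```php
<?php

// log() turns multiplication into addition:
// log(2 * x) = log(2) + log(x), so "twice as many pageviews"
// becomes a constant shift in log-space.
$x = 1800.0;
var_dump(abs(log(2 * $x) - (log(2) + log($x))) < 1e-12); // bool(true)

// exp() of any real-valued model output is strictly positive,
// so the predicted check can never drop below zero ...
$veryNegativeLogPrediction = -50.0;
var_dump(exp($veryNegativeLogPrediction) > 0); // bool(true)

// ... while a linear model fit on raw money amounts can
// happily predict a negative check for an unusual customer.
```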

In a real product you would expand the dataset and feature set (traffic source, category interest, purchase history), but even this baseline can be a good starting point.
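
To see what the L2 penalty in Ridge actually does, here is a one-dimensional sketch in plain PHP (no library needed). For a single feature with no intercept, the ridge solution has the closed form w = Σxy / (Σx² + α): as α grows, the denominator grows and the weight shrinks toward zero. The data below is made up for the illustration.

```php
<?php

// One-feature, no-intercept ridge regression in closed form:
// w = sum(x * y) / (sum(x^2) + alpha)
function ridgeWeight(array $x, array $y, float $alpha): float
{
    $xy = 0.0;
    $xx = 0.0;
    foreach ($x as $i => $xi) {
        $xy += $xi * $y[$i];
        $xx += $xi * $xi;
    }

    return $xy / ($xx + $alpha);
}

$x = [1.0, 2.0, 3.0, 4.0];
$y = [2.1, 3.9, 6.2, 7.8]; // roughly y = 2x

echo ridgeWeight($x, $y, 1e-6) . PHP_EOL;   // ~1.99: almost plain least squares
echo ridgeWeight($x, $y, 10.0) . PHP_EOL;   // noticeably shrunk toward zero
echo ridgeWeight($x, $y, 1000.0) . PHP_EOL; // ~0.06: almost fully suppressed
```

This is why the tiny alpha of 1e-6 in the listing behaves almost like ordinary least squares, while a larger alpha would trade some fit on the training data for smaller, more stable weights.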