Linear regression as a basic model
Case 5. Predicting market salary
We use Ridge regression (linear regression with an L2 penalty on the weights): the penalty shrinks the coefficients, which makes the model more stable on small datasets and helps reduce overfitting.
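Before the library example, a minimal plain-PHP sketch (not Rubix ML) of what the L2 penalty actually does, using the one-dimensional case without an intercept; the data here is made up purely for illustration:

```php
<?php

// One-dimensional ridge regression without an intercept.
// Minimizing sum((y - w*x)^2) + alpha * w^2 gives the closed form
//   w = sum(x*y) / (sum(x^2) + alpha),
// so a larger alpha shrinks the weight toward zero.
function ridgeWeight(array $x, array $y, float $alpha): float
{
    $xy = 0.0;
    $xx = 0.0;
    foreach ($x as $i => $xi) {
        $xy += $xi * $y[$i];
        $xx += $xi * $xi;
    }
    return $xy / ($xx + $alpha);
}

$x = [1, 2, 3, 4];
$y = [2, 4, 6, 8]; // exactly y = 2x

echo ridgeWeight($x, $y, 0.0) . PHP_EOL; // 2 (ordinary least squares)
echo ridgeWeight($x, $y, 1.0) . PHP_EOL; // below 2: shrunk by the penalty
```

With alpha = 0 the formula reduces to ordinary least squares; as alpha grows, the weight is pulled toward zero, trading a little bias for lower variance on small or noisy datasets.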
Example of use
<?php

require __DIR__ . '/vendor/autoload.php'; // Composer autoloader

use Rubix\ML\Datasets\Labeled;
use Rubix\ML\Datasets\Unlabeled;
use Rubix\ML\Regressors\Ridge;
// Features: experience_years, technologies_score, company_size_level, remote
$samples = [
    [1, 2, 1, 1],
    [3, 4, 2, 1],
    [5, 6, 3, 0],
    [7, 8, 3, 1],
    [10, 10, 3, 1],
];
// Target: salary_usd
$labels = [
    1500,
    2800,
    4500,
    6200,
    8000,
];
$dataset = Labeled::build($samples, $labels);
$model = new Ridge(1.0); // 1.0 is alpha, the strength of the L2 penalty
$model->train($dataset);
// Candidate features for prediction
// [experience_years, technologies_score, company_size_level, remote]
$candidate = [4, 5, 2, 1];
$unlabeled = new Unlabeled([$candidate]);
$prediction = $model->predict($unlabeled);
$salary = $prediction[0];
$weights = array_map(function ($weight) {
    return number_format($weight, 2);
}, $model->coefficients());
$bias = number_format($model->bias(), 2);
echo 'Expected salary: ' . round($salary, 2) . PHP_EOL . PHP_EOL;
echo 'Coefficients (feature weights):' . PHP_EOL;
echo '0 - experience_years, 1 - technologies_score, 2 - company_size_level, 3 - remote' . PHP_EOL;
print_r($weights);
echo PHP_EOL . 'Bias (intercept): ' . $bias . PHP_EOL;
Result:
Memory: 0.995 Mb
Time running: 0.011 sec.
Expected salary: 3750
Coefficients (feature weights):
0 - experience_years, 1 - technologies_score, 2 - company_size_level, 3 - remote
Array
(
    [0] => 387.50
    [1] => 375.00
    [2] => 37.50
    [3] => 25.00
)
Bias (intercept): 225.00
How to interpret the output (weights and bias):
- the weight for experience shows how much the predicted salary increases on average per additional year, all else being equal;
- the weight for the technologies score reflects the premium for a more in-demand stack;
- the weight for company size often correlates with compensation bands;
- the weight for remote work may capture a market adjustment (but in practice it heavily depends on region and company policy).
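The weights can also be read back as an explicit linear formula: the prediction is the bias plus each weight times its feature value. A plain-PHP check using the coefficients and bias printed in the run above:

```php
<?php

// A ridge model is still a linear formula:
//   y = bias + sum(weight_i * feature_i).
// The numbers below are the weights and bias from the run above.
$weights = [387.50, 375.00, 37.50, 25.00];
$bias = 225.00;

// [experience_years, technologies_score, company_size_level, remote]
$candidate = [4, 5, 2, 1];

$salary = $bias;
foreach ($weights as $i => $weight) {
    $salary += $weight * $candidate[$i];
}

echo $salary . PHP_EOL; // 3750, the same value as "Expected salary" above
```

This is why linear models are popular as baselines: every prediction decomposes into per-feature contributions that can be inspected directly.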
In a real task you would add much more data, normalize the features, include categorical signals (region, level, industry), and evaluate the model on a held-out test set.
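As a sketch of the normalization step, here is manual z-score standardization of the feature matrix from the example, in plain PHP. In Rubix ML itself you would typically attach a transformer such as ZScaleStandardizer rather than doing this by hand:

```php
<?php

// Z-score standardization: rescale each feature column to
// mean 0 and standard deviation 1, so that no single feature
// dominates the L2 penalty just because of its scale.
$samples = [
    [1, 2, 1, 1],
    [3, 4, 2, 1],
    [5, 6, 3, 0],
    [7, 8, 3, 1],
    [10, 10, 3, 1],
];

$n = count($samples);
$cols = count($samples[0]);
$standardized = $samples;

for ($j = 0; $j < $cols; $j++) {
    $column = array_column($samples, $j);
    $mean = array_sum($column) / $n;

    $variance = 0.0;
    foreach ($column as $value) {
        $variance += ($value - $mean) ** 2;
    }
    $std = sqrt($variance / $n);

    foreach ($samples as $i => $row) {
        $standardized[$i][$j] = ($row[$j] - $mean) / $std;
    }
}

print_r($standardized[0]); // first sample, now in standardized units
```

Without this step, a feature measured in large units (say, raw salary history) would be penalized far more heavily by the same alpha than a 0/1 flag like remote, which skews the learned weights.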