Linear regression as a basic model

Case 5. Predicting market salary

We use Ridge regression (linear regression with L2 regularization): it is more stable on small datasets and helps reduce overfitting.
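
For reference, the textbook form of the Ridge objective is shown below. This is the standard formulation rather than a statement about the library's internals: the symbols $x_i$ (feature vector), $y_i$ (salary), $w$ (weights), $b$ (bias) and $\alpha$ (L2 penalty) are notation introduced here, and $\alpha$ is assumed to correspond to the value 1.0 passed to the Ridge constructor in the code below.

```latex
\min_{w,\,b} \; \sum_{i=1}^{n} \bigl( y_i - (w^\top x_i + b) \bigr)^2 \;+\; \alpha \, \lVert w \rVert_2^2
```

The first term is the usual least-squares error; the second term penalizes large weights, which is what makes the fit more stable when there are only a few samples.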

 
<?php

use Rubix\ML\Datasets\Labeled;
use Rubix\ML\Datasets\Unlabeled;
use Rubix\ML\Regressors\Ridge;

// Features: experience_years, technologies_score, company_size_level, remote
$samples = [
    [1, 2, 1, 1],
    [3, 4, 2, 1],
    [5, 6, 3, 0],
    [7, 8, 3, 1],
    [10, 10, 3, 1],
];

// Target: salary_usd
$labels = [
    1500,
    2800,
    4500,
    6200,
    8000,
];

$dataset = Labeled::build($samples, $labels);

// Ridge regression with an L2 penalty of 1.0
$model = new Ridge(1.0);
$model->train($dataset);

// Candidate features for prediction
// [experience_years, technologies_score, company_size_level, remote]
$candidate = [4, 5, 2, 1];

$unlabeled = new Unlabeled([$candidate]);
$prediction = $model->predict($unlabeled);

$salary = $prediction[0];

// Format the learned weights and bias to two decimal places for display
$weights = array_map(function ($weight) {
    return number_format($weight, 2);
}, $model->coefficients());
$bias = number_format($model->bias(), 2);

echo 'Expected salary: ' . round($salary, 2) . PHP_EOL . PHP_EOL;

echo 'Coefficients (feature weights):' . PHP_EOL;
echo '0 - experience_years, 1 - technologies_score, 2 - company_size_level, 3 - remote' . PHP_EOL;
print_r($weights);

echo PHP_EOL . 'Bias (intercept): ' . $bias . PHP_EOL;

Result:

Expected salary: 3750

Coefficients (feature weights):
0 - experience_years, 1 - technologies_score, 2 - company_size_level, 3 - remote
Array
(
    [0] => 387.50
    [1] => 375.00
    [2] => 37.50
    [3] => 25.00
)

Bias (intercept): 225.00

How to interpret the output (weights and bias):

  • the weight for experience shows how much the predicted salary increases per additional year, with the other features held fixed;
  • the weight for the technologies score reflects the premium for a more in-demand stack;
  • the weight for company size reflects the fact that larger companies often have higher compensation bands;
  • the weight for remote work may capture a market adjustment (but in practice it heavily depends on region and company policy);
  • the bias is the baseline prediction when all features are zero.
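
To make the interpretation concrete, here is a minimal sketch that rebuilds the prediction by hand from the printed weights and bias. It uses plain PHP without Rubix ML; the helper `linearPrediction()` and the `$names` array are hypothetical names introduced only for this illustration.

```php
<?php

// Weights and bias copied from the printed output above.
$weights = [387.50, 375.00, 37.50, 25.00];
$bias = 225.00;

// Same candidate as in the example:
// [experience_years, technologies_score, company_size_level, remote]
$candidate = [4, 5, 2, 1];

// Labels used only for display in this sketch.
$names = ['experience_years', 'technologies_score', 'company_size_level', 'remote'];

// Hypothetical helper (not part of Rubix ML):
// prediction = bias + sum(weight_i * feature_i)
function linearPrediction(array $weights, float $bias, array $features): float
{
    $sum = $bias;
    foreach ($weights as $i => $weight) {
        $sum += $weight * $features[$i];
    }

    return $sum;
}

// Show how much each feature contributes to the final number.
foreach ($weights as $i => $weight) {
    echo $names[$i] . ': ' . number_format($weight * $candidate[$i], 2) . PHP_EOL;
}
echo 'bias: ' . number_format($bias, 2) . PHP_EOL;

echo 'Prediction: ' . number_format(linearPrediction($weights, $bias, $candidate), 2) . PHP_EOL;
```

The contributions are 1550.00, 1875.00, 75.00 and 25.00, plus the bias of 225.00, which add up to the 3750 reported by the model. With only five samples and strongly correlated features, the split between individual weights should be read with caution.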

In a real task you would add much more data, normalize features, include categorical signals (region, level, industry), and evaluate the model on a test set.
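
As an illustration of the normalization step, the sketch below applies z-score standardization (mean 0, standard deviation 1 per column) to the feature rows from the example. It is plain PHP, and `standardizeColumns()` is a hypothetical helper written for this sketch. In a real pipeline you would more likely use the library's own transformer (Rubix ML provides `ZScaleStandardizer`, to the best of my knowledge) and fit the statistics on the training split only.

```php
<?php

// Feature rows from the example above:
// [experience_years, technologies_score, company_size_level, remote]
$samples = [
    [1, 2, 1, 1],
    [3, 4, 2, 1],
    [5, 6, 3, 0],
    [7, 8, 3, 1],
    [10, 10, 3, 1],
];

// Hypothetical helper (not part of Rubix ML):
// rescales every column to mean 0 and standard deviation 1.
function standardizeColumns(array $samples): array
{
    $rows = count($samples);
    $columns = count($samples[0]);
    $result = $samples;

    for ($j = 0; $j < $columns; $j++) {
        $column = array_column($samples, $j);
        $mean = array_sum($column) / $rows;

        // Population variance of the column.
        $variance = 0.0;
        foreach ($column as $value) {
            $variance += ($value - $mean) ** 2;
        }
        $std = sqrt($variance / $rows);

        foreach ($samples as $i => $row) {
            // A constant column has zero spread; map it to 0 to avoid division by zero.
            $result[$i][$j] = $std > 0.0 ? ($row[$j] - $mean) / $std : 0.0;
        }
    }

    return $result;
}

$standardized = standardizeColumns($samples);

// Print the first standardized row for inspection.
print_r($standardized[0]);
```

After standardization the weights become comparable across features, because each one then describes the effect of a one-standard-deviation change rather than a change of one raw unit.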