Linear regression as a basic model

Case 2. Predicting a developer’s task completion time

In this example, we use Ridge regression from RubixML to predict a developer’s task completion time from several numeric features and, at the same time, inspect the model’s weights and bias.

Example of use

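<?php

// Composer autoloader (assumed location: vendor/ next to this script)
require_once __DIR__ . '/vendor/autoload.php';
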
use Rubix\ML\Datasets\Labeled;
use Rubix\ML\Datasets\Unlabeled;
use Rubix\ML\Regressors\Ridge;

// Training samples: each row is a completed task with its feature vector
// x = (x1, x2, x3, x4):
// [story_points (x1), files_changed (x2), lines_changed (x3), developer_experience (x4)]
$samples = [
    [5, 3, 200, 24],
    [8, 5, 500, 12],
    [3, 1, 100, 36],
];

// Target values: actual completion time in hours for each task above
$labels = [
    6.5,  // hours for task 1
    12.0, // hours for task 2
    4.0,  // hours for task 3
];

// Build a labeled dataset (features X + targets y)
$dataset = new Labeled($samples, $labels);

// Ridge regression: linear regression with L2 regularization
// With a tiny alpha (1e-6) it behaves almost like ordinary least squares.
$model = new Ridge(1e-6);

// Train the model
$model->train($dataset);

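// A new task that is not finished yet, with features in the same order as above
$newTask = [6, 4, 300, 18];
$unlabeled = new Unlabeled([$newTask]);

// predict() returns one estimate per sample, so we read index 0 later
$predictions = $model->predict($unlabeled);

// Learned parameters: one weight per feature plus the intercept
$weights = $model->coefficients();
$bias = $model->bias();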

echo 'Estimated task completion time: ' . round($predictions[0], 1) . 'h' . PHP_EOL . PHP_EOL;

echo 'Coefficients (feature weights):' . PHP_EOL;
echo '0 - story_points, 1 - files_changed, 2 - lines_changed, 3 - developer_experience' . PHP_EOL;
print_r($weights);

echo PHP_EOL . 'Bias (intercept): ' . $bias . PHP_EOL;

Result (memory: 1.069 MB, running time: 0.011 s):

Estimated task completion time: 8.5h

Coefficients (feature weights):
0 - story_points, 1 - files_changed, 2 - lines_changed, 3 - developer_experience
Array
(
    [0] => 0.010057024657726
    [1] => 0.013309612870216
    [2] => 0.014949715230614
    [3] => -0.079857930541039
)

Bias (intercept): 5.3364334106445
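
To see where the 8.5h estimate comes from, the prediction can be reproduced by hand. A linear model returns the bias plus the weighted sum of the features. The worked calculation below plugs the new task (6, 4, 300, 18) into that formula, using the weights and bias printed above rounded to five significant digits, so the last digits are approximate. The symbols w_1..w_4 correspond to array indices 0..3 in the output.

```latex
\begin{aligned}
\hat{y} &= b + w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4 \\
        &\approx 5.3364 + 0.010057 \cdot 6 + 0.013310 \cdot 4 + 0.014950 \cdot 300 - 0.079858 \cdot 18 \\
        &\approx 5.3364 + 0.0603 + 0.0532 + 4.4849 - 1.4374 \\
        &\approx 8.497 \approx 8.5 \text{ hours}
\end{aligned}
```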

The model is now explainable. Each weight is the coefficient of one feature and shows how the estimate changes when that feature grows by one unit while the others stay fixed:

the weight for story points shows how many hours, on average, one additional story point adds;
the weight for the number of changed files reflects context-switching overhead;
the weight for the number of changed lines often correlates with the amount of manual work;
the negative weight for developer experience is expected: the more experience, the less time the same task usually takes, so the relationship is inverse.
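
One way to make this reading concrete is to split a single estimate into per-feature contributions (weight × feature value). The sketch below is a minimal illustration in plain PHP, not part of RubixML: the weights and bias are copied from the output above, the array keys and the variables `$contributions` and `$estimate` are names chosen for this example, and the values come from only three training samples, so they are illustrative.

```php
<?php

// Weights and bias copied from the run above (assumption: same trained model).
$weights = [
    'story_points'         => 0.010057024657726,
    'files_changed'        => 0.013309612870216,
    'lines_changed'        => 0.014949715230614,
    'developer_experience' => -0.079857930541039,
];
$bias = 5.3364334106445;

// The new task from the example, keyed by feature name.
$newTask = [
    'story_points'         => 6,
    'files_changed'        => 4,
    'lines_changed'        => 300,
    'developer_experience' => 18,
];

// Contribution of each feature in hours: weight * feature value.
$contributions = [];
foreach ($weights as $name => $weight) {
    $contributions[$name] = $weight * $newTask[$name];
}

// The estimate is the bias plus the sum of all contributions.
$estimate = $bias + array_sum($contributions);

foreach ($contributions as $name => $hours) {
    printf("%-22s %+8.3f h\n", $name, $hours);
}
printf("%-22s %+8.3f h\n", 'bias', $bias);
printf("%-22s %8.3f h\n", 'estimate', $estimate);
```

The contributions and the bias add up to the same 8.5h estimate after rounding. The breakdown also shows that a small weight does not mean a small effect: lines_changed has a weight of about 0.015 per line, yet with 300 lines it contributes roughly 4.5 hours, more than any other feature.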

A model like this can already be discussed with the team, and the feature set can be adjusted deliberately.