Linear regression as a basic model

Case 2. Predicting a developer’s task completion time

In this example, we use Ridge regression from RubixML to predict a developer’s task completion time from several numeric features and, at the same time, inspect the model’s weights and bias.

 
<?php

use Rubix\ML\Datasets\Labeled;
use Rubix\ML\Datasets\Unlabeled;
use Rubix\ML\Regressors\Ridge;

// Training samples: each row is a completed task with its feature vector
// x = (x1, x2, x3, x4):
// [story_points (x1), files_changed (x2), lines_changed (x3), developer_experience (x4)]
$samples = [
    [5, 3, 200, 24],
    [8, 5, 500, 12],
    [3, 1, 100, 36],
];

// Target values: actual completion time in hours for each task above
$labels = [
    6.5,  // hours for task 1
    12.0, // hours for task 2
    4.0,  // hours for task 3
];

// Build a labeled dataset (features X + targets y)
$dataset = new Labeled($samples, $labels);

// Ridge regression: linear regression with L2 regularization
// With a tiny alpha (1e-6) it behaves almost like ordinary least squares.
$model = new Ridge(1e-6);

// Train the model
$model->train($dataset);

// Predict the completion time for a new, not yet completed task
$newTask = [6, 4, 300, 18];
$unlabeled = new Unlabeled([$newTask]);
$predictions = $model->predict($unlabeled);

// Learned parameters: one weight per feature plus the intercept
$weights = $model->coefficients();
$bias = $model->bias();

echo 'Estimated task completion time: ' . round($predictions[0], 1) . 'h' . PHP_EOL . PHP_EOL;

echo 'Coefficients (feature weights):' . PHP_EOL;
echo '0 - story_points, 1 - files_changed, 2 - lines_changed, 3 - developer_experience' . PHP_EOL;
print_r($weights);

echo PHP_EOL . 'Bias (intercept): ' . $bias . PHP_EOL;

Result (memory: 0.994 MB, running time: 0.009 sec):
Estimated task completion time: 8.5h

Coefficients (feature weights):
0 - story_points, 1 - files_changed, 2 - lines_changed, 3 - developer_experience
Array
(
    [0] => 0.010057024657726
    [1] => 0.013309612870216
    [2] => 0.014949715230614
    [3] => -0.079857930541039
)

Bias (intercept): 5.3364334106445
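
The printed weights and bias are enough to reproduce the estimate by hand, because a linear model computes the bias plus the weighted sum of the features. Below is a minimal sketch in plain PHP. The numbers are copied from the output above, the feature values are those of the new task (6, 4, 300, 18), and the helper name predictHours is hypothetical (it is not part of RubixML):

```php
<?php

// Weights and bias copied from the printed output above.
$weights = [
    0.010057024657726,  // story_points
    0.013309612870216,  // files_changed
    0.014949715230614,  // lines_changed
    -0.079857930541039, // developer_experience
];
$bias = 5.3364334106445;

// Hypothetical helper (not a RubixML function):
// prediction = bias + sum of (weight * feature value).
function predictHours(array $weights, float $bias, array $features): float
{
    $sum = $bias;
    foreach ($weights as $i => $weight) {
        $sum += $weight * $features[$i];
    }

    return $sum;
}

$newTask = [6, 4, 300, 18];

echo round(predictHours($weights, $bias, $newTask), 1) . 'h' . PHP_EOL; // 8.5h
```

Feeding the three training rows through the same helper returns roughly 6.5, 12.0 and 4.0 hours, which is a quick check that the printed parameters really describe the trained model.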

The model is now explainable, because each weight (coefficient) can be read directly:

  • the weight for story points shows how many hours, on average, one additional SP adds when the other features stay the same;
  • the weight for the number of files changed reflects context-switching overhead;
  • the weight for the number of lines changed often correlates with the amount of manual work;
  • a negative weight for developer experience is expected and logical: the more experienced the developer, the less time the same task usually takes, so the relationship is inverse.

A model like this can already be discussed with the team, and the feature set can be adjusted deliberately.
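
One way to ground such a discussion is to split a single estimate into per-feature contributions (weight times feature value). The sketch below does this for the new task from the example. The weights and bias are copied from the printed output, the variable names are illustrative, and nothing here is RubixML API:

```php
<?php

$names   = ['story_points', 'files_changed', 'lines_changed', 'developer_experience'];
$weights = [0.010057024657726, 0.013309612870216, 0.014949715230614, -0.079857930541039];
$bias    = 5.3364334106445;
$newTask = [6, 4, 300, 18];

// Contribution of each feature to the estimate, in hours.
$contributions = [];
foreach ($names as $i => $name) {
    $contributions[$name] = $weights[$i] * $newTask[$i];
}

// Sort by absolute effect, largest first, keeping the feature names as keys.
uasort($contributions, fn (float $a, float $b): int => abs($b) <=> abs($a));

foreach ($contributions as $name => $hours) {
    printf("%-22s %+.2f h%s", $name, $hours, PHP_EOL);
}
printf("%-22s %+.2f h%s", 'bias', $bias, PHP_EOL);
```

For this task the lines_changed term accounts for about +4.48 h and developer_experience for about -1.44 h, while story_points and files_changed each add less than 0.1 h. Together with the bias, the contributions sum to the 8.5 h estimate.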