Linear regression as a basic model

Case 2. Predicting a developer’s task completion time

In real teams, the question "how long will this task take?" comes up almost every day. Release dates, developer workload, and business expectations all depend on it. Estimates are usually given by gut feeling — based on the team lead’s or the developer’s own experience. Linear regression lets us formalize this process and build a baseline model that predicts time based on task features.

Suppose we have historical task data: for each task we know how many hours it actually took and a feature vector $\mathbf{x} = (x_1, x_2, x_3, x_4)$, where:

  • $x_1$ — story points (a measure of task complexity);
  • $x_2$ — number of files touched by the change;
  • $x_3$ — number of changed lines (code diff);
  • $x_4$ — developer experience (for example, in years or as an encoded level junior/middle/senior);

We want to predict the target quantity – the actual task completion time in hours. Linear regression in this case defines a simple, interpretable model $ŷ = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4$, where the components of $\mathbf{x}$ describe the task and the coefficients $w_1, \dots, w_4$ and intercept $w_0$ are fitted on historical data.

Example of use:

 
<?php

use Rubix\ML\Datasets\Labeled;
use 
Rubix\ML\Regressors\Ridge;

// Training samples: each row is a completed task with its feature vector
// x = (x1, x2, x3, x4):
// [story_points (x1), files_changed (x2), lines_changed (x3), developer_experience (x4)]
$samples = [
    [
5320024],
    [
8550012],
    [
3110036],
];

// Target values: actual completion time in hours for each task above
$labels = [
    
6.5,  // hours for task 1
    
12.0// hours for task 2
    
4.0,  // hours for task 3
];

// Build a labeled dataset (features X + targets y)
$dataset = new Labeled($samples$labels);

// Ridge regression: linear regression with L2 regularization
// With alpha = 1e-6, Ridge regression is equivalent to linear regression
$model = new Ridge(1e-6);

// Train the model
$model->train($dataset);