Probability as degree of confidence

Case 1. Spam filter: probability ≠ decision

When developers hear the word "probability", they often picture dice, coin flips, and the school formula "favorable outcomes divided by all possible outcomes". That picture is useful but very narrow. In machine learning and applied analytics, probability almost always means something else: the degree of our confidence in a statement given the available data.
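To make the contrast concrete, the two readings can be written side by side. The notation here is only illustrative and is not taken from the original text: N_favorable and N_total count outcomes, and x stands for whatever data we have observed about an email.

```latex
% Classical (counting) definition: favorable outcomes over all outcomes
P(A) = \frac{N_{\text{favorable}}}{N_{\text{total}}}

% Example: rolling an even number on a fair six-sided die
P(\text{even}) = \frac{3}{6} = 0.5

% Degree of confidence: how strongly the data x supports the statement "spam"
P(\text{spam} \mid x) \in [0, 1]
```

The first form needs a list of equally likely outcomes to count. The second needs only a statement and some evidence, which is the situation a spam filter is in.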

Example of use:

 
<?php

use Rubix\ML\Classifiers\LogisticRegression;
use Rubix\ML\Datasets\Labeled;
use Rubix\ML\Datasets\Unlabeled;

// Each sample represents an email.
// Features: [subject length (tokens), number of links in the message body]
$samples = [
    [3, 1],   // short subject, few links
    [15, 8],  // long subject, many links
    [5, 0],   // medium subject, no links
];

// Class labels for each email: inbox mail vs spam
$labels = ['normal', 'spam', 'normal'];

// Supervised training dataset: feature vectors + class labels
$dataset = new Labeled($samples, $labels);

// Logistic regression classifier from RubixML
$model = new LogisticRegression();

// Train the classifier on the labeled dataset
$model->train($dataset);

// New incoming email we want to classify
// Same feature schema: [subject length, number of links]
$sample = new Unlabeled([[12, 6]]);

// Predicted probability for each class (e.g. ['normal' => 0.32, 'spam' => 0.68])
$probabilities = $model->proba($sample)[0];