Why Naive Bayes works

Case 1. Categorical features and frequencies


Implementation in pure PHP

In this example we implement Naive Bayes by hand for a pair of simple categorical features and see what log-probability scores we get for each class.
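The listing relies on a helper file, code.php, that builds three structures from the training data: $data (the raw samples), $classCounts (samples per class) and $featureCounts (per-class counts of each feature value). The file itself is not shown here, so the following is only a hypothetical sketch of what it might look like; the four rows are invented, although they happen to reproduce the scores printed further below.

<?php

// Hypothetical sketch of code.php (the article does not show the file).
// The classifier below expects it to define $data, $classCounts and
// $featureCounts. These rows are invented for illustration.
$data = [
    ['from_ads' => 1, 'has_account' => 1, 'class' => 'buyer'],
    ['from_ads' => 1, 'has_account' => 0, 'class' => 'buyer'],
    ['from_ads' => 0, 'has_account' => 1, 'class' => 'browser'],
    ['from_ads' => 0, 'has_account' => 1, 'class' => 'browser'],
];

// $classCounts[class] = number of training samples in that class.
$classCounts = [];

// $featureCounts[class][feature][value] = how often feature=value
// occurs among the samples of that class.
$featureCounts = [];

foreach ($data as $row) {
    $class = $row['class'];
    $classCounts[$class] = ($classCounts[$class] ?? 0) + 1;

    foreach ($row as $feature => $value) {
        if ($feature === 'class') {
            continue;
        }
        $featureCounts[$class][$feature][$value] =
            ($featureCounts[$class][$feature][$value] ?? 0) + 1;
    }
}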

 
<?php

include __DIR__ . '/code.php';

// Classify a new sample using Naive Bayes.
$input = ['from_ads' => true, 'has_account' => true];

// We will store a score for each class (log-probability).
$scores = [];

foreach ($classCounts as $class => $count) {
    // Start with the log prior: log P(class).
    $logProb = log($count / count($data));

    foreach ($input as $feature => $value) {
        // Booleans become 0/1 keys in PHP arrays.
        $valueKey = (int) $value;

        // Count how often feature=value occurs in this class.
        $featureCount = $featureCounts[$class][$feature][$valueKey] ?? 0;
        $total = $classCounts[$class];

        // Add log P(feature=value | class) using Laplace smoothing:
        // (count + 1) / (total + K), where K is the number of possible values.
        // Here K = 2 because the feature is boolean.
        $logProb += log(($featureCount + 1) / ($total + 2));
    }

    // Final score for this class.
    $scores[$class] = $logProb;
}

// The highest score (closest to 0) corresponds to the most likely class,
// so after sorting in descending order the first key is the prediction.
arsort($scores);

// Print raw scores for inspection.
print_r($scores);

Result: Memory: 0.007 MB, time running: < 0.001 sec.
Array
(
    [buyer] => -1.6739764335717
    [browser] => -2.3671236141316
)

The output shows the logarithm of the probability score for each class. Because log-probabilities are negative, the larger value, i.e. the one closer to 0, marks the more likely class; here that is buyer. Working in logarithms lets us add one term per feature instead of multiplying many small probabilities, and Laplace smoothing keeps a single unseen feature value from zeroing out an entire class: a value seen, say, 7 times in a class of 10 samples gets the estimate (7 + 1) / (10 + 2) ≈ 0.67 instead of the raw 7 / 10. After arsort() the first key of $scores is the predicted class.
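If the raw log scores are hard to interpret, one optional extra step (not part of the original listing) is to convert them back into probabilities that sum to 1. Subtracting the maximum score before exponentiating keeps the arithmetic numerically stable:

// Convert the log scores into normalized probabilities.
// Subtracting the max first avoids underflow for very negative scores.
$max = max($scores);
$exp = array_map(fn ($s) => exp($s - $max), $scores);
$sum = array_sum($exp);

foreach ($exp as $class => $e) {
    printf("%s: %.3f\n", $class, $e / $sum);
}

For the scores above this prints roughly 0.667 for buyer and 0.333 for browser, i.e. a 2:1 preference for buyer.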