Case 3: Comparing texts

Implementation in pure PHP

In this variant we compare texts using word-frequency vectorization (TF) and cosine similarity. We convert documents and query to term-frequency vectors, compute similarity, and sort results by descending relevance.

Example of code


<?php

function textToVector(string $text): array {
    $tokens = preg_split('/\W+/u', mb_strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $vector = [];

    foreach ($tokens as $token) {
        $vector[$token] = ($vector[$token] ?? 0) + 1;
    }

    return $vector;
}

function alignVectors(array $a, array $b): array {
    $keys = array_unique(array_merge(array_keys($a), array_keys($b)));

    $va = [];
    $vb = [];

    foreach ($keys as $key) {
        $va[] = $a[$key] ?? 0;
        $vb[] = $b[$key] ?? 0;
    }

    return [$va, $vb];
}

function dotProduct(array $a, array $b): float {
    $n = count($a);

    if ($n !== count($b)) {
        throw new InvalidArgumentException('Vectors must have the same length');
    }

    $sum = 0.0;

    for ($i = 0; $i < $n; $i++) {
        $sum += $a[$i] * $b[$i];
    }

    return $sum;
}

function cosineSimilarity(array $a, array $b): float {
    $dot = dotProduct($a, $b);
    $normA = sqrt(dotProduct($a, $a));
    $normB = sqrt(dotProduct($b, $b));

    if ($normA == 0.0 || $normB == 0.0) {
        return 0.0;
    }

    return $dot / ($normA * $normB);
}

Implementation in RubixML

In the second variant we use RubixML: normalize text, apply WordCount, then transform features into TF-IDF vectors. After that we compare query and documents via cosine distance and rank texts by similarity score.

Example of code


<?php

use Rubix\ML\Datasets\Unlabeled;
use Rubix\ML\Transformers\TextNormalizer;
use Rubix\ML\Transformers\TfIdfTransformer;
use Rubix\ML\Transformers\WordCountVectorizer;

$documents = [
    'Machine learning is used to analyze data, which is crucial for the outcome.',
    'Deep learning is a branch of machine learning.',
    'Cats and dogs are popular pets.',
];

$query = ['machine learning algorithms'];

$dataset = new Unlabeled(array_merge($documents, $query));

$dataset->apply(new TextNormalizer());
$dataset->apply(new WordCountVectorizer());

$tfidf = new TfIdfTransformer();
$tfidf->fit($dataset);
$dataset->apply($tfidf);