Case 3: Comparing texts
Implementation in pure PHP
In this variant we compare texts using term-frequency (TF) vectors and cosine similarity. We convert each document and the query into term-frequency vectors, compute the cosine similarity between the query and every document, and sort the results by descending similarity.
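For reference, the cosine similarity mentioned above can be written as follows. The symbols a, b and n are notation introduced here for illustration: a and b are the aligned term-frequency vectors of two texts, and n is the size of their combined vocabulary.

```latex
\cos(\mathbf{a}, \mathbf{b})
  = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
  = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \, \sqrt{\sum_{i=1}^{n} b_i^2}}
```

As a small worked example, the texts "machine learning" and "deep learning" over the vocabulary (machine, learning, deep) give a = (1, 1, 0) and b = (0, 1, 1), so the dot product is 1, both norms are sqrt(2), and the similarity is 1 / 2 = 0.5.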
Code example
<?php

// Build a term-frequency map: token => number of occurrences.
function textToVector(string $text): array {
    $tokens = preg_split('/\W+/u', mb_strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $vector = [];
    foreach ($tokens as $token) {
        $vector[$token] = ($vector[$token] ?? 0) + 1;
    }
    return $vector;
}

// Project two term-frequency maps onto a shared vocabulary,
// returning two plain lists of equal length.
function alignVectors(array $a, array $b): array {
    $keys = array_unique(array_merge(array_keys($a), array_keys($b)));
    $va = [];
    $vb = [];
    foreach ($keys as $key) {
        $va[] = $a[$key] ?? 0;
        $vb[] = $b[$key] ?? 0;
    }
    return [$va, $vb];
}

// Dot product of two equal-length numeric lists.
function dotProduct(array $a, array $b): float {
    $n = count($a);
    if ($n !== count($b)) {
        throw new InvalidArgumentException('Vectors must have the same length');
    }
    $sum = 0.0;
    for ($i = 0; $i < $n; $i++) {
        $sum += $a[$i] * $b[$i];
    }
    return $sum;
}

// Cosine similarity of two aligned vectors; 0.0 if either vector is all zeros.
function cosineSimilarity(array $a, array $b): float {
    $dot = dotProduct($a, $b);
    $normA = sqrt(dotProduct($a, $a));
    $normB = sqrt(dotProduct($b, $b));
    if ($normA == 0.0 || $normB == 0.0) {
        return 0.0;
    }
    return $dot / ($normA * $normB);
}
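The listing above defines the building blocks but does not show the ranking step described in the text. The following is a minimal, self-contained sketch of that step, not a definitive implementation. The names termFrequencies, sparseCosine and rankDocuments are hypothetical and chosen so that they do not clash with the functions above. As a design variation, the similarity is computed directly on the sparse term-frequency maps instead of aligning them first. The sample documents and query are the same ones used in the RubixML variant. It assumes PHP 7.4 or later (arrow functions) and the mbstring extension.

```php
<?php

// Hypothetical helper: term-frequency map with the same tokenization as textToVector().
function termFrequencies(string $text): array {
    $tokens = preg_split('/\W+/u', mb_strtolower($text), -1, PREG_SPLIT_NO_EMPTY) ?: [];
    return array_count_values($tokens);
}

// Hypothetical helper: cosine similarity computed directly on two sparse maps.
// Terms missing from one map contribute nothing to the dot product.
function sparseCosine(array $a, array $b): float {
    $dot = 0.0;
    foreach ($a as $term => $count) {
        $dot += $count * ($b[$term] ?? 0);
    }
    $normA = sqrt(array_sum(array_map(fn ($c) => $c * $c, $a)));
    $normB = sqrt(array_sum(array_map(fn ($c) => $c * $c, $b)));
    if ($normA == 0.0 || $normB == 0.0) {
        return 0.0;
    }
    return $dot / ($normA * $normB);
}

// Hypothetical helper: score every document against the query
// and return the results sorted by descending similarity.
function rankDocuments(array $documents, string $query): array {
    $queryVector = termFrequencies($query);
    $results = [];
    foreach ($documents as $index => $text) {
        $results[] = [
            'index' => $index,
            'text'  => $text,
            'score' => sparseCosine(termFrequencies($text), $queryVector),
        ];
    }
    usort($results, fn ($x, $y) => $y['score'] <=> $x['score']);
    return $results;
}

$documents = [
    'Machine learning is used to analyze data, which is crucial for the outcome.',
    'Deep learning is a branch of machine learning.',
    'Cats and dogs are popular pets.',
];

$ranked = rankDocuments($documents, 'machine learning algorithms');

foreach ($ranked as $result) {
    printf("%.3f  %s\n", $result['score'], $result['text']);
}
```

With these inputs the second document should rank first, because it shares both "machine" and "learning" (the latter twice) with the query and is shorter than the first document. The third document shares no terms with the query and scores zero.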
Implementation in RubixML
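In the second variant we use RubixML: we normalize the text with TextNormalizer, count words with WordCountVectorizer, and then convert the counts into TF-IDF vectors with TfIdfTransformer. After that we compare the query with each document using the cosine distance (one minus the cosine similarity) and rank the texts by similarity score.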
Code example
<?php

use Rubix\ML\Datasets\Unlabeled;
use Rubix\ML\Transformers\TextNormalizer;
use Rubix\ML\Transformers\WordCountVectorizer;
use Rubix\ML\Transformers\TfIdfTransformer;
use Rubix\ML\Kernels\Distance\Cosine;

$documents = [
    'Machine learning is used to analyze data, which is crucial for the outcome.',
    'Deep learning is a branch of machine learning.',
    'Cats and dogs are popular pets.'
];
$query = ['machine learning algorithms'];

// The query is appended as the last sample so that it shares
// the vocabulary with the documents.
$dataset = new Unlabeled(array_merge($documents, $query));

// Normalize the text, then turn each text into word counts.
$dataset->apply(new TextNormalizer());
$dataset->apply(new WordCountVectorizer());

// Fit TF-IDF on the word counts and convert them to TF-IDF weights.
$tfidf = new TfIdfTransformer();
$tfidf->fit($dataset);
$dataset->apply($tfidf);