A simple TF–IDF example in PHP
Below is a minimal TF–IDF implementation in pure PHP. We take three short documents (about a cat and a dog), build a vocabulary, compute TF and IDF, and then produce TF–IDF weights for each term in each document.
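For reference, the weighting computed by the code in this example can be written out as follows. The formulas are read off the implementation (term count divided by document length, and the natural logarithm of N over document frequency). The symbols f_{t,d} (occurrences of term t in document d), |d| (number of tokens in d), N (number of documents) and df(t) (number of documents containing t) are notation introduced here, not names from the original.

```latex
\[
\mathrm{tf}(t,d) = \frac{f_{t,d}}{|d|}, \qquad
\mathrm{idf}(t) = \ln\frac{N}{\mathrm{df}(t)}, \qquad
\text{tf-idf}(t,d) = \mathrm{tf}(t,d)\cdot\mathrm{idf}(t)
\]
```

With this unsmoothed IDF, a term that occurs in every document has idf = ln 1 = 0, so its TF–IDF weight is 0 regardless of how often it appears.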
Example of use
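<?php
// --------------------
// Input documents (the three sample documents listed under "Documents")
// --------------------
$documents = [
    'The cat eats fish',
    'The cat loves fish',
    'The dog eats canned meat',
];
// --------------------
// Tokenization
// --------------------
// Split on single spaces; tokens stay case-sensitive ("The" and "the" differ).
function tokenize(string $text): array {
    return explode(' ', $text);
}
$tokenized = array_map('tokenize', $documents);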
// --------------------
// Build vocabulary
// --------------------
$vocab = [];
foreach ($tokenized as $doc) {
    foreach ($doc as $word) {
        $vocab[$word] = true;
    }
}
$vocab = array_keys($vocab);
// --------------------
// Term Frequency (TF)
// --------------------
function termFrequency(array $doc): array {
    $tf = [];
    $length = count($doc);
    foreach ($doc as $word) {
        $tf[$word] = ($tf[$word] ?? 0) + 1;
    }
    foreach ($tf as $word => $count) {
        $tf[$word] = $count / $length;
    }
    return $tf;
}
// --------------------
// Document Frequency + IDF
// --------------------
function documentFrequency(array $tokenized): array {
    $df = [];
    foreach ($tokenized as $doc) {
        // array_unique() ensures each document counts a word at most once.
        foreach (array_unique($doc) as $word) {
            $df[$word] = ($df[$word] ?? 0) + 1;
        }
    }
    return $df;
}
$df = documentFrequency($tokenized);
$N = count($tokenized);
$idf = [];
foreach ($df as $word => $count) {
    // Natural logarithm, no smoothing: a word present in every document gets IDF 0.
    $idf[$word] = log($N / $count);
}
// --------------------
// TF–IDF
// --------------------
function tfidf(array $tf, array $idf): array {
    $vector = [];
    foreach ($tf as $word => $value) {
        $vector[$word] = $value * ($idf[$word] ?? 0);
    }
    return $vector;
}
$tfidfVectors = [];
foreach ($tokenized as $doc) {
    $tf = termFrequency($doc);
    $tfidfVectors[] = tfidf($tf, $idf);
}
echo 'Vocabulary: ' . implode(', ', $vocab) . PHP_EOL . PHP_EOL;
foreach ($tfidfVectors as $i => $vector) {
    echo 'Document ' . ($i + 1) . ':' . PHP_EOL;
    foreach ($vector as $word => $value) {
        echo " $word => " . round($value, 3) . PHP_EOL;
    }
    // Print a blank line between documents, but not after the last one.
    if ($i < count($tfidfVectors) - 1) {
        echo PHP_EOL;
    }
}
Documents:
The cat eats fish
The cat loves fish
The dog eats canned meat
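These inputs are clean enough for a plain split on single spaces. For text with punctuation or mixed case, "The" and "the", or "fish" and "fish.", would be counted as different terms. The following is a minimal sketch of a more forgiving tokenizer. The function name tokenizeNormalized is hypothetical and not part of the original example, and the sketch assumes the mbstring extension is available.

```php
<?php
// Hypothetical alternative to tokenize(); not part of the original example.
// Assumes the mbstring extension is available for mb_strtolower().
function tokenizeNormalized(string $text): array {
    // Lowercase first so "The" and "the" map to the same term.
    $lower = mb_strtolower($text, 'UTF-8');
    // Split on any run of characters that are neither letters nor digits,
    // dropping empty strings produced by leading or trailing separators.
    $tokens = preg_split('/[^\p{L}\p{N}]+/u', $lower, -1, PREG_SPLIT_NO_EMPTY);
    // preg_split() returns false on failure (for example, invalid UTF-8 input).
    return $tokens === false ? [] : $tokens;
}

print_r(tokenizeNormalized('The cat eats fish. The CAT loves fish!'));
```

For the three documents above, swapping this in would only change the vocabulary entry "The" to "the"; the weights stay the same because the counts do not change.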
Result:
Vocabulary: The, cat, eats, fish, loves, dog, canned, meat

Document 1:
 The => 0
 cat => 0.101
 eats => 0.101
 fish => 0.101

Document 2:
 The => 0
 cat => 0.101
 loves => 0.275
 fish => 0.101

Document 3:
 The => 0
 dog => 0.22
 eats => 0.081
 canned => 0.22
 meat => 0.22
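The numbers above can be checked by hand. "The" appears in all three documents, so its IDF is ln(3/3) = 0 and its weight is 0 everywhere. "cat" appears once among the four tokens of Document 1 and in two of the three documents, giving 1/4 × ln(3/2) ≈ 0.101. The short sketch below recomputes a few entries; the helper name weight is hypothetical and introduced only for this check, and it assumes the same formula as the example (term count over document length, times the natural log of N over document frequency).

```php
<?php
// Hypothetical helper (not part of the original script):
// TF-IDF weight from raw counts, using the natural logarithm.
function weight(int $count, int $docLength, int $df, int $N): float {
    return ($count / $docLength) * log($N / $df);
}

$N = 3; // number of documents in the example

$checks = [
    'The   (doc 1)' => weight(1, 4, 3, $N), // in all 3 documents
    'cat   (doc 1)' => weight(1, 4, 2, $N), // in 2 of 3 documents
    'loves (doc 2)' => weight(1, 4, 1, $N), // in 1 of 3 documents
    'dog   (doc 3)' => weight(1, 5, 1, $N), // document 3 has 5 tokens
    'eats  (doc 3)' => weight(1, 5, 2, $N),
];

foreach ($checks as $label => $value) {
    echo $label . ' => ' . round($value, 3) . PHP_EOL;
}
```

Rare terms such as "loves" or "dog" score highest because they occur in only one document, while the longer Document 3 gives each of its terms a slightly lower TF (1/5 instead of 1/4).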