Case 1. Similar document search
Implementation in pure PHP
One of the most natural TF–IDF use cases is similar-text search. In this case, we build a mini search engine in pure PHP: tokenize the documents and the query, compute their TF–IDF vectors, compare the vectors with cosine similarity, and return the documents closest to the user's query.
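The pipeline described above can be sketched as follows. The function names tokenize, termFrequency, tfidf, and cosineSimilarity match the helpers called in the usage example, but their bodies here are assumptions, since code-en.php is not reproduced in this section; inverseDocumentFrequency is a hypothetical name introduced only for this illustration. The sketch assumes ASCII text, TF as count divided by document length, and IDF as ln(N / df).

```php
<?php
declare(strict_types=1);

// Hypothetical helper bodies: the real code-en.php may differ.

/** Lowercase the text and split it into word tokens (ASCII only). */
function tokenize(string $text): array
{
    // Split on any run of characters that is not a letter or a digit.
    return preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY) ?: [];
}

/** Relative frequency of each token: occurrences / total tokens. */
function termFrequency(array $tokens): array
{
    $total = count($tokens);
    $tf = [];
    foreach (array_count_values($tokens) as $term => $count) {
        $tf[$term] = $count / $total;
    }
    return $tf;
}

/** Assumed IDF: ln(N / df), df = number of documents containing the term. */
function inverseDocumentFrequency(array $tokenizedDocuments): array
{
    $df = [];
    foreach ($tokenizedDocuments as $tokens) {
        foreach (array_unique($tokens) as $term) {
            $df[$term] = ($df[$term] ?? 0) + 1;
        }
    }
    $n = count($tokenizedDocuments);
    $idf = [];
    foreach ($df as $term => $count) {
        $idf[$term] = log($n / $count);
    }
    return $idf;
}

/** Weight each TF by its IDF; terms unknown to the corpus are skipped. */
function tfidf(array $tf, array $idf): array
{
    $vector = [];
    foreach ($tf as $term => $value) {
        if (isset($idf[$term])) {
            $vector[$term] = $value * $idf[$term];
        }
    }
    return $vector;
}

/** Cosine similarity of two sparse vectors keyed by term. */
function cosineSimilarity(array $a, array $b): float
{
    $dot = 0.0;
    foreach ($a as $term => $weight) {
        $dot += $weight * ($b[$term] ?? 0.0);
    }
    $normA = sqrt(array_sum(array_map(fn($w) => $w * $w, $a)));
    $normB = sqrt(array_sum(array_map(fn($w) => $w * $w, $b)));
    if ($normA == 0.0 || $normB == 0.0) {
        return 0.0; // An empty vector is similar to nothing.
    }
    return $dot / ($normA * $normB);
}

// Build the index for the four sample documents of this case.
$documents = [
    1 => 'How to reset a user password',
    2 => 'Database connection error',
    3 => 'Configuring SMTP for sending email',
    4 => 'Restoring access to a user account',
];
$tokenized = array_map('tokenize', $documents); // keys 1..4 are preserved
$idf = inverseDocumentFrequency($tokenized);
$documentVectors = [];
foreach ($tokenized as $id => $tokens) {
    $documentVectors[$id] = tfidf(termFrequency($tokens), $idf);
}

$queryVector = tfidf(termFrequency(tokenize('I cannot recover the password of user')), $idf);
echo round(cosineSimilarity($queryVector, $documentVectors[1]), 2), PHP_EOL; // 0.58
```

Storing each vector as a term-keyed array keeps it sparse: only terms that actually occur are stored, and the dot product touches only the query's terms.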
Usage example
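<?php

// code-en.php is expected to define the helper functions used below
// and the prebuilt $documents, $documentVectors, and $idf arrays.
include 'code-en.php';

$query = 'I cannot recover the password of user';

// Build the query vector with the same pipeline as the documents.
$queryTokens = tokenize($query);
$queryTf = termFrequency($queryTokens);
$queryVector = tfidf($queryTf, $idf);

// Score every document against the query.
$results = [];
foreach ($documentVectors as $id => $vector) {
    $results[$id] = cosineSimilarity($queryVector, $vector);
}

// Sort by score, highest first, keeping the document IDs as keys.
arsort($results);

echo 'Results:' . PHP_EOL . '------------' . PHP_EOL;
foreach ($results as $id => $score) {
    echo 'Document ' . $id . ': ' . round($score, 2) . ' (' . $documents[$id] . ')' . PHP_EOL;
}
echo PHP_EOL . PHP_EOL;

// Debug output: the TF-IDF vector of each document.
echo 'Document vectors:' . PHP_EOL . '------------' . PHP_EOL;
foreach ($documentVectors as $id => $vector) {
    echo 'Document ' . $id . ': ' . PHP_EOL;
    print_r($vector);
    echo PHP_EOL;
}
echo PHP_EOL;

// Debug output: the IDF weight of each term in the corpus.
echo 'IDF:' . PHP_EOL . '------------' . PHP_EOL;
print_r($idf);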
Documents:
1. How to reset a user password
2. Database connection error
3. Configuring SMTP for sending email
4. Restoring access to a user account
Query: I cannot recover the password of user
Result:
Memory: 0.017 MB
Running time: < 0.001 sec.
Results:
------------
Document 1: 0.58 (How to reset a user password)
Document 4: 0.12 (Restoring access to a user account)
Document 2: 0 (Database connection error)
Document 3: 0 (Configuring SMTP for sending email)
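The two non-zero scores can be checked by hand. This check assumes that query terms absent from the corpus (i, cannot, recover, the, of) are dropped, so only password and user remain in the query vector. In every vector here, a term found in one document carries the factor ln 4 = 2 ln 2 and a term found in two documents carries ln 2, so rare terms weigh exactly twice as much as shared ones. Cosine similarity ignores overall scale, which lets each vector be written with weights 2 and 1:

```latex
\begin{aligned}
\mathbf{q} &\propto (\underbrace{2}_{\text{password}},\ \underbrace{1}_{\text{user}}),
  & \lVert\mathbf{q}\rVert &= \sqrt{2^2 + 1^2} = \sqrt{5} \\
\mathbf{d}_1 &\propto (2, 1, 2, 1, 1, 2),
  & \lVert\mathbf{d}_1\rVert &= \sqrt{3\cdot 2^2 + 3\cdot 1^2} = \sqrt{15} \\
\cos(\mathbf{q}, \mathbf{d}_1) &= \frac{2\cdot 2 + 1\cdot 1}{\sqrt{5}\,\sqrt{15}}
  = \frac{1}{\sqrt{3}} \approx 0.577 \\
\cos(\mathbf{q}, \mathbf{d}_4) &= \frac{1\cdot 1}{\sqrt{5}\,\sqrt{15}}
  = \frac{1}{\sqrt{75}} \approx 0.115
\end{aligned}
```

Rounded to two decimals, these give the 0.58 and 0.12 reported for documents 1 and 4. Document 4 has the same norm as document 1 but shares only user with the query. Documents 2 and 3 share no terms with the query, so their dot product, and therefore their score, is 0.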
Document vectors:
------------
Document 1:
Array
(
    [how] => 0.23104906018665
    [to] => 0.11552453009332
    [reset] => 0.23104906018665
    [a] => 0.11552453009332
    [user] => 0.11552453009332
    [password] => 0.23104906018665
)

Document 2:
Array
(
    [database] => 0.4620981203733
    [connection] => 0.4620981203733
    [error] => 0.4620981203733
)

Document 3:
Array
(
    [configuring] => 0.27725887222398
    [smtp] => 0.27725887222398
    [for] => 0.27725887222398
    [sending] => 0.27725887222398
    [email] => 0.27725887222398
)

Document 4:
Array
(
    [restoring] => 0.23104906018665
    [access] => 0.23104906018665
    [to] => 0.11552453009332
    [a] => 0.11552453009332
    [user] => 0.11552453009332
    [account] => 0.23104906018665
)

IDF:
------------
Array
(
    [how] => 1.3862943611199
    [to] => 0.69314718055995
    [reset] => 1.3862943611199
    [a] => 0.69314718055995
    [user] => 0.69314718055995
    [password] => 1.3862943611199
    [database] => 1.3862943611199
    [connection] => 1.3862943611199
    [error] => 1.3862943611199
    [configuring] => 1.3862943611199
    [smtp] => 1.3862943611199
    [for] => 1.3862943611199
    [sending] => 1.3862943611199
    [email] => 1.3862943611199
    [restoring] => 1.3862943611199
    [access] => 1.3862943611199
    [account] => 1.3862943611199
)
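The printed numbers are consistent with an unsmoothed IDF of ln(N / df) and a TF normalized by document length. Both formulas are inferred from the output, not confirmed from code-en.php, and the variable names below are chosen for illustration. A minimal sketch that reproduces the distinct values:

```php
<?php
// Assumed formulas, inferred from the printed output:
//   idf(t)      = ln(N / df(t))
//   tfidf(t, d) = (occurrences of t in d / tokens in d) * idf(t)

$n = 4;                    // documents in the corpus
$idfRare   = log($n / 1);  // term found in one document, e.g. "password"
$idfShared = log($n / 2);  // term found in two documents, e.g. "user"

printf("idf rare:   %.4f\n", $idfRare);    // idf rare:   1.3863
printf("idf shared: %.4f\n", $idfShared);  // idf shared: 0.6931

// Every term occurs once per document, so tf = 1 / (document length).
printf("doc 1, rare term (6 tokens):   %.4f\n", $idfRare / 6);    // 0.2310
printf("doc 1, shared term (6 tokens): %.4f\n", $idfShared / 6);  // 0.1155
printf("doc 2, rare term (3 tokens):   %.4f\n", $idfRare / 3);    // 0.4621
printf("doc 3, rare term (5 tokens):   %.4f\n", $idfRare / 5);    // 0.2773
```

Under these formulas, the terms to, a, and user get half the IDF of the others because they appear in both documents 1 and 4. Document 2 has the largest weights only because it is the shortest document, and cosine similarity cancels that length effect when vectors are compared.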