Case 1. Similar document search

Implementation in pure PHP

One of the most natural TF–IDF use cases is similar-text search. Here we build a mini search engine in pure PHP: tokenize the documents and the query, compute their TF–IDF vectors, compare them with cosine similarity, and return the documents closest to the user's query.
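The script in this section relies on helpers loaded from code-en.php, which is not reproduced here. The block below is only a minimal sketch of what such a file might contain, assuming TF is the term count divided by the document length, IDF is ln(N / df), and query terms unknown to the corpus are skipped. These assumptions are consistent with the printed output, but the function bodies and the name inverseDocumentFrequency are hypothetical, not the author's actual code. It needs PHP 7.4+ for arrow functions.

```php
<?php
// Hypothetical sketch of code-en.php; the real file may differ.

/** Lowercase the text and split it into alphanumeric tokens (ASCII only). */
function tokenize(string $text): array
{
    preg_match_all('/[a-z0-9]+/', strtolower($text), $matches);
    return $matches[0];
}

/** Relative term frequency: occurrences of a term divided by the token count. */
function termFrequency(array $tokens): array
{
    $total = count($tokens);
    $tf = [];
    foreach (array_count_values($tokens) as $term => $count) {
        $tf[$term] = $count / $total;
    }
    return $tf;
}

/** IDF as ln(N / df); df is the number of documents that contain the term. */
function inverseDocumentFrequency(array $tokenizedDocuments): array
{
    $n = count($tokenizedDocuments);
    $df = [];
    foreach ($tokenizedDocuments as $tokens) {
        foreach (array_unique($tokens) as $term) {
            $df[$term] = ($df[$term] ?? 0) + 1;
        }
    }
    $idf = [];
    foreach ($df as $term => $count) {
        $idf[$term] = log($n / $count);
    }
    return $idf;
}

/** TF-IDF vector; terms missing from the IDF table are skipped. */
function tfidf(array $tf, array $idf): array
{
    $vector = [];
    foreach ($tf as $term => $value) {
        if (isset($idf[$term])) {
            $vector[$term] = $value * $idf[$term];
        }
    }
    return $vector;
}

/** Cosine similarity of two sparse vectors keyed by term. */
function cosineSimilarity(array $a, array $b): float
{
    $dot = 0.0;
    foreach ($a as $term => $value) {
        $dot += $value * ($b[$term] ?? 0.0);
    }
    $normA = sqrt(array_sum(array_map(fn($v) => $v * $v, $a)));
    $normB = sqrt(array_sum(array_map(fn($v) => $v * $v, $b)));
    return ($normA > 0 && $normB > 0) ? $dot / ($normA * $normB) : 0.0;
}

// Corpus and precomputed vectors that the search script expects to exist.
$documents = [
    1 => 'How to reset a user password',
    2 => 'Database connection error',
    3 => 'Configuring SMTP for sending email',
    4 => 'Restoring access to a user account',
];

$tokenized = array_map('tokenize', $documents); // keys 1..4 are preserved
$idf = inverseDocumentFrequency($tokenized);

$documentVectors = [];
foreach ($tokenized as $id => $tokens) {
    $documentVectors[$id] = tfidf(termFrequency($tokens), $idf);
}
```

Keeping the vectors as sparse associative arrays keyed by term means a document only stores the words it contains, and the dot product only touches terms present in the query.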

 
<?php

include 'code-en.php';

$query = 'I cannot recover the password of user';

// Turn the query into a TF–IDF vector using the corpus IDF table.
$queryTokens = tokenize($query);
$queryTf = termFrequency($queryTokens);
$queryVector = tfidf($queryTf, $idf);

$results = [];

// Score every document against the query.
foreach ($documentVectors as $id => $vector) {
    $results[$id] = cosineSimilarity($queryVector, $vector);
}

// Sort by score, highest first, keeping the document IDs as keys.
arsort($results);

echo 'Results:' . PHP_EOL . '------------' . PHP_EOL;

foreach ($results as $id => $score) {
    echo 'Document ' . $id . ': ' . round($score, 2) . ' (' . $documents[$id] . ')' . PHP_EOL;
}
echo PHP_EOL . PHP_EOL;


echo 'Document vectors:' . PHP_EOL . '------------' . PHP_EOL;

foreach ($documentVectors as $id => $vector) {
    echo 'Document ' . $id . ': ' . PHP_EOL;
    print_r($vector);
    echo PHP_EOL;
}
echo PHP_EOL;

echo 'IDF:' . PHP_EOL . '------------' . PHP_EOL;
print_r($idf);

Documents:

1. How to reset a user password
2. Database connection error
3. Configuring SMTP for sending email
4. Restoring access to a user account

Query: I cannot recover the password of user
Result (memory: 0.017 MB, running time: < 0.001 s):
Results:
------------
Document 1: 0.58 (How to reset a user password)
Document 4: 0.12 (Restoring access to a user account)
Document 2: 0 (Database connection error)
Document 3: 0 (Configuring SMTP for sending email)


Document vectors:
------------
Document 1: 
Array
(
    [how] => 0.23104906018665
    [to] => 0.11552453009332
    [reset] => 0.23104906018665
    [a] => 0.11552453009332
    [user] => 0.11552453009332
    [password] => 0.23104906018665
)

Document 2: 
Array
(
    [database] => 0.4620981203733
    [connection] => 0.4620981203733
    [error] => 0.4620981203733
)

Document 3: 
Array
(
    [configuring] => 0.27725887222398
    [smtp] => 0.27725887222398
    [for] => 0.27725887222398
    [sending] => 0.27725887222398
    [email] => 0.27725887222398
)

Document 4: 
Array
(
    [restoring] => 0.23104906018665
    [access] => 0.23104906018665
    [to] => 0.11552453009332
    [a] => 0.11552453009332
    [user] => 0.11552453009332
    [account] => 0.23104906018665
)


IDF:
------------
Array
(
    [how] => 1.3862943611199
    [to] => 0.69314718055995
    [reset] => 1.3862943611199
    [a] => 0.69314718055995
    [user] => 0.69314718055995
    [password] => 1.3862943611199
    [database] => 1.3862943611199
    [connection] => 1.3862943611199
    [error] => 1.3862943611199
    [configuring] => 1.3862943611199
    [smtp] => 1.3862943611199
    [for] => 1.3862943611199
    [sending] => 1.3862943611199
    [email] => 1.3862943611199
    [restoring] => 1.3862943611199
    [access] => 1.3862943611199
    [account] => 1.3862943611199
)
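
The scores above can be checked by hand. The derivation below is a sketch under assumptions that match the printed values but are not confirmed by the listing itself: TF is the term count divided by the number of tokens, IDF is ln(N / df) with N = 4, and the five query tokens that never occur in the corpus (i, cannot, recover, the, of) contribute nothing to the query vector, although they still count toward its length of 7 tokens. The symbols q, d_1, d_4, and \ell = \ln 2 are introduced here only for illustration.

```latex
\begin{align*}
\mathrm{idf}(\text{password}) &= \ln\tfrac{4}{1} = 2\ell, \qquad
\mathrm{idf}(\text{user}) = \ln\tfrac{4}{2} = \ell, \qquad \ell = \ln 2 \approx 0.6931 \\[4pt]
q &= \bigl(q_{\text{password}},\, q_{\text{user}}\bigr)
   = \Bigl(\tfrac{2\ell}{7},\, \tfrac{\ell}{7}\Bigr),
   \qquad \lVert q \rVert = \tfrac{\ell}{7}\sqrt{5} \\[4pt]
\lVert d_1 \rVert &= \lVert d_4 \rVert
   = \sqrt{3\bigl(\tfrac{2\ell}{6}\bigr)^2 + 3\bigl(\tfrac{\ell}{6}\bigr)^2}
   = \ell\sqrt{\tfrac{5}{12}} \\[4pt]
q \cdot d_1 &= \tfrac{2\ell}{7}\cdot\tfrac{2\ell}{6} + \tfrac{\ell}{7}\cdot\tfrac{\ell}{6}
   = \tfrac{5\ell^2}{42},
   \qquad
   \cos(q, d_1) = \frac{5/42}{\tfrac{\sqrt{5}}{7}\sqrt{\tfrac{5}{12}}}
   = \frac{1}{\sqrt{3}} \approx 0.58 \\[4pt]
q \cdot d_4 &= \tfrac{\ell}{7}\cdot\tfrac{\ell}{6} = \tfrac{\ell^2}{42},
   \qquad
   \cos(q, d_4) = \frac{1/42}{\tfrac{\sqrt{5}}{7}\sqrt{\tfrac{5}{12}}}
   = \frac{\sqrt{3}}{15} \approx 0.12
\end{align*}
```

Document 1 shares both password and user with the query, while Document 4 shares only the lower-weighted user, which is why its score is five times smaller. Documents 2 and 3 share no terms with the query, so their dot product and score are 0.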