NLP Text Preprocessing

Remove duplicates and extra whitespace with Pure PHP

Data cleanliness directly affects model accuracy and performance. Machine learning models, especially in Natural Language Processing (NLP), are highly sensitive to inconsistencies in their input data. If the same word appears with different surrounding whitespace (like "word", " word", or "word "), a pipeline that does not normalize whitespace may treat each variant as a separate entry. Similarly, duplicated records in structured datasets can bias model training and distort predictions.
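To make the problem concrete, here is a minimal sketch in plain PHP. The token list and the review records are made-up illustration data, not part of any library; the sketch only shows how untrimmed variants inflate a vocabulary count and how an exact duplicate record can be dropped.

```php
<?php
declare(strict_types=1);

// Hypothetical token list: the same word with different surrounding whitespace.
$tokens = ['word', ' word', 'word ', 'word'];

// Without cleaning, every formatting variant becomes its own vocabulary entry.
$rawCounts = array_count_values($tokens);

// After trimming, the variants collapse into a single entry.
$cleanCounts = array_count_values(array_map('trim', $tokens));

echo count($rawCounts), PHP_EOL;    // prints 3
echo count($cleanCounts), PHP_EOL;  // prints 1
echo $cleanCounts['word'], PHP_EOL; // prints 4

// Hypothetical structured dataset containing one exact duplicate record.
$records = [
    ['text' => 'good product', 'label' => 'pos'],
    ['text' => 'bad service',  'label' => 'neg'],
    ['text' => 'good product', 'label' => 'pos'],
];

// Keep the first occurrence of each record, using its serialized form as the key.
$unique = [];
foreach ($records as $record) {
    $unique[serialize($record)] ??= $record;
}
$unique = array_values($unique);

echo count($unique), PHP_EOL; // prints 2
```

The serialized-key approach only catches records that are identical field by field; near-duplicates (for example, records that differ only in whitespace) would need to be normalized first.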

 
<?php
// Use the TextPreprocessor class.
// Assumes the library is already autoloaded (for example through Composer).
use Apphp\MLKit\NLP\Preprocessors\TextPreprocessor;

/**
 * Example of using TextPreprocessor to remove duplicates and trim whitespace
 */
function demonstrateTextPreprocessing() {
    // Create sample text with duplicate words and excess whitespace
    $sampleText = '  This is   a sample   text with With with duplicate duplicate words   and  excess    whitespace.
    This is a repeated sentence. This is a repeated sentence.  ';

    echo "<b>Original Text:</b>";
    echo '<br>' . htmlspecialchars($sampleText) . '<br><br>';
    echo '----------------------<br><br>';

    // Create an instance of TextPreprocessor
    $preprocessor = new TextPreprocessor();

    // Example 1: Trim whitespace only
    $trimmedText = $preprocessor->trimWhitespace($sampleText);
    echo "<b>After Trimming Whitespace:</b>";
    echo '<br>' . htmlspecialchars($trimmedText) . '<br><br>';

    // Example 2: Remove duplicate words (case-insensitive)
    // Named arguments require PHP 8.0 or later
    $noDuplicatesText = $preprocessor->removeDuplicates($sampleText, caseSensitive: false);
    echo "<b>After Removing Duplicate Words (case-insensitive):</b>";
    echo '<br>' . htmlspecialchars($noDuplicatesText) . '<br><br>';

    // Example 3: Remove duplicate words (case-sensitive)
    $noDuplicatesCaseSensitiveText = $preprocessor->removeDuplicates($sampleText, caseSensitive: true);
    echo "<b>After Removing Duplicate Words (case-sensitive):</b>";
    echo '<br>' . htmlspecialchars($noDuplicatesCaseSensitiveText) . '<br><br>';

    // Example 4: Remove duplicate sentences
    $noDuplicateSentencesText = $preprocessor->removeDuplicateSentences($sampleText);
    echo "<b>After Removing Duplicate Sentences:</b>";
    echo '<br>' . htmlspecialchars($noDuplicateSentencesText) . '<br><br>';

    // Example 5: Process with multiple options
    $processedText = $preprocessor->process($sampleText, [
        'trimWhitespace' => true,
        'removeDuplicateSentences' => true,
        'removeDuplicateWords' => true,
        'caseSensitive' => false,
        'withinSentencesOnly' => true,
    ]);
    echo "<b>After Full Processing:</b>";
    echo '<br>' . htmlspecialchars($processedText) . '<br><br>';
}

// Call the function to demonstrate text preprocessing
demonstrateTextPreprocessing();

Result:
Memory: 0.021 MB. Running time: 0.002 sec.

Original Text:
This is a sample text with With with duplicate duplicate words and excess whitespace. This is a repeated sentence. This is a repeated sentence.

----------------------

After Trimming Whitespace:
This is a sample text with With with duplicate duplicate words and excess whitespace. This is a repeated sentence. This is a repeated sentence.

After Removing Duplicate Words (case-insensitive):
This is a sample text with duplicate words and excess whitespace. repeated sentence.

After Removing Duplicate Words (case-sensitive):
This is a sample text with With duplicate words and excess whitespace. repeated sentence.

After Removing Duplicate Sentences:
This is a sample text with With with duplicate duplicate words and excess whitespace. This is a repeated sentence.

After Full Processing:
This is a sample text with duplicate words and excess whitespace. This is a repeated sentence.
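
The internals of TextPreprocessor are not shown here, so the following is only a rough sketch of how comparable behavior could be written in plain PHP. All function names (sketchTrimWhitespace, sketchRemoveDuplicateWords, and so on) are hypothetical and are not the library's API. The sketch assumes that words are whitespace-separated tokens with punctuation attached, that sentences end in ".", "!" or "?", and that lowercasing only needs to cover ASCII.

```php
<?php
declare(strict_types=1);

/** Collapse runs of whitespace into single spaces and trim both ends. */
function sketchTrimWhitespace(string $text): string
{
    // preg_replace returns null on failure (e.g. invalid UTF-8); fall back to the input.
    return trim(preg_replace('/\s+/u', ' ', $text) ?? $text);
}

/** Keep only the first occurrence of each whitespace-separated token. */
function sketchRemoveDuplicateWords(string $text, bool $caseSensitive = false): string
{
    $seen = [];
    $kept = [];
    $words = preg_split('/\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY) ?: [];
    foreach ($words as $word) {
        // Punctuation stays attached, so "here" and "here." count as different tokens.
        $key = $caseSensitive ? $word : strtolower($word);
        if (!isset($seen[$key])) {
            $seen[$key] = true;
            $kept[] = $word;
        }
    }
    return implode(' ', $kept);
}

/** Split normalized text after ".", "!" or "?" followed by whitespace. */
function sketchSplitSentences(string $text): array
{
    $clean = sketchTrimWhitespace($text);
    return preg_split('/(?<=[.!?])\s+/u', $clean, -1, PREG_SPLIT_NO_EMPTY) ?: [];
}

/** Keep only the first occurrence of each sentence. */
function sketchRemoveDuplicateSentences(string $text): string
{
    return implode(' ', array_unique(sketchSplitSentences($text)));
}

/** Drop duplicate sentences, then drop duplicate words inside each remaining sentence. */
function sketchProcess(string $text): string
{
    $sentences = array_unique(sketchSplitSentences($text));
    return implode(' ', array_map(
        static fn (string $s): string => sketchRemoveDuplicateWords($s),
        $sentences
    ));
}

$sample = "  Same   same word here. Same here. Same here.  ";

echo sketchTrimWhitespace($sample), PHP_EOL;           // Same same word here. Same here. Same here.
echo sketchRemoveDuplicateWords($sample), PHP_EOL;     // Same word here.
echo sketchRemoveDuplicateSentences($sample), PHP_EOL; // Same same word here. Same here.
echo sketchProcess($sample), PHP_EOL;                  // Same word here. Same here.
```

The difference between the second and fourth lines mirrors the results above: deduplicating words across the whole text strips repeated words out of later sentences, while deduplicating within each sentence keeps every sentence readable.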