Bag of Words and TF–IDF

In the previous chapters we treated text as data and noted that a computer cannot read words the way a human does. To a computer, text is just symbols, numbers, and statistics. In this section we cover two basic yet still extremely useful ways to represent text as numbers: Bag of Words and TF–IDF.

  • A simple TF–IDF example in PHP
  • Case 1. Similar document search
  • Case 2. Review classification: "positive / negative"
  • Case 3. Automatic article categorization
  • Case 4. "Spam" detector for contact forms
  • Case 5. Explainable search: "why this document?"
  • Case 6. Comparison: Bag of Words vs TF–IDF on one example