Bag of Words and TF–IDF
In the previous chapters we treated text as data and noted that a computer cannot read words the way a human does. To a machine, text is a set of symbols, numbers, and statistics. This section covers two basic but still very useful ways of representing text as numbers: Bag of Words and TF–IDF.
- A simple TF–IDF example in PHP
- Case 1. Similar document search
- Case 2. Review classification: "positive / negative"
- Case 3. Automatic article categorization
- Case 4. "Spam" detector for contact forms
- Case 5. Explainable search: "why this document?"
- Case 6. Comparison: Bag of Words vs TF–IDF on one example