TF-IDF
Term Frequency-Inverse Document Frequency
Numerical statistic used to evaluate the importance of a word within a document relative to a collection of documents.
- TF-IDF is a term-weighting scheme used in natural language processing (NLP) and information retrieval to quantify how relevant a word is to a given document relative to a larger corpus. It multiplies two metrics: Term Frequency (TF), which measures how often a word appears in the document, and Inverse Document Frequency (IDF), which down-weights words that appear in many documents of the corpus. As a result, distinctive terms (such as unique keywords) score higher than frequent but less informative ones (like "the" or "and"). It is widely used in search engines, document ranking, and text mining applications as a baseline for measuring the significance of terms.
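- The two components of TF-IDF have separate origins: term-frequency weighting goes back to Hans Peter Luhn's work in the late 1950s, and inverse document frequency was introduced by Karen Spärck Jones in 1972. Gerard Salton and his collaborators combined and popularized the weighting during the 1970s and 1980s. It gained broader use in the 1990s with the rise of web search engines, where it became a fundamental concept for ranking search results by keyword relevance.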
- Gerard Salton, a pioneering figure in information retrieval, is most often credited with developing and popularizing TF-IDF weighting, notably through the SMART information retrieval system that he led. Karen Spärck Jones is credited with the IDF component. Their work laid the foundation for many modern text retrieval systems.
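
To make the TF and IDF components described above concrete, here is a minimal Python sketch. It assumes one common textbook variant: TF as a term's count divided by the document length, and IDF as the natural log of (number of documents / number of documents containing the term). Production libraries often use smoothed or normalized variants, so their scores will differ. The function name `tf_idf` and the toy corpus are illustrative only.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length,
    idf = ln(N / document frequency). No smoothing is applied.
    """
    n_docs = len(docs)
    # Document frequency: how many documents contain each term at least once.
    df = Counter(term for doc in docs for term in set(doc))

    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Toy corpus, already lowercased and split on whitespace.
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]
scores = tf_idf(corpus)

# Print the terms of the first document from highest to lowest weight.
for term, weight in sorted(scores[0].items(), key=lambda kv: -kv[1]):
    print(f"{term:>4}  {weight:.4f}")
```

In this toy corpus "the" appears in every document, so its IDF is ln(3/3) = 0 and its weight vanishes, while "mat", which occurs in only one document, receives the highest weight in the first document.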