TF-IDF stands for term frequency–inverse document frequency, a formula that measures how important a word is to a document within a corpus. It counts the number of times the word appears in the document (term frequency) and weights that count by the inverse document frequency: the inverse of the proportion of documents in the corpus that contain the word, which reflects how rare the word is across the corpus.

Multiplying these two quantities gives a TF-IDF score. The higher the score, the more relevant the word is to the document.
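As a rough illustration, the product can be written as follows. The notation is an assumption made here, not taken from the text: tf(t, d) is the number of times term t appears in document d, N is the number of documents in the corpus, and df(t) is the number of documents containing t. The logarithm is a common convention; implementations differ in log base and smoothing.

```latex
\[
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \mathrm{idf}(t),
\qquad
\mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)}
\]
% Worked example with made-up numbers: a word appears 3 times in a document,
% and 10 of the 100 documents in the corpus contain it.
\[
\mathrm{tfidf} = 3 \times \ln \frac{100}{10} = 3 \times \ln 10 \approx 6.91
\]
```

A word that appears in every document gets idf = log(1) = 0, so its score is zero no matter how often it occurs.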

When it comes to keyword extraction, this metric can help you identify the most relevant words in a piece of content (those that scored the highest) and consider them as keywords. This can be particularly useful for tasks like tagging support tickets or analyzing customer feedback.
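The idea can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a production implementation: the function names `tokenize` and `tfidf_keywords`, the sample tickets, and the unsmoothed `log(N / df)` weighting are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def tfidf_keywords(docs, doc_index, top_n=3):
    """Return the top_n (word, score) pairs for docs[doc_index]."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each word.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    # Term frequency in the target document, weighted by log(N / df).
    tf = Counter(tokenized[doc_index])
    scores = {w: count * math.log(n_docs / df[w]) for w, count in tf.items()}

    # Highest score first; ties are broken alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Hypothetical support tickets, invented for this example.
tickets = [
    "the app crashes when opening the settings page",
    "the invoice page shows the wrong total",
    "the app is slow after the update",
]

for word, score in tfidf_keywords(tickets, 0):
    print(f"{word}: {score:.3f}")
```

In this tiny corpus, "the" appears in every ticket and scores zero, while words unique to the first ticket rank highest. With so few documents many words tie, so real use needs a larger corpus.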

In most of these cases, the words that appear most frequently in a set of documents are not necessarily the most relevant. Conversely, a word that appears in a single text, but not in the other documents, may be very important for understanding the content of that text.

TF-IDF for SEO?

Search engines sometimes use the TF-IDF model in addition to other factors.

Does the TF-IDF method provide enough information to optimize your content writing? Not at all.

This methodology is over 50 years old and plays a very limited role in the operation of Google’s search algorithms. It is not cutting-edge technology.

To find out more, you can consult the article dedicated to TF-IDF.

RAKE

Rapid Automatic Keyword Extraction (RAKE) is a well-known keyword extraction method that uses a list of stop words and phrase delimiters to detect the most relevant words or phrases in a text.

Take the following text as an example:

Following the invasion of Stargate by aliens, Colonel Jack O’Neill is called to the rescue. Stargate SG-1 is then formed and sent to explore all these new worlds.

The first thing the method does is divide the text into a list of words and remove stop words from this list. This results in a list containing so-called content words.
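This first step might look like the sketch below. The stop word list is a hypothetical one chosen for this example, so the output will not necessarily match the lists the article refers to; `content_words` is likewise just an illustrative name.

```python
import re

# Hypothetical stop word list, chosen for this example only.
STOP_WORDS = {"following", "the", "of", "by", "is", "to",
              "then", "and", "all", "these"}


def content_words(text, stop_words=STOP_WORDS):
    """Split text into words and drop those found in the stop word list."""
    # Keep letters, digits, apostrophes and hyphens together ("SG-1", "O’Neill").
    words = re.findall(r"[A-Za-z0-9'’-]+", text)
    return [w for w in words if w.lower() not in stop_words]


text = ("Following the invasion of Stargate by aliens, Colonel Jack O’Neill "
        "is called to the rescue. Stargate SG-1 is then formed and sent to "
        "explore all these new worlds.")

print(content_words(text))
```

The number of content words depends entirely on which stop words are listed: a longer stop list leaves fewer content words.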

Let’s say our list of stop words and phrase delimiters looks like this:

Our list of 8 content words will look like this:


Then, the algorithm splits the text at phrase delimiters and stop words to create candidate phrases. In our case, the key phrases would be:
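A minimal sketch of this splitting step is shown here. It assumes a hypothetical stop word list and treats common punctuation marks as phrase delimiters, so the phrases it produces are illustrative and may differ from the article's own list.

```python
import re

# Hypothetical stop word list, chosen for this example only.
STOP_WORDS = {"following", "the", "of", "by", "is", "to",
              "then", "and", "all", "these"}


def candidate_phrases(text, stop_words=STOP_WORDS):
    """Split text into phrases, cutting at punctuation and at stop words."""
    phrases = []
    # Punctuation marks act as phrase delimiters.
    for fragment in re.split(r"[.,;:!?]", text):
        current = []
        for word in fragment.split():
            if word.lower() in stop_words:
                # A stop word ends the phrase collected so far.
                if current:
                    phrases.append(" ".join(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(" ".join(current))
    return phrases


text = ("Following the invasion of Stargate by aliens, Colonel Jack O’Neill "
        "is called to the rescue. Stargate SG-1 is then formed and sent to "
        "explore all these new worlds.")

for phrase in candidate_phrases(text):
    print(phrase)
```

Runs of consecutive content words, such as "Colonel Jack O’Neill" or "new worlds", survive as multi-word phrases because no stop word or delimiter falls between them.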


After dividing the text in this way, the algorithm creates a table of word co-occurrences. Each row indicates the number of times a given content word co-occurs with each other content word within the candidate phrases.
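To make the table concrete, here is a small sketch that builds it from a list of candidate phrases. The phrases are hard-coded and hypothetical, and counting a word as co-occurring with itself (the diagonal of the table) is a convention assumed here.

```python
from collections import defaultdict

# Hypothetical candidate phrases, lowercased for counting.
phrases = ["colonel jack", "stargate sg-1", "stargate", "new worlds"]


def cooccurrence_table(phrases):
    """Count how often each pair of words appears in the same phrase."""
    table = defaultdict(lambda: defaultdict(int))
    for phrase in phrases:
        words = phrase.split()
        for w1 in words:
            for w2 in words:
                # The diagonal (w1 == w2) counts the word's own occurrences.
                table[w1][w2] += 1
    return table


table = cooccurrence_table(phrases)
for word, row in table.items():
    print(word, dict(row))
```

Here "stargate" co-occurs once with "sg-1", and its diagonal entry is 2 because it appears in two phrases.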

Once this table is constructed, each word is given a score. The score can be the degree of the word in the table (i.e., the sum of the number of times the word co-occurs with any content word), the word frequency (i.e., the number of times the word appears in the text), or the degree divided by the frequency.
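The scoring step can be sketched as follows. RAKE scores are commonly described in terms of a word's degree and its frequency, so this example computes both, plus their ratio, from a hard-coded list of hypothetical candidate phrases. The function name `word_scores` and the choice to count a word as co-occurring with itself are assumptions.

```python
from collections import Counter

# Hypothetical candidate phrases, lowercased for counting.
phrases = ["colonel jack", "stargate sg-1", "stargate", "new worlds"]


def word_scores(phrases):
    """Return degree, frequency and degree/frequency for every word."""
    frequency = Counter()
    degree = Counter()
    for phrase in phrases:
        words = phrase.split()
        for word in words:
            frequency[word] += 1
            # The word co-occurs with every word of its phrase, itself included.
            degree[word] += len(words)
    return {
        word: {
            "degree": degree[word],
            "frequency": frequency[word],
            "ratio": degree[word] / frequency[word],
        }
        for word in frequency
    }


for word, scores in word_scores(phrases).items():
    print(word, scores)
```

Frequency favors words that occur often, degree favors words that occur often and in longer phrases, and the ratio favors words that mostly occur inside longer phrases.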

 
