TF IDF Analysis With Python

How To Do TF IDF Analysis On WordPress Websites With Python

In this post, we’re going to perform a TF-IDF analysis on WordPress blog posts using Python. This is one of the oldest methods of understanding what an article is talking about.

TF stands for Term Frequency and IDF stands for Inverse Document Frequency. The score rewards words that appear often in one article but rarely in the others, which makes it useful for extracting topical terms while down-weighting stop words.

However useful this might be, I’d still recommend that copywriters focus on creating user-friendly content rather than optimizing to satisfy the numbers.

Extracting TF IDF Scores With Python

Alright, let’s get to the coding part of this guide. We’ll make a Python script that extracts TF-IDF scores from WordPress blogs. I’ll also explain along the way what each part of the code does and how it plays into the whole analysis.

First of all, like with any other Python project, we need to import all the necessary modules. We’ll also set a constant for the directory that contains the script, which we’ll use later on for outputting data into an Excel file.

import os
import requests
import argparse
import math
import pandas as pd
from datetime import datetime
from bs4 import BeautifulSoup
from textblob import TextBlob as tb
from urllib.parse import urlparse

ROOT = os.path.dirname(__file__)

Fetching data

Alright, now that we have all our tools on the table, we need to fetch the data from WordPress. Because WordPress already has a built-in REST API endpoint for posts, we can use a GET request to fetch their content. We don’t need any credentials, because we’re fetching publicly available data.

The following method fetches the posts page by page, sorts them by date (newest first), and returns a dictionary containing each post’s link and content. We’ll process the content further later on.

def get_posts(url):

    # Tolerate a missing trailing slash in the supplied URL
    api_url = url.rstrip('/') + '/wp-json/wp/v2/posts'
    posts = []
    page = 1

    while True:
        response = requests.get(api_url, params={
            'page': page,
            'per_page': 100
        })

        if not response.status_code == 200:
            break
            
        posts += response.json()
        page += 1
    
    extracted = {}
    for p in posts:
        extracted[datetime.strptime(p['date'], '%Y-%m-%dT%H:%M:%S')] = {
            'link': p['link'],
            'content': p['content']['rendered'],
        }
    
    keys = list(extracted.keys())
    keys.sort(reverse=True)
    extracted = {i: extracted[i] for i in keys}
    
    return extracted
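
To make the extraction step easier to follow without hitting a live site, here is a minimal sketch that runs the same date parsing and sorting on a hand-written sample. The `sample_posts` list and the `extract` helper are illustrative assumptions; the sample only mimics the three fields the script reads from the real response.

```python
from datetime import datetime

# Hypothetical sample, trimmed to the fields the script actually uses.
sample_posts = [
    {'date': '2021-03-01T09:30:00', 'link': 'https://example.com/a/',
     'content': {'rendered': '<p>First</p>'}},
    {'date': '2021-05-12T14:00:00', 'link': 'https://example.com/b/',
     'content': {'rendered': '<p>Second</p>'}},
]

def extract(posts):
    """Key each post by its publication date and order them newest first."""
    extracted = {
        datetime.strptime(p['date'], '%Y-%m-%dT%H:%M:%S'): {
            'link': p['link'],
            'content': p['content']['rendered'],
        }
        for p in posts
    }
    # Sort on the datetime key only; dicts keep insertion order.
    return dict(sorted(extracted.items(), key=lambda item: item[0], reverse=True))

result = extract(sample_posts)
for date, post in result.items():
    print(date.isoformat(), post['link'])
```

One thing to keep in mind with this layout: because the date is the dictionary key, two posts published in the same second would overwrite each other.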

TF IDF Utility Methods

Next, we’ll define some methods, which we’ll use to calculate the TF-IDF scores later when we put everything together.

First, we’ll define a method for calculating term frequency (TF). The following method returns the number of times a certain word appears in the text, divided by the total number of words in that text.

def tf(word, blob):
    return blob.words.count(word) / len(blob.words)
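
As a quick sanity check of the formula, here is a small sketch that uses a plain list of words instead of a TextBlob, so it runs without any extra packages. Splitting on whitespace is an assumption made for the example; TextBlob tokenizes more carefully.

```python
def tf(word, words):
    """Share of the words in a text that are equal to `word`."""
    return words.count(word) / len(words)

# Six words in total, two of which are 'the'.
words = 'the cat sat on the mat'.split()

print(tf('the', words))   # 2 occurrences out of 6 words
print(tf('cat', words))   # 1 occurrence out of 6 words
```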

The next utility method counts how many articles contain a certain word. We’ll need this function to calculate the Inverse Document Frequency (IDF).

def n_containing(word, bloblist):
    return sum(1 for blob in bloblist if word in blob.words)

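With that, we can now calculate the IDF, which the following function takes care of. It takes the logarithm of the number of articles divided by one plus the number of articles that contain the word; the added one prevents division by zero for words that appear nowhere.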

def idf(word, bloblist):
    return math.log(len(bloblist) / (1 + n_containing(word, bloblist)))
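
To see how this behaves, the sketch below applies the same formula to a tiny made-up corpus of word lists (an assumption for the example, in place of TextBlob objects). It shows that a word found in every article ends up with a negative IDF, which pushes it to the bottom of the ranking.

```python
import math

def n_containing(word, docs):
    """Number of documents that contain `word`."""
    return sum(1 for words in docs if word in words)

def idf(word, docs):
    """Same formula as above, applied to plain word lists."""
    return math.log(len(docs) / (1 + n_containing(word, docs)))

# Hypothetical three-article corpus.
docs = [
    ['python', 'blog'],
    ['python', 'seo'],
    ['python', 'blog', 'tips'],
]

print(idf('python', docs))  # in 3 of 3 articles: log(3/4), negative
print(idf('blog', docs))    # in 2 of 3 articles: log(3/3), zero
print(idf('seo', docs))     # in 1 of 3 articles: log(3/2), positive
```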

And finally, we can use all of the methods we defined above to calculate the TF-IDF score.

def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)

Putting it all to use

Now that we have everything ready, we can piece the whole algorithm together. We’ll use an argument parser, so the script can accept the URL of the WordPress blog when we run it from a command prompt.

Let’s first define the argument parser and the variables that we’ll use later on.

parser = argparse.ArgumentParser(description='Extract relevant keywords using TF-IDF analysis.')
parser.add_argument('url', help='Supply WordPress website url.')
args = parser.parse_args()
url = args.url
base = urlparse(url).netloc
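
If you want to try the parsing step without a terminal, the sketch below passes the arguments as a list, which stands in for the real command line. The `example.com` address is a placeholder. It also shows why the URL should include the scheme: without it, `urlparse` returns an empty `netloc`, and the output file name would lose the domain.

```python
import argparse
from urllib.parse import urlparse

parser = argparse.ArgumentParser(description='Extract relevant keywords using TF-IDF analysis.')
parser.add_argument('url', help='Supply WordPress website url.')

# A list passed to parse_args() replaces sys.argv for this demonstration.
args = parser.parse_args(['https://example.com/'])
base = urlparse(args.url).netloc
print(base)

# Without "https://" the domain is treated as a path, so netloc is empty.
no_scheme = urlparse('example.com/').netloc
print(repr(no_scheme))
```

From a command prompt, the equivalent call would look like `python tfidf.py https://example.com/`, where `tfidf.py` is just an assumed name for the script file.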

Next, we’re going to fetch the posts and process their content, extracting the text from the paragraph (p) tags and converting it to lowercase.

posts = get_posts(url)
for p in posts:
    content = ''
    soup = BeautifulSoup(posts[p]['content'], 'html.parser')
    for x in soup.find_all('p'):
        content += ' ' + x.text.lower()

    posts[p]['content'] = content
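
To illustrate what this step keeps and what it drops, here is a rough stand-in that uses only the standard library’s `html.parser` rather than BeautifulSoup. It is not the approach used in the script, only a sketch of the same idea; the `ParagraphText` class and the sample HTML are made up for the example, and unlike the script it also collapses repeated whitespace.

```python
from html.parser import HTMLParser

class ParagraphText(HTMLParser):
    """Collect lowercased text that appears inside <p> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # greater than 0 while inside a <p> element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.depth += 1
            self.parts.append(' ')   # keep paragraphs separated

    def handle_endtag(self, tag):
        if tag == 'p' and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data.lower())

html = ('<h2>Heading</h2><p>Hello <b>World</b></p>'
        '<ul><li>Skipped</li></ul><p>Second paragraph</p>')

extractor = ParagraphText()
extractor.feed(html)
extractor.close()

# Join the pieces and normalize whitespace.
content = ' '.join(''.join(extractor.parts).split())
print(content)
```

Headings and list items are left out because only text inside p tags is collected, so posts that keep important terms in headings or lists will not have them scored.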

Now that we have our data processed and ready, we can calculate the TF-IDF scores, keep the five highest-scoring words for each post, and format the data for outputting it into an Excel file.

list_content = [tb(posts[p]['content']) for p in posts]
list_words_scores = []
for p in posts:
    blob = tb(posts[p]['content'])
    scores = {word: tfidf(word, blob, list_content) for word in blob.words}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    for word, score in sorted_words[:5]:
        list_words_scores.append([posts[p]['link'], word, score])
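
The ranking step can be tried on a toy corpus without TextBlob or a live blog. The sketch below reuses the same formulas on plain word lists; the three sentences and the `top_words` helper are assumptions made for the example.

```python
import math

def tf(word, words):
    return words.count(word) / len(words)

def n_containing(word, docs):
    return sum(1 for words in docs if word in words)

def idf(word, docs):
    return math.log(len(docs) / (1 + n_containing(word, docs)))

def top_words(words, docs, k=5):
    """Score every distinct word of one document and keep the k best."""
    scores = {w: tf(w, words) * idf(w, docs) for w in words}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]

# Hypothetical corpus: each "post" is a list of lowercase words.
docs = [
    'python makes text analysis simple'.split(),
    'wordpress makes blogging simple'.split(),
    'wordpress powers many blogs'.split(),
]

top = top_words(docs[0], docs, k=3)
for word, score in top:
    print(f'{word}: {score:.3f}')
```

In the first document, "makes" and "simple" also appear in the second one, so they score lower than "python", "text" and "analysis", which are unique to it.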

And finally, we can output the data and generate the Excel file containing it.

df = pd.DataFrame(list_words_scores)
df.to_excel(os.path.join(ROOT, f'{base} - TFIDF.xlsx'), header=['URL', 'Word', 'TF-IDF score'], index=False)

Here is the entire code of the project.

import os
import requests
import argparse
import math
import pandas as pd
from datetime import datetime
from bs4 import BeautifulSoup
from textblob import TextBlob as tb
from urllib.parse import urlparse

ROOT = os.path.dirname(__file__)

def get_posts(url):

    # Tolerate a missing trailing slash in the supplied URL
    api_url = url.rstrip('/') + '/wp-json/wp/v2/posts'
    posts = []
    page = 1

    while True:
        response = requests.get(api_url, params={
            'page': page,
            'per_page': 100
        })

        if not response.status_code == 200:
            break
            
        posts += response.json()
        page += 1
    
    extracted = {}
    for p in posts:
        extracted[datetime.strptime(p['date'], '%Y-%m-%dT%H:%M:%S')] = {
            'link': p['link'],
            'content': p['content']['rendered'],
        }
    
    keys = list(extracted.keys())
    keys.sort(reverse=True)
    extracted = {i: extracted[i] for i in keys}
    
    return extracted

def tf(word, blob):
    return blob.words.count(word) / len(blob.words)

def n_containing(word, bloblist):
    return sum(1 for blob in bloblist if word in blob.words)

def idf(word, bloblist):
    return math.log(len(bloblist) / (1 + n_containing(word, bloblist)))

def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)

if __name__ == "__main__":

    parser = argparse.ArgumentParser(description='Extract relevant keywords using TF-IDF analysis.')
    parser.add_argument('url', help='Supply WordPress website url.')
    args = parser.parse_args()
    url = args.url
    base = urlparse(url).netloc
    
    posts = get_posts(url)
    for p in posts:
        content = ''
        soup = BeautifulSoup(posts[p]['content'], 'html.parser')
        for x in soup.find_all('p'):
            content += ' ' + x.text.lower()
        
        posts[p]['content'] = content
    
    list_content = [tb(posts[p]['content']) for p in posts]
    list_words_scores = []
    for p in posts:
        blob = tb(posts[p]['content'])
        scores = {word: tfidf(word, blob, list_content) for word in blob.words}
        sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
        for word, score in sorted_words[:5]:
            list_words_scores.append([posts[p]['link'], word, score])

    df = pd.DataFrame(list_words_scores)
    df.to_excel(os.path.join(ROOT, f'{base} - TFIDF.xlsx'), header=['URL', 'Word', 'TF-IDF score'], index=False)

Conclusion

To conclude, we made a Python script for calculating TF-IDF scores for WordPress blogs. I’d also like to note that the more posts it takes into account, the more accurate the scores will be.

I learned a lot while working on this project and I hope you’ll find it helpful as well. I would also appreciate it if you considered sharing this post with others and following my blog to be notified when I publish a post in the future.