Duplicate Content Checker Using Longest Matching Subsequences With Python

How To Make a Duplicate Content Checker Using Python

In this post, we’re going to build a duplicate content checker in Python. The algorithm compares two texts and measures how similar they are using the longest matching subsequence technique.

This tool can be useful for checking product descriptions on e-commerce stores that sell products from another supplier. Print on demand (POD) services are one example: they usually provide prewritten product descriptions that you can use on your website.

However, I strongly suggest you write your own descriptions, tailor them to your brand, and make them unique. Many people use POD services, and all of them have access to the same product descriptions. A copied description would be really hard to rank, since it has probably been published on other sites already.

Coding the duplicate content checker

For this demo project, I’m going to use a sample dataset from Kaggle containing product descriptions. We’ll compare these descriptions to one another and check for duplicates or near-duplicates.

Before we start on the logic, we need to import the necessary modules.

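import os
import pandas as pd
from tqdm import tqdm
from difflib import SequenceMatcher
from kaggle.api.kaggle_api_extended import KaggleApi

# Download folder for the dataset (assumed here to be the script's own directory)
ROOT = os.path.dirname(os.path.abspath(__file__))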

We’re going to fetch our sample dataset using the Kaggle API and handle all of it inside the script. In case you’re not familiar with this process, I suggest you check out my guide for downloading Kaggle datasets.

We’ll write a function that handles the download, reads the .csv file, and returns it as a pandas dataframe object.

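def get_data():
    """Download the sample dataset from Kaggle and load it as a dataframe."""
    # Authentication reads your Kaggle API credentials
    api = KaggleApi()
    api.authenticate()

    # Fetch a single file from the dataset into the ROOT folder
    api.dataset_download_file(
        'cclark/product-item-data',
        file_name='sample-data.csv',
        path=ROOT
    )

    df = pd.read_csv(os.path.join(ROOT, 'sample-data.csv'))

    return df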

Next, we’re going to write the function that actually finds near duplicates. It outputs a dataframe with the length of the longest matching subsequence (LMS) that each description shares with the target text, along with that length as a percentage of the description.

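def find_near_duplicates(df, target):
    """Compare every description in df against target, longest match first."""
    rows = []

    # Passing total lets tqdm show a real progress bar
    for _, row in tqdm(df.iterrows(), total=len(df)):

        text = row['description']

        # autojunk=False disables difflib's heuristic that treats very
        # frequent characters as junk in long texts
        s = SequenceMatcher(None, text, target, autojunk=False)
        result = s.find_longest_match(0, len(text), 0, len(target))

        rows.append({
            'id': row['id'],
            # Length of the longest contiguous block shared by both texts
            'LMS': result.size,
            # Share of this description covered by that block, in percent
            'identical': (result.size / len(text)) * 100 if text else 0.0
        })

    # Build the dataframe once instead of concatenating on every iteration
    output = pd.DataFrame(rows, columns=['id', 'LMS', 'identical'])

    return output.sort_values(by='LMS', ascending=False)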
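To see what find_longest_match actually returns, here is a minimal sketch on two short strings. The helper name longest_shared_block and the two sample sentences are made up for illustration; they are not part of the dataset or the script above.

```python
from difflib import SequenceMatcher


def longest_shared_block(a, b):
    """Return the difflib Match and the shared text for two strings.

    Hypothetical helper for illustration only.
    """
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    # Search the full range of both strings for the longest common block
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    # match.a is the start index in a, match.size is the block length
    return match, a[match.a:match.a + match.size]


# Made-up sample descriptions
first = "Soft cotton t-shirt with print"
second = "Classic cotton t-shirt in blue"

match, block = longest_shared_block(first, second)
print(match.size, repr(block))  # prints: 16 ' cotton t-shirt '
```

Note that the match is a single contiguous block of characters, so two texts that share many short phrases can still get a low LMS value.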

And finally, we need to put it into action. Remember, the function for finding near duplicates only compares one product description against the descriptions in the dataframe. Therefore, we need to repeat it for every product description.

This is where things can get inefficient if you have a large database of product descriptions, because the number of comparisons grows with the square of the number of descriptions. So, I would suggest you run this duplicate content checker only for similar items, like different types of t-shirts.

In our case here, I ran it only for the first 30 product descriptions in the dataset.

if __name__ == '__main__':
    
    data = get_data()
    data = data[:30]

    for _, row in data.iterrows():
        desc = row['description']
        result = find_near_duplicates(data, desc)
        print(result.head())
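The loop above compares every pair of descriptions twice and also compares each description with itself, so the top row of each result is always the description's own 100% match. One possible way to cut the work roughly in half is to visit each unordered pair once with itertools.combinations. This is only a sketch under my own assumptions: the function name pairwise_longest_matches and the tiny sample dictionary are hypothetical, and it returns plain tuples instead of a dataframe.

```python
from difflib import SequenceMatcher
from itertools import combinations


def pairwise_longest_matches(descriptions):
    """Compare each unordered pair of texts exactly once.

    descriptions: dict mapping an id to its text (hypothetical input shape).
    Returns (id_a, id_b, longest_match_length) tuples, longest match first.
    """
    results = []
    # combinations() yields every pair once and never pairs an item with itself
    for (id_a, text_a), (id_b, text_b) in combinations(descriptions.items(), 2):
        matcher = SequenceMatcher(None, text_a, text_b, autojunk=False)
        match = matcher.find_longest_match(0, len(text_a), 0, len(text_b))
        results.append((id_a, id_b, match.size))
    return sorted(results, key=lambda item: item[2], reverse=True)


# Made-up sample data for illustration
sample = {
    1: "blue cotton shirt",
    2: "red cotton shirt",
    3: "wool hat",
}

pairs = pairwise_longest_matches(sample)
print(pairs[0])  # prints: (1, 2, 13)
```

With the real dataset, you could build the dictionary from the id and description columns. The number of comparisons still grows quadratically, so limiting the check to similar items remains a good idea.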

Conclusion

To conclude, we made a simple duplicate content checker with Python to find out how similar the product descriptions in a sample dataset are. I learned a lot while working on this project and I hope you’ll find it helpful as well.