How To Use Cosine Similarity To Check For Text Similarity

In this post, we’ll make a text similarity checker script using Python. We’ll also use the OpenAI API to summarize text and generate embeddings.

For demonstration purposes, we’re going to scrape the text of 2 articles (from CNN and NBC News) that both cover the Twitter rebranding. Because they report on the same story, we expect the cosine similarity score to be high.

To decide whether 2 articles talk about the same thing, we’d need to set a threshold manually. In other words, we’d need to find this threshold by trial and error. I’d suggest that you set it to a round number, since the cosine similarity score might fluctuate.
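As a rough illustration of what such a threshold check could look like, here is a minimal sketch. The name is_same_topic and the value 0.9 are assumptions made up for this example, not something the script in this post uses.

```python
# Hypothetical threshold; tune it by trial and error on your own article pairs.
SIMILARITY_THRESHOLD = 0.9

def is_same_topic(score, threshold=SIMILARITY_THRESHOLD):
    """Return True when a cosine similarity score meets the threshold."""
    return score >= threshold

print(is_same_topic(0.95))  # True
print(is_same_topic(0.72))  # False
```

A round threshold like this leaves some room for small changes in the score from one run to the next.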

However, tuning that threshold is beyond the scope of this tutorial. The simple script we’ll make only compares the summaries and calculates the cosine similarity score.

Setting up

First of all, as with any Python project, we need to import the necessary libraries and methods. We’ll need them for several tasks, including web scraping, text summarization, embedding generation, and cosine similarity calculation.

import os
import json
import httplib2
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

from bs4 import BeautifulSoup
from langchain.chat_models import ChatOpenAI
from langchain.chains.summarize import load_summarize_chain
from langchain.docstore.document import Document
from langchain.embeddings import OpenAIEmbeddings

Because we’re going to use the OpenAI API, we’ll need an API key, which we can only get with a paid plan on their website. The following function reads the API key from a local file and returns its value. This keeps the key out of the script, in case you want to share the code with other people.

ROOT = os.path.dirname(__file__)

def get_token(token_name):
    with open(os.path.join(ROOT, 'auth.json'), 'r') as auth_file:
        auth_data = json.load(auth_file)
        token = auth_data[token_name]
        return token

os.environ['OPENAI_API_KEY'] = get_token('openai-token')
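The get_token function expects a file named auth.json next to the script, holding a JSON object with an 'openai-token' key. The sketch below writes a sample file with a placeholder value and reads it back the same way. The temporary folder and the placeholder key are assumptions for this example only.

```python
import json
import os
import tempfile

# Placeholder value; a real API key would go here.
sample_auth = {'openai-token': 'sk-your-key-here'}

# Write the sample into a temporary folder so a real auth.json stays untouched.
tmp_dir = tempfile.mkdtemp()
auth_path = os.path.join(tmp_dir, 'auth.json')
with open(auth_path, 'w') as auth_file:
    json.dump(sample_auth, auth_file, indent=4)

# Read the value back the same way get_token does.
with open(auth_path, 'r') as auth_file:
    loaded = json.load(auth_file)

print(loaded['openai-token'])
```

If you keep the project in version control, it is worth excluding auth.json from it so the key is not shared by accident.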

Next, we’ll need a couple of functions for web scraping. The following functions fetch a page’s HTML markup and extract the text from it.

def get_soup(url):
    http = httplib2.Http()
    # httplib2 returns a (response headers, body) tuple
    response, content = http.request(url)
    soup = BeautifulSoup(content, 'html.parser')
    return soup

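def filter_content(content):
    # Collect the text of every paragraph inside the content element
    p_tags = content.find_all('p')
    text = ''

    for p in p_tags:
        # Drop links and images, including any link text
        for link in p.find_all('a'):
            link.extract()
        for img in p.find_all('img'):
            img.extract()
        text += p.get_text(strip=True) + '\n'

    return text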

Okay, now we’ll need to set up the tools from the LangChain library for summarizing text and generating embeddings from it.

llm = ChatOpenAI(model='gpt-3.5-turbo-0613')
chain = load_summarize_chain(llm, chain_type='map_reduce')
embeddings = OpenAIEmbeddings()

Great, we have everything set up for our text similarity checker now. To demonstrate it, I’m going to create a dictionary with the links to the 2 articles and the classes of the elements that hold their content.

articles_data = {
    'cnn': {
        'url': 'https://edition.cnn.com/2023/07/24/tech/twitter-rebrands-x-elon-musk-hnk-intl/index.html',
        'class': 'article__content'
    },
    'nbcnews': {
        'url': 'https://www.nbcnews.com/news/us-news/twitter-rebrands-x-elon-musk-loses-iconic-bird-logo-rcna95880',
        'class': 'article-body__content'
    }
}

Coding the text similarity checker

The following code loops through the entries we set inside the articles_data dictionary, so you can easily scale this up to check multiple links. However, we won’t calculate the cosine similarity inside the loop; we’re only going to get all the necessary data and store it.

for article in articles_data:
    article_soup = get_soup(articles_data[article]['url'])
    text = filter_content(article_soup.find(class_=articles_data[article]['class']))

    summary = chain.run([Document(page_content=text)])
    print(f'Article summary:\n{summary}\n')

    articles_data[article]['embeddings'] = np.array(embeddings.embed_documents([summary]))

NOTE: The summarization step condenses each article to its main points before we embed it. The embedding vector has a fixed length regardless of how long the input text is, so the summary does not shrink the array; it keeps the input short and focused on what the article is about.
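Before the final step, it may help to see what cosine similarity computes: the dot product of two vectors divided by the product of their lengths. The sketch below is a plain-Python illustration on made-up vectors, not the scikit-learn function the script uses. The helper name cosine is an assumption for this example.

```python
import math

def cosine(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Vectors pointing the same way score close to 1, whatever their length.
same_direction = cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
# Vectors at a right angle score 0.
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])
# Vectors pointing in opposite directions score close to -1.
opposite = cosine([1.0, 2.0], [-1.0, -2.0])

print(same_direction, orthogonal, opposite)
```

Only the direction of the vectors matters, not their magnitude, which is why the score always lands between -1 and 1.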

And for the final step, we calculate the cosine similarity between the embedding arrays of the two articles.

cosine = cosine_similarity(
    articles_data['cnn']['embeddings'], 
    articles_data['nbcnews']['embeddings']
)

print('Cosine similarity score:', np.squeeze(cosine))

The following is the output.

Cosine similarity score: 0.9453810036291601
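The comparison above is hard-coded to the 'cnn' and 'nbcnews' entries. If you add more articles to the dictionary, one possible approach is to score every pair. The sketch below uses made-up toy vectors in place of real embeddings and a plain-Python helper in place of scikit-learn, so the names toy_embeddings and cosine and the 'unrelated' entry are assumptions for illustration.

```python
import math
from itertools import combinations

def cosine(a, b):
    """Plain-Python cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings (values are made up).
toy_embeddings = {
    'cnn': [0.9, 0.1, 0.3],
    'nbcnews': [0.8, 0.2, 0.4],
    'unrelated': [-0.2, 0.9, -0.5],
}

# Score every unordered pair of articles exactly once.
scores = {
    (first, second): cosine(toy_embeddings[first], toy_embeddings[second])
    for first, second in combinations(toy_embeddings, 2)
}

for (first, second), score in scores.items():
    print(f'{first} vs {second}: {score:.3f}')
```

With real data, the toy vectors would be replaced by the embeddings stored in articles_data.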

Here is the entire code of the project.

import os
import json
import httplib2
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

from bs4 import BeautifulSoup
from langchain.chat_models import ChatOpenAI
from langchain.chains.summarize import load_summarize_chain
from langchain.docstore.document import Document
from langchain.embeddings import OpenAIEmbeddings

ROOT = os.path.dirname(__file__)

def get_token(token_name):
    with open(os.path.join(ROOT, 'auth.json'), 'r') as auth_file:
        auth_data = json.load(auth_file)
        token = auth_data[token_name]
        return token

os.environ['OPENAI_API_KEY'] = get_token('openai-token')

def get_soup(url):
    http = httplib2.Http()
    # httplib2 returns a (response headers, body) tuple
    response, content = http.request(url)
    soup = BeautifulSoup(content, 'html.parser')
    return soup

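def filter_content(content):
    # Collect the text of every paragraph inside the content element
    p_tags = content.find_all('p')
    text = ''

    for p in p_tags:
        # Drop links and images, including any link text
        for link in p.find_all('a'):
            link.extract()
        for img in p.find_all('img'):
            img.extract()
        text += p.get_text(strip=True) + '\n'

    return text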

llm = ChatOpenAI(model='gpt-3.5-turbo-0613')
chain = load_summarize_chain(llm, chain_type='map_reduce')
embeddings = OpenAIEmbeddings()

articles_data = {
    'cnn': {
        'url': 'https://edition.cnn.com/2023/07/24/tech/twitter-rebrands-x-elon-musk-hnk-intl/index.html',
        'class': 'article__content'
    },
    'nbcnews': {
        'url': 'https://www.nbcnews.com/news/us-news/twitter-rebrands-x-elon-musk-loses-iconic-bird-logo-rcna95880',
        'class': 'article-body__content'
    }
}

for article in articles_data:
    article_soup = get_soup(articles_data[article]['url'])
    text = filter_content(article_soup.find(class_=articles_data[article]['class']))

    summary = chain.run([Document(page_content=text)])
    print(f'Article summary:\n{summary}\n')

    articles_data[article]['embeddings'] = np.array(embeddings.embed_documents([summary]))


cosine = cosine_similarity(
    articles_data['cnn']['embeddings'], 
    articles_data['nbcnews']['embeddings']
)

print('Cosine similarity score:', np.squeeze(cosine))

Conclusion

To conclude, we made a simple Python script for checking the text similarity between 2 articles. I learned a lot while working on this project, and I hope this guide proves helpful to you as well.