How To Use AI For Keyword Extraction With Python

Ever wondered how to leverage state-of-the-art artificial intelligence for keyword extraction?

In this post, we’re going to build a simple yet powerful Python script for extracting relevant keywords from websites. To do so, we’re going to use the GPT-3.5 Turbo model. In case you’re not familiar with it, it’s the same model the free version of ChatGPT uses.

This kind of tool can come in handy when we’re researching what other people have posted online about the topic we’re writing about. Therefore, we’re also going to add Custom Search API capabilities to our script.

This way, we’ll not only extract keywords, but also fetch the first 10 search results for them.

Prerequisites

Before we begin coding, we need to set up a couple of things. First of all, you’ll need to set up access to the APIs we’ll use.

We’ll start by getting access to the OpenAI API

  1. Create an account on the OpenAI official website
  2. Verify your account and create an API key
  3. Add funds to your account (ChatGPT might be free, but the API isn’t)
  4. Save the API key in a safe place

Next on the list of APIs is the Custom Search API

  1. Create a programmable search engine from Google
  2. Copy the ID of that search engine into a safe place
  3. Create a project on Google Cloud Console
  4. Go under Credentials and create an API key
  5. Copy that API key into a safe place
  6. Search for the Custom Search API in the API Library and enable it

And lastly, you’ll need to install a few Python packages, including python-dotenv and requests, which the script imports as well. You can do so by using the following pip commands.

pip install beautifulsoup4
pip install google-api-python-client
pip install langchain
pip install langchain-core
pip install langchain-openai
pip install python-dotenv
pip install requests

Let’s get coding

Okay, so before we go straight into the code, let me give you a quick rundown of how it’s all going to work.

  1. Scrape text from a webpage and store it into a variable
  2. Let the GPT model do its magic and return a structured output – a list of relevant keywords
  3. Input the keywords into a Custom Search API to get search results
  4. Filter the search results and get the links to the websites

Sounds like a plan, so let’s start by importing all the necessary modules and tools we’ll need.

import os
import requests
import argparse
from dotenv import load_dotenv
from bs4 import BeautifulSoup
from langchain.output_parsers import ResponseSchema, StructuredOutputParser
from langchain.prompts import PromptTemplate
from langchain_openai import ChatOpenAI
from googleapiclient.discovery import build
from pprint import pprint

Then, you’ll need to create a .env file inside your project folder and paste in your OpenAI API key, Google Cloud Console API key, and search engine ID. It should look something like the following snippet.

OPENAI_API_KEY=your OpenAI API key
CUSTOM_SEARCH_API_KEY=your Google Cloud Console API key
CUSTOM_SEARCH_ENGINE_ID=your search engine ID

And in order for us to use those values, we need to call the following method.

load_dotenv()
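After this call, the values from the .env file behave like regular environment variables. The ChatOpenAI client we’ll use later picks up OPENAI_API_KEY from the environment on its own, while the other two values we’ll read explicitly, for example:

api_key = os.getenv('CUSTOM_SEARCH_API_KEY')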

Part 1: Web scraping

The first step of the entire process is to get the text on which we’ll perform the keyword extraction. We’ll get that text from a webpage, so we need to write a couple of methods that will scrape the information we need.

For this purpose, we’ll use a popular web scraping module called BeautifulSoup. In order to scrape the data, we need to fetch the HTML document and store it in a BeautifulSoup object.

def get_soup(url):
    # Fetch the page and parse its HTML; return None if the request fails
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
    except requests.RequestException:
        return None
    return BeautifulSoup(response.content, 'html.parser')

We’ll need to filter this data further by excluding images and links, since we’re here just for the text. We can do so by passing the page’s body tag into the following method.

def scrape_content(body):
    # Gather every paragraph tag from the page body
    p_tags = body.find_all('p')
    text = []
    for p in p_tags:
        # Strip out links and images so only plain paragraph text remains
        for link in p.find_all('a'):
            link.extract()
        for img in p.find_all('img'):
            img.extract()
        text.append(p.get_text(strip=False))

    return ' '.join(text)
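Before moving on, it’s worth giving these two methods a quick test run. Here’s a minimal sketch, where example.com is just a placeholder URL:

soup = get_soup('https://example.com')
if soup is not None:
    text = scrape_content(soup.find('body'))
    print(text[:200])  # first 200 characters of the scraped text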

Part 2: Keyword Extraction

This is the part where we use the GPT model to do the heavy lifting for us. We’ll also use an output parser to make sure that the following method returns a list of keywords.

def extract_keywords(text):

    response_schemas = [
        ResponseSchema(name='keywords', description='List of extracted keywords from a string')
    ]

    output_parser = StructuredOutputParser.from_response_schemas(response_schemas)

    format_instructions = output_parser.get_format_instructions()
    prompt = PromptTemplate(
        template='extract relevant keywords from the following text that would rank on Google: \n{text}\n{format_instructions}',
        input_variables=['text'],
        partial_variables={'format_instructions': format_instructions}
    )

    # Use GPT-3.5 Turbo; temperature=0 keeps the output deterministic
    model = ChatOpenAI(model='gpt-3.5-turbo', temperature=0)
    chain = prompt | model | output_parser
    keywords = chain.invoke({'text': text})

    return keywords['keywords']
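For context, the format instructions generated by the parser tell the model to answer with a markdown-fenced JSON snippet, which the parser then converts into a plain Python dict. The raw model reply will typically look something like this (the keywords shown are purely illustrative):

```json
{
    "keywords": ["keyword extraction", "web scraping", "python tutorial"]
}
```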

Part 3: Using Custom Search API

And last but not least, we’re going to use those keywords to find related search results with the Custom Search API.

def get_search_results(query):
    service = build(
        'customsearch',
        'v1',
        developerKey=os.getenv('CUSTOM_SEARCH_API_KEY')
    )

    # The API returns the top 10 results per query by default
    res = service.cse().list(
        q=query,
        cx=os.getenv('CUSTOM_SEARCH_ENGINE_ID')
    ).execute()

    return res
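The response comes back as a plain Python dict. The actual results live under the 'items' key, and each item carries fields such as 'title', 'link', and 'snippet'. Here’s a quick way to inspect what an example query returns:

res = get_search_results('keyword extraction')
for item in res.get('items', []):
    print(item['title'], '->', item['link'])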

Part 4: Putting Everything Together

Now that we’ve built all the tools we need, it’s time to put them to use. We’re also going to utilize the argparse module, so we can use the script straight from the command line.

For demonstration purposes, I’ll only use the first keyword in the list for getting the search results, but you can easily modify it to fetch results for all of them (a sketch of that modification follows the code below).

if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description='Returns list of search results links'
    )

    parser.add_argument(
        '-u',
        '--url',
        type=str,
        required=True,
        help='URL address from where to scrape information'
    )

    args = parser.parse_args()
    url = args.url

    soup = get_soup(url)
    if soup is None:
        raise SystemExit(f'Could not fetch {url}')
    body = soup.find('body')
    text = scrape_content(body)
    keywords = extract_keywords(text)
    print(keywords)

    res = get_search_results(keywords[0])
    links = [l['link'] for l in res['items']]
    pprint(links)
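With everything in place, you can run the script straight from the command line. Assuming you saved it as keyword_extraction.py (just a placeholder name), it would look like this:

python keyword_extraction.py -u https://example.com/some-article

And here’s the modification mentioned above: a minimal sketch of fetching results for every extracted keyword instead of just the first one.

    all_links = {}
    for keyword in keywords:
        res = get_search_results(keyword)
        all_links[keyword] = [item['link'] for item in res.get('items', [])]
    pprint(all_links)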

Alright! Now you’re ready to start doing research on your competition, finding related articles on topics you want to know more about, or expanding the functionality even further.

Entire code of the project

Here is the entire code of the project, which you can also find in the GitHub repository I made for it.

import os
import requests
import argparse
from dotenv import load_dotenv
from bs4 import BeautifulSoup
from langchain.output_parsers import ResponseSchema, StructuredOutputParser
from langchain.prompts import PromptTemplate
from langchain_openai import ChatOpenAI
from googleapiclient.discovery import build
from pprint import pprint

load_dotenv()

def get_soup(url):
    # Fetch the page and parse its HTML; return None if the request fails
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
    except requests.RequestException:
        return None
    return BeautifulSoup(response.content, 'html.parser')
    
def scrape_content(body):
    # Gather every paragraph tag from the page body
    p_tags = body.find_all('p')
    text = []
    for p in p_tags:
        # Strip out links and images so only plain paragraph text remains
        for link in p.find_all('a'):
            link.extract()
        for img in p.find_all('img'):
            img.extract()
        text.append(p.get_text(strip=False))

    return ' '.join(text)


def extract_keywords(text):

    response_schemas = [
        ResponseSchema(name='keywords', description='List of extracted keywords from a string')
    ]

    output_parser = StructuredOutputParser.from_response_schemas(response_schemas)

    format_instructions = output_parser.get_format_instructions()
    prompt = PromptTemplate(
        template='extract relevant keywords from the following text that would rank on Google: \n{text}\n{format_instructions}',
        input_variables=['text'],
        partial_variables={'format_instructions': format_instructions}
    )

    # Use GPT-3.5 Turbo; temperature=0 keeps the output deterministic
    model = ChatOpenAI(model='gpt-3.5-turbo', temperature=0)
    chain = prompt | model | output_parser
    keywords = chain.invoke({'text': text})

    return keywords['keywords']

def get_search_results(query):
    service = build(
        'customsearch',
        'v1',
        developerKey=os.getenv('CUSTOM_SEARCH_API_KEY')
    )

    # The API returns the top 10 results per query by default
    res = service.cse().list(
        q=query,
        cx=os.getenv('CUSTOM_SEARCH_ENGINE_ID')
    ).execute()

    return res


if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description='Returns list of search results links'
    )

    parser.add_argument(
        '-u',
        '--url',
        type=str,
        required=True,
        help='URL address from where to scrape information'
    )

    args = parser.parse_args()
    url = args.url

    soup = get_soup(url)
    if soup is None:
        raise SystemExit(f'Could not fetch {url}')
    body = soup.find('body')
    text = scrape_content(body)
    keywords = extract_keywords(text)
    print(keywords)

    res = get_search_results(keywords[0])
    links = [l['link'] for l in res['items']]
    pprint(links)

Conclusion

To conclude, we made a simple yet powerful Python script for performing keyword extraction on web-scraped content. In addition, we used the Custom Search API to get search results for the extracted keywords. I learned a lot while working on this project and I hope you’ll find it helpful as well.
