
Tutorial: How To Use BeautifulSoup For Web Scraping In Python

In this tutorial, we’ll take a look at how to write a web scraping script with Beautiful Soup in Python. Additionally, we’ll scrape the content from the featured posts on Coin Telegraph and summarize their contents.

Therefore, we’re not only going to cover web scraping in this tutorial, but also text summarization using OpenAI’s API. But before we get to that, let’s focus on the web scraping task first.

Before we begin with the project, you should install Beautiful Soup, in case you haven’t already. You can do so with pip using the following command: pip install beautifulsoup4.
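The script also relies on httplib2 and LangChain with an OpenAI backend, as you’ll see in the imports below, so it may save you some trouble to install those up front as well. Assuming the standard package names on PyPI, the following should cover everything used here:

pip install beautifulsoup4 httplib2 langchain openai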

Coding the web scraping script

For this project, we’ll need to set up a couple of helper functions, as we’ll need them throughout the project. These include functions for loading and saving data in JSON files, and a function for fetching the BeautifulSoup object.

The BeautifulSoup object is essentially the parsed HTML from the URL address we pass to it. Furthermore, we can use the methods of this class to access the elements of the HTML document.
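As a tiny standalone illustration (not part of the project code), here’s how a BeautifulSoup object parses a snippet of HTML and lets us query an element from it:

from bs4 import BeautifulSoup

# Parse a small HTML snippet and pull out the text of a paragraph by its class.
html = '<html><body><p class="intro">Hello</p></body></html>'
soup = BeautifulSoup(html, 'html.parser')
print(soup.find('p', class_='intro').get_text())  # prints: Hello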

First of all, like with any other Python project, let’s import all the necessary libraries, including the ones from LangChain for the summarization task we’ll do later.

import os
import json
import httplib2
from bs4 import BeautifulSoup
from langchain.chat_models import ChatOpenAI
from langchain.chains.summarize import load_summarize_chain
from langchain.docstore.document import Document

Because we’ll be working with LangChain, which requires an API key, we’ll also create a simple auth.json file to store it. Furthermore, we’ll write a function that fetches that key in our script. This way, we don’t need to reveal the key in case we want to share the script with anyone else.
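For reference, the auth.json file can be as simple as the following, with a placeholder standing in for your actual OpenAI API key (the key name just has to match the one we pass to get_token below):

{
    "openai-token": "sk-..."
}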

# Directory containing this script; auth.json and the data folder will live next to it.
ROOT = os.path.dirname(__file__)

def get_token(token_name):
    # Read the requested token from auth.json so the key never appears in the code itself.
    with open(os.path.join(ROOT, 'auth.json'), 'r') as auth_file:
        auth_data = json.load(auth_file)
        token = auth_data[token_name]
        return token

os.environ['OPENAI_API_KEY'] = get_token('openai-token')

We’ll also need a place to store the data we’ll be scraping and summarizing. Therefore, we’ll create a data folder inside our project directory.

DATA = os.path.join(ROOT, 'data')

if not os.path.exists(DATA):
    os.mkdir(DATA)

Next, we need to write functions that will load and save the data inside files in the data directory.

def load_data(data_path):
    # Return the parsed JSON data, or None if the file doesn't exist or can't be parsed.
    try:
        with open(os.path.join(DATA, data_path), 'r') as data_file:
            data = json.load(data_file)
            return data
    except (FileNotFoundError, json.JSONDecodeError):
        return None

def save_data(data, data_path):
    # Write the data as pretty-printed JSON into the data directory.
    with open(os.path.join(DATA, data_path), 'w') as data_file:
        json.dump(data, data_file, indent=4)

And lastly, as far as our utility functions go, we need to make a function for getting that BeautifulSoup object.

def get_soup(url):
    # httplib2 returns a (response, content) tuple; the content holds the raw HTML.
    http = httplib2.Http()
    response, content = http.request(url)
    soup = BeautifulSoup(content, 'html.parser')
    return soup

Okay, now for the web scraping part of this tutorial, we’ll create a function that takes the link to the Coin Telegraph homepage, gets the links of the featured posts, and scrapes the text content from each article. Additionally, this process includes filtering the paragraph tags by removing links and images.

def scrape_news():
    url = 'https://cointelegraph.com'
    soup = get_soup(url)
    # The featured posts on the homepage are marked with this class.
    latest_articles = soup.find_all(class_='main-news-controls__item')

    extracted_text = {}

    for article in latest_articles:
        href = article.find('a')['href']
        # Skip external (absolute) links and build full URLs from the relative ones.
        if 'https' in href:
            continue
        link = url + href

        article_soup = get_soup(link)
        try:
            content = article_soup.find(class_='post-content')
            p_tags = content.find_all('p')
            text = ''
            for p in p_tags:
                # Remove inline links and images so only the plain text remains.
                for a in p.find_all('a'):
                    a.extract()
                for img in p.find_all('img'):
                    img.extract()
                text += p.get_text(strip=True) + '\n'
            extracted_text[link] = text
        except AttributeError:
            # Skip articles whose content section can't be found.
            continue

    print(f'Extracted text from {len(extracted_text)} articles from COINTELEGRAPH.')

    return extracted_text

The preceding function returns a dictionary, where the article’s link is the key and the actual text from it is the value.
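To make the shape of that dictionary concrete, the result looks roughly like this (the URLs and text here are purely illustrative placeholders):

{
    'https://cointelegraph.com/news/some-article': 'First paragraph...\nSecond paragraph...\n',
    'https://cointelegraph.com/news/another-article': '...'
}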

Bonus: Summarize text with LangChain

The following function will use a summarization chain to take the whole text of an article and extract a short, meaningful summary from it.

In case we want to run this script multiple times, we’ll also add a check for whether the summary for a given article already exists; if it does, we don’t process it again. By doing so, we can use the OpenAI API efficiently and not waste money on articles we’ve already processed.

def write_summaries():

    # load_data and save_data already prepend the DATA directory, so we only pass the file name.
    data_path = 'crypto_article_summaries.json'
    extracted_text = scrape_news()

    if not extracted_text:
        return

    # Load any summaries we've already saved so we don't pay to process those articles again.
    existing_data = load_data(data_path)
    if not existing_data:
        existing_data = {}

    llm = ChatOpenAI(model='gpt-3.5-turbo-0613')
    chain = load_summarize_chain(llm, chain_type='map_reduce')

    count = 0
    for link in extracted_text:
        if link in existing_data:
            continue

        try:
            existing_data[link] = chain.run([Document(page_content=extracted_text[link])])
            count += 1
        except Exception:
            # Skip articles the chain fails to summarize (e.g. API or rate-limit errors).
            continue

    print(f'Saved {count} new summaries.')
    save_data(existing_data, data_path)

And there you have it, we can finally run the write_summaries() function, which will also run the web scraping function.
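If you want everything to kick off when the file is run directly, a standard entry-point guard is all that’s needed (a minimal sketch):

if __name__ == '__main__':
    # Scrape the featured Coin Telegraph articles and write their summaries to the data folder.
    write_summaries()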

Conclusion

To conclude, we wrote a simple web scraping script to fetch the text content of articles and summarize them. I learned a lot while working on this project and I hope it proves to be helpful to you as well.
