Tutorial: How To Use BeautifulSoup For Web Scraping In Python
In this tutorial, we’ll take a look at how to write a web scraping script with Beautiful Soup in Python. Additionally, we’ll scrape the content from the featured posts on Coin Telegraph and summarize their contents.
So we’re not only going to use web scraping in this tutorial, but also summarization techniques using OpenAI’s API. Before we get to that, though, let’s focus on the web scraping task first.
Before we begin with the project, you should install Beautiful Soup, in case you haven’t already. You can do so with pip using the following command:

pip install beautifulsoup4
Coding the web scraping script
For this project, we’ll need to set up a couple of helper methods that we’ll use throughout the project. These include methods for loading and saving data from JSON files, and a method for fetching a BeautifulSoup object.
A BeautifulSoup object is essentially the parsed HTML of the URL we pass to it. We can then use the methods of this class to access individual elements of the HTML document.
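For example, parsing even a tiny HTML snippet already lets us pull out tags and attributes. Here’s a minimal, standalone illustration, separate from the script we’re about to build:

from bs4 import BeautifulSoup

html = '<html><body><h1>Hello</h1><a href="/news">News</a></body></html>'
soup = BeautifulSoup(html, 'html.parser')
print(soup.find('h1').get_text())  # Hello
print(soup.find('a')['href'])      # /news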
First of all, as with any other Python project, let’s import all the necessary libraries, including the ones from LangChain for the summarization task we’ll do later.
import os
import json
import httplib2
from bs4 import BeautifulSoup
from langchain.chat_models import ChatOpenAI
from langchain.chains.summarize import load_summarize_chain
from langchain.docstore.document import Document
Because we’ll be working with LangChain, which requires us to provide an OpenAI API key, we’ll also create a simple auth.json file where we’ll store it. We’ll then write a function that fetches that key inside our script. This way, we don’t have to reveal the key if we want to share the script with anyone else.
ROOT = os.path.dirname(__file__)

def get_token(token_name):
    with open(os.path.join(ROOT, 'auth.json'), 'r') as auth_file:
        auth_data = json.load(auth_file)
    token = auth_data[token_name]
    return token

os.environ['OPENAI_API_KEY'] = get_token('openai-token')
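For reference, auth.json just maps token names to their values. The key name has to match whatever we pass to get_token, and the value below is only a placeholder for your actual API key:

{
    "openai-token": "sk-..."
}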
We’ll also need a place to store the data we’ll be scraping and summarizing. Therefore, we’ll create a data folder inside our project directory.
DATA = os.path.join(ROOT, 'data')
if not os.path.exists(DATA):
    os.mkdir(DATA)
Next, we need to write functions that will load and save the data inside files in the data directory.
def load_data(data_path):
    try:
        with open(os.path.join(DATA, data_path), 'r') as data_file:
            data = json.load(data_file)
        return data
    except (FileNotFoundError, json.JSONDecodeError):
        # No saved data yet (or the file is unreadable), so signal that with None
        return None

def save_data(data, data_path):
    with open(os.path.join(DATA, data_path), 'w') as data_file:
        json.dump(data, data_file, indent=4)
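To make sure the helpers behave as expected, we can round-trip a small dictionary through them. This is just a throwaway check, not part of the final script:

save_data({'hello': 'world'}, 'test.json')
print(load_data('test.json'))            # {'hello': 'world'}
print(load_data('does_not_exist.json'))  # None, thanks to the try/except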
And lastly, as far as our utility functions go, we need a function that fetches a URL and returns its BeautifulSoup object.
def get_soup(url):
    # httplib2 returns a (response, content) tuple; the content holds the raw HTML
    http = httplib2.Http()
    response, content = http.request(url)
    soup = BeautifulSoup(content, 'html.parser')
    return soup
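We can quickly verify the helper works by fetching any page and reading something simple from the parsed document, such as its title (what exactly gets printed naturally depends on the site at the time):

soup = get_soup('https://cointelegraph.com')
print(soup.title.get_text())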
Okay, now for the web scraping part of this tutorial. We’ll create a function that takes the link to the Coin Telegraph homepage, collects the links of the featured posts, and scrapes the text content from each article. This process also includes filtering the paragraph tags by removing links and images.
def scrape_news():
    url = 'https://cointelegraph.com'
    soup = get_soup(url)
    latest_articles = soup.find_all(class_='main-news-controls__item')
    extracted_text = {}
    for article in latest_articles:
        href = article.find('a')['href']
        # Skip external links; relative hrefs belong to Coin Telegraph itself
        if 'https' in href:
            continue
        link = url + href
        article_soup = get_soup(link)
        try:
            content = article_soup.find(class_='post-content')
            p_tags = content.find_all('p')
            text = ''
            for p in p_tags:
                # Strip links and images so only the plain paragraph text remains
                for a in p.find_all('a'):
                    a.extract()
                for img in p.find_all('img'):
                    img.extract()
                text += p.get_text(strip=True) + '\n'
            extracted_text[link] = text
        except AttributeError:
            # The article layout didn't match; skip it instead of dropping everything
            continue
    print(f'Extracted text from {len(extracted_text)} articles from COINTELEGRAPH.')
    return extracted_text
The preceding function returns a dictionary where each article’s link is the key and the article’s text is the value.
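If you want to inspect what came back before spending any API credits on summaries, a quick loop over that dictionary will do. The links and text will, of course, depend on whichever posts are currently featured:

articles = scrape_news()
if articles:
    for link, text in articles.items():
        print(link)
        print(text[:200], '...')  # first 200 characters of the article body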
Bonus: Summarize text with LangChain
The following function uses a summarization chain to take the whole text of an article and extract a short, meaningful summary from it.
Since we may want to run this script multiple times, we’ll also add a check for whether the summary of a given article already exists; if it does, we don’t process that article again. This way, we use the OpenAI API efficiently and don’t waste money on articles we’ve already processed.
def write_summaries():
    data_path = 'crypto_article_summaries.json'
    extracted_text = scrape_news()
    if not extracted_text:
        return
    existing_data = load_data(data_path)
    if not existing_data:
        existing_data = {}
    llm = ChatOpenAI(model='gpt-3.5-turbo-0613')
    chain = load_summarize_chain(llm, chain_type='map_reduce')
    count = 0
    for link in extracted_text:
        # Skip articles that were already summarized on a previous run
        if link in existing_data:
            continue
        try:
            existing_data[link] = chain.run([Document(page_content=extracted_text[link])])
            count += 1
        except Exception:
            # If the API call fails for one article, move on to the next
            continue
    print(f'Saved {count} new summaries.')
    save_data(existing_data, data_path)
And there you have it: we can finally run the write_summaries() function, which will also run the web scraping function.
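One way to wire everything up is a standard entry-point guard at the bottom of the file, so scraping and summarizing run whenever the script is executed directly:

if __name__ == '__main__':
    write_summaries()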
Conclusion
To conclude, we wrote a simple web scraping script that fetches the text content of articles and summarizes them. I learned a lot while working on this project and I hope it proves to be helpful to you as well.