How To Find Internal And External Links In Posts With Python
In this post, we’ll build a Python script that finds all internal and external links in blog posts. We’ll store the results in a pandas dataframe so we can further analyze and save the data.
The purpose of this extraction and analysis is to find opportunities to improve the SEO performance of a post, and therefore of the entire website. Internal linking also gives visitors a better user experience by recommending related content within the post they landed on.
On top of that, this tool lets us find posts that don’t include any internal links, which is a good starting point for making SEO improvements.
Prerequisites
First of all, like with any Python project, we need to import the necessary modules and tools. This project involves fetching, parsing, and storing HTML DOM data from a website.
We’ll use the well-known web scraping library BeautifulSoup to extract the links from posts, which is the heart of this project. You can install it with pip using the following command.
pip install beautifulsoup4
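The script also relies on requests, pandas, tqdm, and lxml (BeautifulSoup needs lxml to use its 'xml' parser for the sitemap). If you don’t have them yet, they can be installed the same way:
pip install requests pandas tqdm lxml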
And here are all the modules we’ll need for this project.
import requests
import pandas as pd
from urllib.parse import urlparse
from bs4 import BeautifulSoup
from tqdm import tqdm
We’re also going to increase the maximum column display width for pandas dataframes, so whole links stay visible when we print out the dataframe with their respective information.
pd.set_option('display.max_colwidth', 100)
Get URL addresses of all posts
Before we can begin scraping posts, we first need to get the URLs of all the posts on the blog. We can achieve this by scraping the posts sitemap.
However, if we simply scraped all the links from this sitemap, we’d also end up with the image links for each post. Therefore, we need to exclude those before storing the URLs in a list of post links.
The following snippet shows a function that accomplishes this and returns a list of all post URLs.
def get_sitemap_links(url, all_links=None):
    # Create a fresh list on each call to avoid the mutable default argument pitfall
    if all_links is None:
        all_links = []
    response = requests.get(url)
    if response.status_code == 200:
        try:
            # The 'xml' parser requires the lxml package
            soup = BeautifulSoup(response.text, 'xml')
            # Skip image URLs, which live under wp-content
            links = [loc.text for loc in soup.find_all('loc') if 'wp-content' not in loc.text]
        except Exception:
            return
        else:
            for link in links:
                # Ignore nested sitemap files
                if link[-3:] != 'xml':
                    all_links.append(link)
            return all_links
    else:
        print(f'Request failed with status code {response.status_code}')
        return
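As a quick sanity check, you can call the function directly on your sitemap (the URL below is the one we use later in this post) and see how many post URLs it returns:
post_urls = get_sitemap_links('https://ak-codes.com/post-sitemap.xml')
if post_urls:
    print(f'Found {len(post_urls)} post URLs')
    print(post_urls[:3])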
Scraping post content
Once we have the links to all the posts, we can start making requests and scraping each page for links. The following snippet makes a GET request to the page and creates a BeautifulSoup object, which we’ll use to find the links later.
def get_soup(url):
    try:
        # Fetch the page and parse its HTML; a failed request or parse returns None
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'html.parser')
        return soup
    except Exception:
        return
Next, we’ll write a couple of functions to extract information from this object, which we’ll use to populate a dataframe. One function fetches the title of the post, while the other handles the links.
def get_title(soup):
    try:
        return soup.find('title').text
    except AttributeError:
        return

def get_post_links(soup, class_):
    # class_ is the CSS class of the element that wraps the post content
    articles = soup.find_all(class_=class_)
    links = []
    for article in articles:
        link_elms = article.find_all('a')
        for link in link_elms:
            href = link.get('href')
            # Some anchor tags have no href attribute, so skip those
            if href:
                links.append(href)
    # Remove duplicate links within the same post
    links = list(set(links))
    return links
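Note that get_post_links expects the CSS class of the element wrapping the post content; the class we pass later for this site is prose, but it will likely differ for your theme. Here’s a quick example of running both helpers on a single post (the URL is just a placeholder):
soup = get_soup('https://ak-codes.com/some-post/')  # placeholder post URL
if soup:
    print(get_title(soup))
    print(get_post_links(soup, 'prose'))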
Since get_post_links returns all the links, we still need to sort them into internal and external ones. The following function checks whether a link points to the domain of the website we’re scraping (internal) or to another website (external).
def is_internal(url, domain):
    # Relative URLs have no network location, so they always point to our own site
    if not bool(urlparse(url).netloc):
        return True
    # Absolute URLs are internal only if they contain our domain
    if domain in url:
        return True
    return False
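A couple of quick examples show how this plays out for relative, internal absolute, and external links:
domain = 'ak-codes.com'
print(is_internal('/about/', domain))                       # True, relative link
print(is_internal('https://ak-codes.com/python/', domain))  # True, same domain
print(is_internal('https://docs.python.org/3/', domain))    # False, external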
Alright, now that we have all the functions to fetch the data for the pandas dataframe, we need to put everything together. The following snippet builds a dataframe of all this data and returns it. Note that a post without any links gets a single row with empty link fields, so we can find such posts later.
def fetch_links_dataframe(sitemap_url):
    domain = urlparse(sitemap_url).netloc
    # Fall back to an empty list if the sitemap request failed
    sitemap_links = get_sitemap_links(sitemap_url) or []
    rows = []
    for link in tqdm(sitemap_links):
        soup = get_soup(link)
        if soup is None:
            continue
        title = get_title(soup)
        # 'prose' is the CSS class wrapping the post content on this site; adjust it for your theme
        link_srcs = get_post_links(soup, 'prose')
        if not link_srcs:
            # Keep a row for posts without any links so we can filter for them later
            rows.append({'url': link, 'title': title, 'post_link_href': None, 'internal': None})
            continue
        for src in link_srcs:
            rows.append({
                'url': link,
                'title': title,
                'post_link_href': src,
                'internal': is_internal(src, domain)
            })
    # Build the dataframe once instead of concatenating inside the loop
    df_links = pd.DataFrame(rows, columns=['url', 'title', 'post_link_href', 'internal'])
    return df_links
Filtering internal and external links
We’re finally ready to start filtering the links and finding which of them are internal or external. Each link is also accompanied by the URL of the post it appears on, so when we search for posts that don’t include any internal links, we’ll know exactly which post to fix.
First, let’s fetch the data and store it in a pandas dataframe.
url = 'https://ak-codes.com/post-sitemap.xml'
df = fetch_links_dataframe(url)
Next, we can filter this data for each condition we described above.
# find only internal links
df_internal = df[df['internal'] == True]
print(df_internal.head(20))
print('Number of total internal links:', df_internal['url'].count())
print('Number of posts containing internal links:', df_internal['url'].nunique())
# find only external links
df_external = df[df['internal'] == False]
print(df_external.head(20))
print('Number of total external links:', df_external['url'].count())
print('Number of posts containing external links:', df_external['url'].nunique())
# find pages without any links at all
df_no_links = df[df['internal'].isnull()]
print(df_no_links.head(20))
print('Number of pages without any links:', df_no_links['url'].nunique())
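The isnull filter above only catches posts that contain no links at all. If you also want posts that do have links but none of them internal, a small groupby over the same dataframe does the trick; this is a sketch based on the df built above:
# posts where not a single link is internal
has_internal = df.groupby('url')['internal'].apply(lambda s: (s == True).any())
df_no_internal = has_internal[~has_internal]
print('Posts without any internal links:', len(df_no_internal))
print(df_no_internal.index.tolist()[:10])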
Entire project code
Below is the entire project code, which you can also find in the GitHub repository I created for it.
import requests
import pandas as pd
from urllib.parse import urlparse
from bs4 import BeautifulSoup
from tqdm import tqdm
pd.set_option('display.max_colwidth', 100)
def get_sitemap_links(url, all_links=None):
    # Create a fresh list on each call to avoid the mutable default argument pitfall
    if all_links is None:
        all_links = []
    response = requests.get(url)
    if response.status_code == 200:
        try:
            # The 'xml' parser requires the lxml package
            soup = BeautifulSoup(response.text, 'xml')
            # Skip image URLs, which live under wp-content
            links = [loc.text for loc in soup.find_all('loc') if 'wp-content' not in loc.text]
        except Exception:
            return
        else:
            for link in links:
                # Ignore nested sitemap files
                if link[-3:] != 'xml':
                    all_links.append(link)
            return all_links
    else:
        print(f'Request failed with status code {response.status_code}')
        return
def get_soup(url):
    try:
        # Fetch the page and parse its HTML; a failed request or parse returns None
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'html.parser')
        return soup
    except Exception:
        return
def get_title(soup):
    try:
        return soup.find('title').text
    except AttributeError:
        return

def get_post_links(soup, class_):
    # class_ is the CSS class of the element that wraps the post content
    articles = soup.find_all(class_=class_)
    links = []
    for article in articles:
        link_elms = article.find_all('a')
        for link in link_elms:
            href = link.get('href')
            # Some anchor tags have no href attribute, so skip those
            if href:
                links.append(href)
    # Remove duplicate links within the same post
    links = list(set(links))
    return links
def is_internal(url, domain):
    # Relative URLs have no network location, so they always point to our own site
    if not bool(urlparse(url).netloc):
        return True
    # Absolute URLs are internal only if they contain our domain
    if domain in url:
        return True
    return False
def fetch_links_dataframe(sitemap_url):
    domain = urlparse(sitemap_url).netloc
    # Fall back to an empty list if the sitemap request failed
    sitemap_links = get_sitemap_links(sitemap_url) or []
    rows = []
    for link in tqdm(sitemap_links):
        soup = get_soup(link)
        if soup is None:
            continue
        title = get_title(soup)
        # 'prose' is the CSS class wrapping the post content on this site; adjust it for your theme
        link_srcs = get_post_links(soup, 'prose')
        if not link_srcs:
            # Keep a row for posts without any links so we can filter for them later
            rows.append({'url': link, 'title': title, 'post_link_href': None, 'internal': None})
            continue
        for src in link_srcs:
            rows.append({
                'url': link,
                'title': title,
                'post_link_href': src,
                'internal': is_internal(src, domain)
            })
    # Build the dataframe once instead of concatenating inside the loop
    df_links = pd.DataFrame(rows, columns=['url', 'title', 'post_link_href', 'internal'])
    return df_links
if __name__ == '__main__':
    url = 'https://ak-codes.com/post-sitemap.xml'
    df = fetch_links_dataframe(url)

    # find only internal links
    df_internal = df[df['internal'] == True]
    print(df_internal.head(20))
    print('Number of total internal links:', df_internal['url'].count())
    print('Number of posts containing internal links:', df_internal['url'].nunique())

    # find only external links
    df_external = df[df['internal'] == False]
    print(df_external.head(20))
    print('Number of total external links:', df_external['url'].count())
    print('Number of posts containing external links:', df_external['url'].nunique())

    # find pages without any links at all
    df_no_links = df[df['internal'].isnull()]
    print(df_no_links.head(20))
    print('Number of pages without any links:', df_no_links['url'].nunique())
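Since part of the point of collecting this data is to save it for further analysis, you can also export the dataframe, for example to a CSV file (the filename here is arbitrary):
df.to_csv('post_links.csv', index=False)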
Conclusion
To conclude, we built a simple web scraping tool to fetch and classify internal and external links using Python. I learned a lot while working on this project, and I hope you’ll find it helpful as well.