
How To Find Internal And External Links In Posts With Python

In this post, we’ll build a Python script that finds all internal and external links in blog posts. We’ll store the results in a pandas dataframe so we can analyze the data further and save it.

The purpose of this extraction and analysis is to find opportunities to improve the SEO performance of a post, and by extension of the entire website. Internal linking also gives visitors a better user experience by recommending related content within the post they landed on.

Better yet, this tool lets us find the posts that don’t include any internal links, which is a good starting point for making SEO improvements.

Prerequisites

First of all, as with any Python project, we need to import all the necessary modules and tools. This project involves fetching, parsing, and storing HTML DOM data from a website.

We’ll use BeautifulSoup, a well-known web scraping library, to extract the links from posts; it’s the heart of this project. You can install it with pip, along with lxml (which BeautifulSoup needs for its XML parser) and the other packages we import below.

pip install beautifulsoup4 lxml requests pandas tqdm

And here are all the modules we’ll need for this project.

import requests
import pandas as pd
from urllib.parse import urlparse
from bs4 import BeautifulSoup
from tqdm import tqdm

We’re also going to increase the maximum display width for columns in pandas dataframes, so that whole links are visible when we print out the dataframe of links and their respective information.

pd.set_option('display.max_colwidth', 100)

Get URL addresses of all posts

Before we can begin scraping posts, we first need to get the URLs of all the posts on the blog. We can achieve this by scraping the post sitemap.

However, I discovered that if we simply scrape all the links from this sitemap, we’ll also end up with the image links for each post. Therefore, we need to exclude those before storing the post URLs in a list.

The following snippet shows a function that accomplishes this and returns a list of all post URLs.

def get_sitemap_links(url, all_links=None):
    # Avoid a mutable default argument: create a fresh list per call
    if all_links is None:
        all_links = []

    response = requests.get(url)
    if response.status_code == 200:
        try:
            # The 'xml' parser requires lxml to be installed
            soup = BeautifulSoup(response.text, 'xml')
            # Skip image URLs, which live under wp-content
            links = [loc.text for loc in soup.find_all('loc') if 'wp-content' not in loc.text]
        except Exception:
            return
        else:
            # Keep only post URLs, not nested sitemap files
            for link in links:
                if not link.endswith('.xml'):
                    all_links.append(link)
            return all_links
    else:
        print(f'Request failed with status code {response.status_code}')
        return
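
To make sure the sitemap parsing works before moving on, you can call the function on its own and inspect the first few URLs it returns. This is just a quick check, using the same sitemap URL we’ll use later.

post_urls = get_sitemap_links('https://ak-codes.com/post-sitemap.xml')
# get_sitemap_links returns None if the request or parsing failed
if post_urls:
    print(len(post_urls), 'post URLs found')
    print(post_urls[:5])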

Scraping post content

Once we have links to all the posts, we can start requesting each page and scraping it for links. The following snippet makes a GET request to the page and creates a BeautifulSoup object, which we’ll use to find the links later.

def get_soup(url):
    try:
        response = requests.get(url)
        # Parse the page HTML so we can search it for links later
        soup = BeautifulSoup(response.content, 'html.parser')
        return soup
    except Exception:
        return

Next, we’ll write a couple of functions to extract information from this object, which we’ll use to populate a dataframe. One function fetches the title of the post, while the other handles the links.

def get_title(soup):
    # Return the post title, or None if the page has no <title> tag
    try:
        return soup.find('title').text
    except AttributeError:
        return

def get_post_links(soup, class_):
    # Collect the href of every anchor inside elements with the given class
    articles = soup.find_all(class_=class_)
    links = []
    for article in articles:
        for link in article.find_all('a'):
            href = link.get('href')
            if href:
                links.append(href)

    # Deduplicate the links
    return list(set(links))
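
As a quick sanity check, you can run these on a single post URL from the sitemap list we fetched above. Note that 'prose' is the CSS class of the post content container on my blog; swap in whatever class wraps the post body on your site.

soup = get_soup(post_urls[0])
if soup:
    print(get_title(soup))
    for href in get_post_links(soup, 'prose'):
        print(href)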

Since the function above returns all the links, we still need to sort them into internal and external ones. The following function checks whether a link points to the website we’re scraping (internal) or to another website (external).

def is_internal(url, domain):
    netloc = urlparse(url).netloc
    # Relative URLs (no netloc) always point to the same site
    if not netloc:
        return True
    # Otherwise compare the link's domain against the site's domain
    return domain in netloc
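
A couple of quick examples of how this behaves (the external URL here is just an illustrative placeholder):

print(is_internal('/about/', 'ak-codes.com'))                          # True, relative URL
print(is_internal('https://ak-codes.com/some-post/', 'ak-codes.com'))  # True, same domain
print(is_internal('https://example.com/', 'ak-codes.com'))             # False, different domain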

Alright, now that we have all the functions we need to fetch the data for the pandas dataframe, we can put it all together. The following snippet builds a dataframe of all this data and returns it.

def fetch_links_dataframe(sitemap_url):

    domain = urlparse(sitemap_url).netloc

    sitemap_links = get_sitemap_links(sitemap_url) or []
    rows = []

    for link in tqdm(sitemap_links):
        soup = get_soup(link)
        if soup is None:
            continue
        title = get_title(soup)
        link_srcs = get_post_links(soup, 'prose')
        if not link_srcs:
            # Keep a placeholder row so posts without any links still show up
            rows.append({'url': link, 'title': title, 'post_link_href': None, 'internal': None})
            continue
        for src in link_srcs:
            rows.append({
                'url': link,
                'title': title,
                'post_link_href': src,
                'internal': is_internal(src, domain)
            })

    # Build the dataframe in one go instead of concatenating inside the loop
    df_links = pd.DataFrame(rows, columns=['url', 'title', 'post_link_href', 'internal'])

    return df_links

We’re finally ready to start filtering links and finding which of them are internal and which are external. Each link is also accompanied by the URL of the post it appears on, so when we search for posts that don’t include any internal links, we’ll know which post to fix.

First, let’s fetch the data and store it in a pandas dataframe.

url = 'https://ak-codes.com/post-sitemap.xml'
df = fetch_links_dataframe(url)
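
Since the whole idea is to analyze and also save this data, you may want to dump the dataframe to a CSV at this point (the filename here is just an example):

df.to_csv('post_links.csv', index=False)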

Next, we can filter this data for each condition we described above.

# find only internal links
df_internal = df[df['internal'] == True]
print(df_internal.head(20))
print('Number of total internal links:', df_internal['post_link_href'].count())
print('Number of unique internal links:', df_internal['post_link_href'].nunique())

# find only external links
df_external = df[df['internal'] == False]
print(df_external.head(20))
print('Number of total external links:', df_external['post_link_href'].count())
print('Number of unique external links:', df_external['post_link_href'].nunique())

# find pages without any links at all
df_no_links = df[df['internal'].isnull()]
print(df_no_links.head(20))
print('Number of pages without any links:', df_no_links['url'].nunique())
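
The last filter only catches posts that contain no links whatsoever. To find every post that lacks internal links, including posts that do have links but only external ones, here’s a small sketch on top of the same dataframe using a groupby:

# posts where no link is internal (includes posts with no links at all)
has_internal = df.groupby('url')['internal'].apply(lambda s: (s == True).any())
no_internal_urls = has_internal[~has_internal].index
print('Posts without any internal links:', len(no_internal_urls))
print(list(no_internal_urls)[:20])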

Entire project code

Below is the entire project code, which you can also find in the GitHub repository I created for it.

import requests
import pandas as pd
from urllib.parse import urlparse
from bs4 import BeautifulSoup
from tqdm import tqdm

pd.set_option('display.max_colwidth', 100)

def get_sitemap_links(url, all_links=None):
    # Avoid a mutable default argument: create a fresh list per call
    if all_links is None:
        all_links = []

    response = requests.get(url)
    if response.status_code == 200:
        try:
            # The 'xml' parser requires lxml to be installed
            soup = BeautifulSoup(response.text, 'xml')
            # Skip image URLs, which live under wp-content
            links = [loc.text for loc in soup.find_all('loc') if 'wp-content' not in loc.text]
        except Exception:
            return
        else:
            # Keep only post URLs, not nested sitemap files
            for link in links:
                if not link.endswith('.xml'):
                    all_links.append(link)
            return all_links
    else:
        print(f'Request failed with status code {response.status_code}')
        return

def get_soup(url):
    try:
        response = requests.get(url)
        # Parse the page HTML so we can search it for links later
        soup = BeautifulSoup(response.content, 'html.parser')
        return soup
    except Exception:
        return

def get_title(soup):
    # Return the post title, or None if the page has no <title> tag
    try:
        return soup.find('title').text
    except AttributeError:
        return

def get_post_links(soup, class_):
    # Collect the href of every anchor inside elements with the given class
    articles = soup.find_all(class_=class_)
    links = []
    for article in articles:
        for link in article.find_all('a'):
            href = link.get('href')
            if href:
                links.append(href)

    # Deduplicate the links
    return list(set(links))

def is_internal(url, domain):
    netloc = urlparse(url).netloc
    # Relative URLs (no netloc) always point to the same site
    if not netloc:
        return True
    # Otherwise compare the link's domain against the site's domain
    return domain in netloc

def fetch_links_dataframe(sitemap_url):

    domain = urlparse(sitemap_url).netloc

    sitemap_links = get_sitemap_links(sitemap_url) or []
    rows = []

    for link in tqdm(sitemap_links):
        soup = get_soup(link)
        if soup is None:
            continue
        title = get_title(soup)
        link_srcs = get_post_links(soup, 'prose')
        if not link_srcs:
            # Keep a placeholder row so posts without any links still show up
            rows.append({'url': link, 'title': title, 'post_link_href': None, 'internal': None})
            continue
        for src in link_srcs:
            rows.append({
                'url': link,
                'title': title,
                'post_link_href': src,
                'internal': is_internal(src, domain)
            })

    # Build the dataframe in one go instead of concatenating inside the loop
    df_links = pd.DataFrame(rows, columns=['url', 'title', 'post_link_href', 'internal'])

    return df_links

if __name__ == '__main__':
    url = 'https://ak-codes.com/post-sitemap.xml'
    df = fetch_links_dataframe(url)

    # find only internal links
    df_internal = df[df['internal'] == True]
    print(df_internal.head(20))
    print('Number of total internal links:', df_internal['post_link_href'].count())
    print('Number of unique internal links:', df_internal['post_link_href'].nunique())

    # find only external links
    df_external = df[df['internal'] == False]
    print(df_external.head(20))
    print('Number of total external links:', df_external['post_link_href'].count())
    print('Number of unique external links:', df_external['post_link_href'].nunique())

    # find pages without any links at all
    df_no_links = df[df['internal'].isnull()]
    print(df_no_links.head(20))
    print('Number of pages without any links:', df_no_links['url'].nunique())

Conclusion

To conclude, we built a simple web scraping tool to fetch and filter internal and external links using Python. I learned a lot while working on this project, and I hope you’ll find it helpful as well.
