How To Check For Broken Links With Python
In this post, we’ll write a simple Python script to check a website for broken links, which can be an indispensable tool for any SEO expert.
In case you’re not well versed in the Python programming language, don’t worry: we’ll go step by step and I’ll explain everything along the way. And if you like exploring different programming concepts, there will be useful things here for you as well.
In essence, we’re going to make a script that uses recursion, the requests library, command line arguments, and more. We’ll also make it output a .csv file containing all broken links and their response status codes.
Setup
First of all, as with any Python project, we need to import all the necessary modules and functions.
import os
import csv
import requests
import argparse
from bs4 import BeautifulSoup
from tqdm import tqdm
from pprint import pprint
from urllib.parse import urlparse
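Note that requests, BeautifulSoup (installed as the beautifulsoup4 package), and tqdm are not part of the standard library, and BeautifulSoup needs the lxml package in order to use the 'xml' parser we’ll ask for below. If you don’t have them yet, you can install everything with pip:
pip install requests beautifulsoup4 lxml tqdm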
Since the script will output a file whenever it finds broken links, I usually define a constant holding the directory the script lives in. This way, the output file always ends up next to the script, wherever we happen to put it.
ROOT = os.path.dirname(__file__)
Fetch all available links
Next, we’ll write a function that fetches all the links registered in the website’s sitemap. It will use recursion, meaning the function will call itself, so that it can also collect links from any other sitemaps the main sitemap points to.
For example, the Yoast SEO plugin for WordPress generates a sitemap index that links to separate sitemaps for posts, categories, tags, and more. We need to fetch each of those nested sitemaps in turn, and that is exactly where recursion comes in handy.
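To make that concrete, here is a small, made-up sitemap index (the example.com URLs are purely illustrative) and how BeautifulSoup pulls the nested sitemap URLs out of its loc tags, just like our function will do:

from bs4 import BeautifulSoup

# a tiny, hypothetical sitemap index similar to what Yoast generates
sample = """<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/post-sitemap.xml</loc></sitemap>
  <sitemap><loc>https://example.com/page-sitemap.xml</loc></sitemap>
</sitemapindex>"""

soup = BeautifulSoup(sample, 'xml')
print([loc.text for loc in soup.find_all('loc')])
# ['https://example.com/post-sitemap.xml', 'https://example.com/page-sitemap.xml']

Each of those entries ends in .xml, which is the cue our function uses to recurse into it instead of treating it as a regular page. With that in mind, here is the function: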
def get_sitemap_links(url, all_links=None):
    # use None instead of a mutable default argument, so repeated calls start fresh
    if all_links is None:
        all_links = []
    response = requests.get(url)
    if response.status_code == 200:
        try:
            soup = BeautifulSoup(response.text, 'xml')
            links = [loc.text for loc in soup.find_all('loc')]
        except Exception:
            return
        else:
            for link in links:
                # nested sitemaps end in .xml, so recurse into them
                if link.endswith('.xml'):
                    get_sitemap_links(link, all_links)
                else:
                    all_links.append(link)
            return all_links
    else:
        print(f'Request failed with status code {response.status_code}')
        return
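If you want to sanity check this function on its own before wiring up the command line part, you can call it directly; the sitemap URL below is just a placeholder:

# quick manual test of get_sitemap_links (placeholder URL)
links = get_sitemap_links('https://example.com/sitemap_index.xml')
if links:
    print(f'Found {len(links)} page links')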
Check for broken links
Once we have all the available links, we’ll request each one and check whether its response status code indicates an error. If it does, we’ll save that link and its response code into a list. After the script has finished checking all the links, it will write this list to a .csv file in the same directory as the script.
def save_broken_urls(base, urls):
    broken_urls = []
    # check for broken urls
    for url in tqdm(urls, desc='Checking links'):
        response = requests.get(url)
        if response.status_code != 200:
            broken_urls.append([url, response.status_code])
    # write broken urls into a csv file if they exist
    if broken_urls:
        pprint(broken_urls)
        with open(os.path.join(ROOT, f'broken_links_{base}.csv'), 'w', newline='') as csv_file:
            writer = csv.writer(csv_file)
            writer.writerows(broken_urls)
    else:
        print("Your website doesn't have any broken links.")
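A couple of things worth knowing about this approach: requests.get downloads every page in full just to read its status code, and a slow or hung server will stall the loop, since no timeout is set. If that becomes a problem on a large site, a variant along these lines could help; it is only a sketch, because some servers don’t handle HEAD requests properly:

# sketch: lighter status checks using HEAD requests and a timeout
for url in tqdm(urls, desc='Checking links'):
    try:
        response = requests.head(url, allow_redirects=True, timeout=10)
    except requests.RequestException as error:
        # connection errors and timeouts count as broken too
        broken_urls.append([url, str(error)])
        continue
    if response.status_code != 200:
        broken_urls.append([url, response.status_code])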
The save_broken_urls function above takes care of the checking; however, we still need to put both functions to use. We’ll call them inside the if __name__ == '__main__': block, where the top-level code executes.
We’ll also use a command line argument to pass in the URL of the website’s sitemap, so we can run the script straight from the command line.
if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Check website for broken links.')
    parser.add_argument('sitemap', help='Supply the sitemap link of the website.')
    args = parser.parse_args()
    sitemap_url = args.sitemap
    base = urlparse(sitemap_url).netloc
    urls = get_sitemap_links(sitemap_url)
    if urls:
        save_broken_urls(base, urls)
Alright, now we’re ready to run the script from the command line. In case you’re not sure how to do this, here is the general format, with the sitemap URL passed as the only argument.
python name-of-python-file.py sitemap-url
And here is a simple example.
python broken_link_checker.py https://ak-codes.com/sitemap_index.xml
While the script is doing its thing, you’ll be able to track its progress with the progress bar from the tqdm module.
And finally, here is the entire code of this project.
import os
import csv
import requests
import argparse
from bs4 import BeautifulSoup
from tqdm import tqdm
from pprint import pprint
from urllib.parse import urlparse

ROOT = os.path.dirname(__file__)

def get_sitemap_links(url, all_links=None):
    # use None instead of a mutable default argument, so repeated calls start fresh
    if all_links is None:
        all_links = []
    response = requests.get(url)
    if response.status_code == 200:
        try:
            soup = BeautifulSoup(response.text, 'xml')
            links = [loc.text for loc in soup.find_all('loc')]
        except Exception:
            return
        else:
            for link in links:
                # nested sitemaps end in .xml, so recurse into them
                if link.endswith('.xml'):
                    get_sitemap_links(link, all_links)
                else:
                    all_links.append(link)
            return all_links
    else:
        print(f'Request failed with status code {response.status_code}')
        return

def save_broken_urls(base, urls):
    broken_urls = []
    # check for broken urls
    for url in tqdm(urls, desc='Checking links'):
        response = requests.get(url)
        if response.status_code != 200:
            broken_urls.append([url, response.status_code])
    # write broken urls into a csv file if they exist
    if broken_urls:
        pprint(broken_urls)
        with open(os.path.join(ROOT, f'broken_links_{base}.csv'), 'w', newline='') as csv_file:
            writer = csv.writer(csv_file)
            writer.writerows(broken_urls)
    else:
        print("Your website doesn't have any broken links.")

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Check website for broken links.')
    parser.add_argument('sitemap', help='Supply the sitemap link of the website.')
    args = parser.parse_args()
    sitemap_url = args.sitemap
    base = urlparse(sitemap_url).netloc
    urls = get_sitemap_links(sitemap_url)
    if urls:
        save_broken_urls(base, urls)
Conclusion
In conclusion, we made a simple broken link checker with Python. I learned a lot while working on this project, and I hope you’ll find it helpful as well.