Finding What Pages Are Getting Indexed By Google Using Python

In this post, we’re going to build a Python script that finds pages on our website that are not indexed by Google. Along the way, we’ll also check for the opposite: indexed pages that don’t appear in the sitemap.

The purpose of finding such pages is so we can fix them and start driving new visitors via Google: only indexed pages are eligible to appear in search results. Indexing coverage plays a significant role in the SEO performance of your website, so every site owner should run a regular checkup of their pages’ index statuses to identify potential issues.

On the other hand, we can have the opposite problem, where Google indexes paginated content. This can put a huge quantity of low-performing pages into the index, such as paginated archive pages of posts from a certain category.

Setting up Google Search Console API

The Google Search Console API will be instrumental for our project, because it gives us data about the pages Google has indexed. So without further ado, follow this step-by-step guide to get it working.

  1. Go to Google Cloud Console
  2. Create a project
  3. Search for Google Search Console API in the API library and enable it
  4. Go to Credentials and create a service account
  5. Add a key under service account options
  6. Download the JSON file of the credentials
  7. Go to Google Search Console
  8. Add a user under Settings > Users and permissions
  9. Paste the service account email address as the new user and grant it Full permission

After you set up your Google Search Console API access, copy the credentials JSON file you downloaded into your project folder. For convenience, I suggest renaming it to something simple like credentials.json.

Let’s get coding

First of all, like with any other Python project we do on this blog, we’re going to import all the necessary modules. To access the Google Search Console API, we need to install the Google API Python Client, and since we’ll also be fetching and parsing XML sitemaps, we need requests, pandas, and BeautifulSoup with the lxml parser. The following pip command takes care of all of them.

pip install --upgrade google-api-python-client requests pandas beautifulsoup4 lxml

With the dependencies installed, here are the imports:
import os
import requests
import pandas as pd
from datetime import date
from bs4 import BeautifulSoup
from google.oauth2 import service_account
from googleapiclient.discovery import build

Next, we’ll set up a few constants, which will hold the project root path and the information needed to authenticate against the Google Search Console API.

We’re also going to raise the maximum column width for the pandas module, so we’ll be able to see full URLs when we print out a dataframe in the terminal.

ROOT = os.path.dirname(__file__)

API_SERVICE_NAME = 'webmasters'
API_VERSION = 'v3'
SCOPE = [
    'https://www.googleapis.com/auth/webmasters.readonly'
]

pd.set_option('display.max_colwidth', 100)

Next, we’re going to define two methods for accessing and retrieving data from the Google Search Console API.

The first one handles the authentication process and returns a service object, through which we can send queries to the API. The second one uses that service object to query the historical Search Analytics data of our website and shape the response into a pandas dataframe.

def auth_service(credentials_path):

    credentials = service_account.Credentials.from_service_account_file(
        credentials_path,
        scopes=SCOPE
    )

    service = build(API_SERVICE_NAME, API_VERSION, credentials=credentials)

    return service

def query(service, url, payload):
    response = service.searchanalytics().query(siteUrl=url, body=payload).execute()

    results = []

    # 'rows' is missing from the response when there is no data for the query
    for row in response.get('rows', []):
        data = {}

        # each entry in row['keys'] corresponds to a requested dimension
        for i, dimension in enumerate(payload['dimensions']):
            data[dimension] = row['keys'][i]

        data['clicks'] = row['clicks']
        data['impressions'] = row['impressions']
        data['ctr'] = round(row['ctr'] * 100, 2)  # fraction -> percentage
        data['position'] = round(row['position'], 2)

        results.append(data)

    return pd.DataFrame.from_dict(results)
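To see what the flattening step does, here is a standalone sketch that runs the same row-parsing logic on a mocked API response — the URLs and numbers below are made up for illustration and are not real Search Console data:

```python
import pandas as pd

# a mocked Search Analytics response with the same shape the real API returns
response = {
    'rows': [
        {'keys': ['https://example.com/post-1/'], 'clicks': 12,
         'impressions': 340, 'ctr': 0.0353, 'position': 8.42},
        {'keys': ['https://example.com/post-2/'], 'clicks': 0,
         'impressions': 25, 'ctr': 0.0, 'position': 31.7},
    ]
}
dimensions = ['page']

results = []
for row in response.get('rows', []):
    # one flat dict per row: dimensions first, then the metrics
    data = {dimensions[i]: row['keys'][i] for i in range(len(dimensions))}
    data['clicks'] = row['clicks']
    data['impressions'] = row['impressions']
    data['ctr'] = round(row['ctr'] * 100, 2)  # fraction -> percentage
    data['position'] = round(row['position'], 2)
    results.append(data)

df = pd.DataFrame.from_dict(results)
print(df)
```

Each API row becomes one dataframe row, with the page URL as a regular column, which is what makes the later merge against the sitemap dataframe straightforward.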

Now that we have our API functions ready, we need to collect all the links from our sitemap, so we can later compare them against the URLs reported by Google Search Console.

The method we’re going to define crawls the sitemap recursively: whenever a sitemap index points to further sitemaps, it follows them, so we end up with the links from the sitemaps of all post types.

def get_sitemap_links(url, all_links=None):
    # avoid a mutable default argument, which would persist between calls
    if all_links is None:
        all_links = []

    response = requests.get(url)
    if response.status_code == 200:
        try:
            soup = BeautifulSoup(response.text, 'xml')
            links = [loc.text for loc in soup.find_all('loc')]
        except Exception:
            return
        else:
            for link in links:
                # nested sitemaps are crawled recursively, page URLs are collected
                if link.endswith('.xml'):
                    get_sitemap_links(link, all_links)
                else:
                    all_links.append(link)

            return all_links
    else:
        print(f'Request failed with status code {response.status_code}')
        return

Alright, now it’s time to put everything into action: authentication, querying, sitemap crawling, and finally comparing which links appear in the sitemap but not in Google Search Console.
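The comparison at the heart of the script is plain set arithmetic. On toy data (the URLs below are made up), it looks like this:

```python
# made-up example URLs standing in for real sitemap and Search Console data
sitemap_links = ['https://example.com/a/', 'https://example.com/b/',
                 'https://example.com/c/']
gsc_links = ['https://example.com/b/', 'https://example.com/c/',
             'https://example.com/old-tag/']

# union of everything we know about
all_links = set(sitemap_links) | set(gsc_links)

# in the sitemap but never shown by Google - likely not indexed
not_indexed = set(sitemap_links) - set(gsc_links)

# shown by Google but missing from the sitemap - potential index bloat
index_bloat = set(gsc_links) - set(sitemap_links)

print(sorted(not_indexed))   # ['https://example.com/a/']
print(sorted(index_bloat))   # ['https://example.com/old-tag/']
```

The script below does exactly this, just with real data on both sides.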

if __name__ == '__main__':
    
    url = 'https://ak-codes.com/sitemap_index.xml'

    sitemap_links = get_sitemap_links(url)
    df_sitemap = pd.DataFrame(sitemap_links, columns=['page'])

    print('Total sitemap links:', len(sitemap_links))
    print(df_sitemap.head(20))

    payload = {
        'startDate': '2023-01-01',
        'endDate': date.today().strftime('%Y-%m-%d'),
        'dimensions': ['page'],
        'rowLimit': 10000,
        'startRow': 0
    }

    service = auth_service(os.path.join(ROOT, 'credentials.json'))

    site_url = service.sites().list().execute()['siteEntry'][0]['siteUrl']
    df_gsc = query(service, site_url, payload)
    print(df_gsc.head(20))

    df_merged = pd.merge(df_gsc, df_sitemap, how='right', on=['page'])
    print(df_merged.head(20))

    print(df_sitemap.shape, df_gsc.shape, df_merged.shape)

    df_no_clicks = df_merged[df_merged['clicks'] < 1]
    df_no_clicks = df_no_clicks.sort_values(by='impressions', ascending=False)
    print(df_no_clicks)

    gsc_links = df_gsc['page'].tolist()
    all_links = list(set(sitemap_links + gsc_links))
    print('Total links:', len(all_links))

    shared_links = list(set(sitemap_links).intersection(set(gsc_links)))
    print('Total shared links:', len(shared_links))

    # links in sitemap but not in Google Search Console - non-ranking pages
    not_indexed = list(set(all_links).difference(set(gsc_links)))
    print('Total not indexed pages:', len(not_indexed))

    df_not_indexed = pd.DataFrame(not_indexed, columns=['page'])
    print(df_not_indexed)
    df_not_indexed.to_csv(os.path.join(ROOT, 'not indexed.csv'))
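One caveat: the payload above fetches at most rowLimit rows in a single request, so very large sites can exceed it. The Search Analytics API supports paging by advancing startRow until a request returns fewer rows than requested. Here is a minimal sketch of that loop, written against a stand-in fetch function so it runs without API access — fetch_page is a placeholder for a call like service.searchanalytics().query(siteUrl=url, body=page_payload).execute():

```python
def fetch_all_rows(fetch_page, payload, page_size=1000):
    """Page through a Search Analytics-style endpoint by advancing startRow."""
    rows = []
    start_row = 0
    while True:
        page_payload = {**payload, 'rowLimit': page_size, 'startRow': start_row}
        response = fetch_page(page_payload)
        batch = response.get('rows', [])
        rows.extend(batch)
        if len(batch) < page_size:  # a short batch means we've reached the end
            break
        start_row += page_size
    return rows

# stand-in for the real API call, serving 2500 fake rows in pages
def fake_fetch(payload):
    data = [{'keys': [f'https://example.com/p{i}/']} for i in range(2500)]
    start = payload['startRow']
    return {'rows': data[start:start + payload['rowLimit']]}

print(len(fetch_all_rows(fake_fetch, {'dimensions': ['page']})))  # 2500
```

Dropping the real API call in place of fake_fetch gives you the complete result set regardless of how many pages your site has.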

Bonus

We can also check which URLs appear in Google Search Console but aren’t present in the sitemap. This indicates whether you have the opposite problem to non-ranking pages, otherwise known as search index bloat.

    # links in Google Search Console but not in sitemap - index bloat
    index_bloat = list(set(all_links).difference(set(sitemap_links)))
    print('Total index bloat pages:', len(index_bloat))
    df_index_bloat = pd.DataFrame(index_bloat, columns=['page'])
    print(df_index_bloat)
    df_index_bloat.to_csv(os.path.join(ROOT, 'index bloat.csv'))

Complete project code

import os
import requests
import pandas as pd
from datetime import date
from bs4 import BeautifulSoup
from google.oauth2 import service_account
from googleapiclient.discovery import build

ROOT = os.path.dirname(__file__)

API_SERVICE_NAME = 'webmasters'
API_VERSION = 'v3'
SCOPE = [
    'https://www.googleapis.com/auth/webmasters.readonly'
]

pd.set_option('display.max_colwidth', 100)

def get_sitemap_links(url, all_links=None):
    # avoid a mutable default argument, which would persist between calls
    if all_links is None:
        all_links = []

    response = requests.get(url)
    if response.status_code == 200:
        try:
            soup = BeautifulSoup(response.text, 'xml')
            links = [loc.text for loc in soup.find_all('loc')]
        except Exception:
            return
        else:
            for link in links:
                # nested sitemaps are crawled recursively, page URLs are collected
                if link.endswith('.xml'):
                    get_sitemap_links(link, all_links)
                else:
                    all_links.append(link)

            return all_links
    else:
        print(f'Request failed with status code {response.status_code}')
        return
    
def auth_service(credentials_path):

    credentials = service_account.Credentials.from_service_account_file(
        credentials_path,
        scopes=SCOPE
    )

    service = build(API_SERVICE_NAME, API_VERSION, credentials=credentials)

    return service

def query(service, url, payload):
    response = service.searchanalytics().query(siteUrl=url, body=payload).execute()

    results = []

    # 'rows' is missing from the response when there is no data for the query
    for row in response.get('rows', []):
        data = {}

        # each entry in row['keys'] corresponds to a requested dimension
        for i, dimension in enumerate(payload['dimensions']):
            data[dimension] = row['keys'][i]

        data['clicks'] = row['clicks']
        data['impressions'] = row['impressions']
        data['ctr'] = round(row['ctr'] * 100, 2)  # fraction -> percentage
        data['position'] = round(row['position'], 2)

        results.append(data)

    return pd.DataFrame.from_dict(results)

if __name__ == '__main__':
    
    url = 'https://ak-codes.com/sitemap_index.xml'

    sitemap_links = get_sitemap_links(url)
    df_sitemap = pd.DataFrame(sitemap_links, columns=['page'])

    print('Total sitemap links:', len(sitemap_links))
    print(df_sitemap.head(20))

    payload = {
        'startDate': '2023-01-01',
        'endDate': date.today().strftime('%Y-%m-%d'),
        'dimensions': ['page'],
        'rowLimit': 10000,
        'startRow': 0
    }

    service = auth_service(os.path.join(ROOT, 'credentials.json'))

    site_url = service.sites().list().execute()['siteEntry'][0]['siteUrl']
    df_gsc = query(service, site_url, payload)
    print(df_gsc.head(20))

    df_merged = pd.merge(df_gsc, df_sitemap, how='right', on=['page'])
    print(df_merged.head(20))

    print(df_sitemap.shape, df_gsc.shape, df_merged.shape)

    df_no_clicks = df_merged[df_merged['clicks'] < 1]
    df_no_clicks = df_no_clicks.sort_values(by='impressions', ascending=False)
    print(df_no_clicks)

    gsc_links = df_gsc['page'].tolist()
    all_links = list(set(sitemap_links + gsc_links))
    print('Total links:', len(all_links))

    shared_links = list(set(sitemap_links).intersection(set(gsc_links)))
    print('Total shared links:', len(shared_links))

    # links in sitemap but not in Google Search Console - non-ranking pages
    not_indexed = list(set(all_links).difference(set(gsc_links)))
    print('Total not indexed pages:', len(not_indexed))

    df_not_indexed = pd.DataFrame(not_indexed, columns=['page'])
    print(df_not_indexed)
    df_not_indexed.to_csv(os.path.join(ROOT, 'not indexed.csv'))

    # links in Google Search Console but not in sitemap - index bloat
    index_bloat = list(set(all_links).difference(set(sitemap_links)))
    print('Total index bloat pages:', len(index_bloat))
    df_index_bloat = pd.DataFrame(index_bloat, columns=['page'])
    print(df_index_bloat)
    df_index_bloat.to_csv(os.path.join(ROOT, 'index bloat.csv'))

Conclusion

To conclude, we made a simple Python script for checking which pages on our website aren’t getting indexed by Google, and we also demonstrated how to check whether too many pages are getting indexed.

However simple this checkup may be, it gives you information you can leverage to significantly improve the SEO performance of your website. I learned a lot while working on this project, and I hope you will find it helpful as well.