How to audit a robots.txt file for crawling issues

A robots.txt file is a plain text file that tells web crawlers which pages or files they should access or ignore while crawling a website.

The robots.txt file matters because URLs that cannot be crawled by Googlebot cannot be found on Google search.

A robots.txt file can look like this:

User-agent: Swiftbot
Allow: /latest/lpi-refunds.html
Allow: /latest/loan-repayment-cover-refund.html
Allow: /latest/loan-protection-insurance-refund.html

# /robots.txt file for https://www.commbank.com.au/
User-agent: *
Disallow: /references/
Disallow: /merchants/
Disallow: /mobile/

#Blog
Disallow: /blog.feed

#PDFs
Disallow: /commbank/assets/about/careers/B481-Assessment-Centre-Flyer.pdf

Sitemap: https://www.commbank.com.au/sitemap.xml
Sitemap: https://www.commbank.com.au/content/dam/commbank/root/articles-sitemap.xml

When configured poorly, this publicly accessible file can cause serious crawling issues for a website.

For example, when I started a new role in-house, I came across the following:

User-agent: *
Disallow: /

And this immediately rang alarm bells.

This is why robots.txt is one of the first things I look at when doing a technical SEO sweep.

In this guide, I am going to walk you through some of the most common issues you should look for in a robots.txt file.

You can follow along by using my technical SEO audit checklist.

Let’s get started!

Is there a robots.txt file?

Background information:

  • Not all websites have a robots.txt. In such cases, search engines will crawl everything because no rules have been defined.
  • This is not a bad thing and is perfectly normal for smaller websites with fewer than a hundred webpages.
  • Larger websites with thousands of URLs can benefit from having a robots.txt file.
  • But when configured incorrectly, a robots.txt can easily prevent a website from being found on Google search.

How to check if a website has a robots.txt file:

  1. Append “/robots.txt” to your root domain (e.g., www.danielkcheung.com/robots.txt)
  2. If the page loads in your browser, a robots.txt file exists
    • In this instance, mark ‘yes’ in the checklist
  3. If you get a 404 error, a robots.txt file does not exist
    • In this instance, mark ‘no’ in the checklist
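
If you prefer to script this check rather than eyeball it, here is a minimal sketch in Python (it uses the same example domain as above; swap in the site you are auditing):

import urllib.request
from urllib.error import HTTPError

SITE = "https://www.danielkcheung.com"  # replace with the site you are auditing

try:
    # Request /robots.txt and look at the HTTP status code
    with urllib.request.urlopen(SITE + "/robots.txt") as response:
        print(f"robots.txt found (HTTP {response.status}) - mark 'yes'")
except HTTPError as error:
    if error.code == 404:
        print("No robots.txt found (HTTP 404) - mark 'no'")
    else:
        print(f"Unexpected response: HTTP {error.code}")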

Possible answers:

  • yes
  • no
  • not applicable

What this means:

> If ‘yes’, review if any crawlers and search engines have been blocked.

>> If ‘no’, consider creating and uploading a robots.txt file, especially if you suspect Google is having difficulty discovering your URLs.

Are crawlers or search engine crawlers being blocked in the robots.txt file?

Background information:

  • The role of the robots.txt file is to instruct one or more search engines which files and URL paths it can crawl
  • You can apply rules to all search engines or give specific rules to a particular search engine
  • Therefore, it is perfectly normal to see certain crawlers being blocked from certain files and URLs

How to check if robots.txt is blocking Google using Search Console:

Note: the following does not work for Domain properties in GSC – if you don’t have a URL-prefix property, use the manual eyeball test below

  1. Go to Google’s Robots Testing Tool
  2. Scroll through the robots.txt code to locate any highlighted syntax warnings and logic errors
  3. Type in the URL of a page on your site in the text box at the bottom of the page
  4. Select the user-agent you want to simulate in the dropdown list to the right of the text box
  5. Click the TEST button to test access
  6. Check to see if TEST button now reads ACCEPTED or BLOCKED to find out if the URL you entered is blocked from Google web crawlers
  7. Edit the file on the page and retest as necessary.
  8. Copy your changes to your robots.txt file on your site.

How to check if robots.txt is blocking any search engine manually:

  1. Open the robots.txt file in your browser
  2. See if any of the following instances exist in the file:
    • User-agent: Googlebot
      Disallow: /
    • User-agent: Bingbot
      Disallow: /
    • User-agent: *
      Disallow: /
  3. If you see any of the above instances, mark ‘yes’ in the checklist
  4. If you do not see any user-agents being blocked, mark ‘no’ in the checklist.

For example:

User-agent: *
Disallow: /admin
Disallow: /cart
Disallow: /orders
Disallow: /checkouts/
Disallow: /blogs/*+*
Disallow: /blogs/*%2B*
Disallow: /blogs/*%2b*
Disallow: /*/blogs/*+*
Disallow: /*/blogs/*%2B*

Possible answers:

  • yes
  • no
  • not applicable

What this means:

> If ‘yes’, one or more crawlers or search engines are being blocked. This is not necessarily wrong and your next step is to understand what is being blocked and why.

>> If ‘no’, the robots.txt is not causing any crawling issues.

Are paginated URLs being blocked in robots.txt?

How to check if paginated series are being blocked in robots.txt:

  1. Open the robots.txt file in your web browser
  2. See if any of the following instances exist in the file:
    • Disallow: /blog-page/page
    • Disallow: /?page=
    • Disallow: /&page=
  3. If you see any of these in the robots.txt file, check ‘yes’ in the spreadsheet.
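
A quick way to surface these rules is to scan every Disallow line for pagination-style patterns. The patterns below simply mirror the examples in the step above and are assumptions, so adjust them to the site’s URL structure (the domain is a placeholder):

import urllib.request

SITE = "https://www.example.com"  # placeholder - use the site you are auditing
PAGINATION_PATTERNS = ("?page=", "&page=", "/page")

with urllib.request.urlopen(SITE + "/robots.txt") as response:
    robots_txt = response.read().decode("utf-8", errors="replace")

# Flag any Disallow rule that contains a pagination-style pattern
for line in robots_txt.splitlines():
    rule = line.strip()
    if rule.lower().startswith("disallow:") and any(p in rule for p in PAGINATION_PATTERNS):
        print("Possible paginated URLs blocked:", rule)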

What this means:

> If ‘yes’, verify with the site owner why this has happened. Paginated series should not usually be blocked in robots.txt although there are specific scenarios where this may be done.

>> If ‘no’, the robots.txt is not causing any crawling or indexing issues in relation to deeper URLs found on paginated series.

Are JavaScript files being blocked in robots.txt?

How to find out if JS is being blocked in robots.txt using Screaming Frog:

  1. Go to CONFIGURATION > SPIDER > RENDERING and change rendering from “Text Only” to “JavaScript”
  2. Change the AJAX TIMEOUT from 5 seconds to 8 seconds, then click the OK button
  3. Crawl the entire site, a section within the website, or a sample of the website
  4. Select any HTML page that has been crawled and navigate to the RENDERED PAGE tab
  5. See if the content of the page has loaded
    • If the page does not render correctly in the RENDERED PAGE tab, a blocked resource may be the cause, so proceed to the next step
    • If the page renders correctly, still proceed to the next step to verify whether any JS files are blocked in the robots.txt file
  6. Go to BULK EXPORT > RESPONSE CODES > BLOCKED BY ROBOTS.TXT INLINKS, open the .csv file, and scan for JS files that have been blocked in robots.txt

How to find out if JS is being blocked in robots.txt manually:

Note: For most WordPress websites, JavaScript comes from the theme and is typically housed in ../wp-content/themes/yourTheme. Therefore, load the robots.txt file in a web browser and see if /wp-content/ is in the file.
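
The same built-in parser used earlier can confirm this for a specific file. The theme path below is purely hypothetical, so substitute a real JS URL from the site you are auditing; keep in mind the parser only applies simple prefix rules such as Disallow: /wp-content/:

from urllib.robotparser import RobotFileParser

SITE = "https://www.example.com"  # placeholder - use the site you are auditing
JS_URL = SITE + "/wp-content/themes/your-theme/assets/main.js"  # hypothetical path

parser = RobotFileParser(SITE + "/robots.txt")
parser.read()

if parser.can_fetch("Googlebot", JS_URL):
    print("Googlebot can crawl the JS file")
else:
    print("The JS file is blocked in robots.txt")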

Possible answers:

  • yes
  • no
  • not applicable

What this means:

> If ‘yes’, JS is being blocked and this may be making it harder for Google to render the content on the pages you want to rank.

>> If ‘no’, JS files are not being blocked from crawlers in the robots.txt file. If the website is experiencing indexing issues, this is not the cause.

Does the robots.txt file disallow parameterised URLs?

How to check if parameterised URLs are blocked in robots.txt:

  1. Open the robots.txt file in your web browser
  2. See if any of the following instances exist in the file:
    • User-agent: *
      Disallow: /products/t-shirts?
    • User-agent: *
      Disallow: /products/jackets?
  3. If you see either of these patterns, check ‘yes’ in the spreadsheet.

Possible answers:

  • yes
  • no
  • not applicable

What this means:

> If ‘yes’, one or more parameterised URLs are blocked in robots.txt. In most cases, parameterised URLs should not be disallowed in the robots.txt file because Googlebot cannot see the rel=canonical on a URL it is blocked from crawling.

>> If ‘no’, the robots.txt file is not blocking parameterised URLs; if the website is experiencing indexing issues, the cause lies elsewhere.

Is the sitemap or sitemap index URL referenced in the robots.txt file?

Background information:

A robots.txt file can point crawlers to one or more XML sitemaps using the Sitemap directive, for example:

User-agent: *
Disallow: /references/
Disallow: /merchants/
Disallow: /mobile/

Sitemap: https://www.commbank.com.au/sitemap.xml
Sitemap: https://www.commbank.com.au/content/dam/commbank/root/articles-sitemap.xml

How to check if robots.txt mentions the sitemap address:

  1. Open the robots.txt file in your browser
  2. Search for “sitemap” and see if a full URL has been provided
    • If you see this, check ‘yes’ in the spreadsheet
    • If you do not find “sitemap” in the robots.txt, check ‘no’ in the spreadsheet
    • If there is no robots.txt file, check ‘not applicable’ in the spreadsheet
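
The same search can be scripted by printing any Sitemap lines found in the file (the domain is a placeholder):

import urllib.request

SITE = "https://www.example.com"  # placeholder - use the site you are auditing

with urllib.request.urlopen(SITE + "/robots.txt") as response:
    robots_txt = response.read().decode("utf-8", errors="replace")

# Collect every line that declares a sitemap
sitemaps = [line.strip() for line in robots_txt.splitlines()
            if line.strip().lower().startswith("sitemap:")]
print(sitemaps if sitemaps else "No sitemap referenced - mark 'no'")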

Possible answers:

  • yes
  • no
  • not applicable

What this means:

> If ‘yes’, the sitemap is referenced in robots.txt and crawlers such as Googlebot can easily find and crawl it.

>> If ‘no’, the sitemap is not referenced in robots.txt. As long as the sitemap has been submitted to Google Search Console, this is not a red flag. If no sitemap has been submitted to GSC, address this ASAP, especially if crawling and indexing are an issue.
