Crawling

Crawling is the indispensable first step that allows a search engine to discover any landing page on any website through a bot before storing the details of that page in its index. At their core, search engine bots, including Googlebot, discover pages to crawl by following both the inbound external links and the internal links found on pages that are already part of their index.

Crawling is not only how search engine bots discover new pages to include in the search engine index; it is also how they update the records of existing pages that may have changed since they were last crawled. In other words, any SEO change, however big or small, content- or tech-related, made to a landing page will not be reflected in the organic keyword rankings until the search engine bot recrawls the page and sees the new content or technical updates.

Upon visiting a website, bots will also attempt to read a text file placed at the domain.com/robots.txt URL to understand which pages the website administrator has asked them not to crawl. The file can also reference the sitemap.xml or sitemap-index.xml, which lists the pages the bot should indeed crawl on the website, regardless of whether those pages have internal links pointing to them, external inbound links from other websites, or none at all.

Why is Crawling Critical for SEO?

Crawling is the absolute foundation of search engine optimisation because it directly impacts a page's visibility. If a search engine bot cannot discover and crawl a page, that page essentially doesn't exist to the search engine. It cannot be added to the index, and consequently, it will never rank for search queries or drive organic traffic. All subsequent SEO efforts, from content creation and keyword optimisation to link building, are rendered entirely pointless if the page itself remains undiscovered by crawlers. Therefore, ensuring your pages are easily crawlable is the non-negotiable first step to achieving any success in search results.

Before diving into the nuances of how crawling can be controlled through the use of internal and external links, the robots.txt file and robots meta tags – it is important to bring the notion of Crawl Budget into the equation.

What is the Crawl Budget and How to Optimise It for SEO?

The crawl budget is the amount of time and resources that a search engine, like Google, allocates to crawl a website. It’s essentially the number of pages a search engine bot will crawl on a site within a certain timeframe. While Google states that crawl budget is not a concern for most small to medium-sized websites, it becomes a critical factor for large sites with tens of thousands of pages, or sites that auto-generate pages.

How the Crawl Budget is Determined

Ultimately the crawl budget is determined by Google’s crawl capacity on one side and its demand for the content of your website on the other:

  1. Crawl Capacity (or Crawl Rate Limit): The maximum number of simultaneous connections a search engine bot can use to crawl your site without degrading its performance for users. If a site responds quickly and has a healthy server, the capacity limit increases. If the site is slow or returns server errors, Googlebot will slow down its crawling.
  2. Crawl Demand: How much Google wants to crawl your site. It’s influenced by:
    • Authority: URLs that are authoritative and have more external inbound links are seen as more important and are crawled more frequently to keep them fresh in the index.
    • Staleness: Search engines aim to re-crawl pages to pick up any changes. If your content is updated frequently, crawl demand may increase.
    • Quality: Google tries to crawl all known URLs on your site. If a significant portion of these are low-value or duplicates, it can dilute the crawl demand for your important pages.

In essence, your crawl budget is the set of URLs that Googlebot can and wants to crawl.

Optimising your crawl budget means guiding search engine bots to spend their limited resources on the pages that actually need to be (re)crawled, ensuring those pages are discovered and updated promptly. Wasting this budget on unimportant pages can delay the indexing or updating of your content and, ultimately, delay getting it onto the screens of your potential website visitors.

URL Normalisation and Its Implications for Crawling in WordPress

URL normalisation is the process by which search engines convert different variations of a URL that point to the same content into a single, consistent, and preferred version. This is critical for efficient crawling because, without proper management, search engines might perceive multiple URLs for the same content as separate pages, which can lead to wasted crawl budget.

While search engines, including Google, perform various normalisation steps automatically, understanding these aspects helps you proactively manage your WordPress site's URLs and guide crawlers effectively, especially when default WordPress behaviour or plugins introduce variations.

Here are the most common scenarios for URL normalisation that, when addressed by the website owner, will positively reflect on crawling by simplifying Googlebot's job of sorting through URL variants:

  1. Standardising Protocol (HTTP vs. HTTPS) and Subdomain (www vs. non-www): Ensuring your site consistently uses https://www.yourdomain.com or https://yourdomain.com is crucial. This is typically managed via your WordPress General Settings (WordPress Address and Site Address URLs), server-level redirects (e.g. in .htaccess), and potentially SSL plugins. If your WordPress site is accessible via multiple protocols (HTTP and HTTPS) or subdomains (www and non-www) without proper redirects, search engine crawlers might find and crawl redundant versions, wasting the crawl budget. Proper 301 redirects are essential to normalise these variations to your preferred, secure version for efficient crawling.
  2. Stripping Default File Names (e.g., index.php): For WordPress, the Post name permalink structure (e.g. nurdic.com/your-post-title/) automatically handles the stripping and normalisation of default filenames like index.php from your URLs through rewrite rules (typically in your .htaccess file). This ensures that URLs for posts, pages, and categories do not include these filenames, preventing the crawl budget from being wasted on such variations. If your permalinks are set to “Plain” in WordPress' Permalink Settings (e.g., yourdomain.com/?p=123), such filenames might be exposed.
  3. Case Sensitivity: While WordPress itself might be somewhat forgiving with URL casing in some internal functions, search engines are generally case-sensitive for URL paths (e.g. yourdomain.com/AboutUs/ can be seen as different from yourdomain.com/aboutus/). Inconsistent casing in your WordPress URLs (e.g. if you manually link to /AboutUs/ in one place and /aboutus/ in another) can lead to search engines crawling both as distinct pages, wasting crawl budget. It's best practice to maintain consistent, typically lowercase, URLs across your WordPress site to ensure efficient crawling.
  4. Trailing Slashes: WordPress's permalink settings allow you to choose whether your post and page URLs end with a trailing slash (e.g., yourdomain.com/post-name/ vs. yourdomain.com/post-name). Google generally treats URLs with and without trailing slashes as the same for the root domain and subdomains, but for subdirectories (like posts or pages), they can be seen as different. Inconsistent use or mixed signals can lead to search engines crawling both versions, wasting crawl efforts. It's important to pick a consistent trailing slash preference in your WordPress settings and stick to it.
  5. Utilise the Normalised URLs in Internal Linking: It's indeed important to use normalised URLs across the website: only the secure protocol (https), either only the www or the non-www version of the domain, lowercase-only URL slugs and consistent trailing slashes (or lack thereof). However, to fully optimise the crawl budget, internal links must align with these URL normalisation standards and use only the normalised versions of the URLs, not the alternatives.
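The normalisation rules above can be sketched as a small helper function. This is a hypothetical Python illustration, not WordPress code; the choice of https, the www variant, lowercase paths and the trailing slash are assumptions standing in for your own preferences:

```python
from urllib.parse import urlsplit, urlunsplit

def normalise(url: str) -> str:
    """Normalise a URL to one preferred form: https, www, lowercase path, trailing slash."""
    scheme, netloc, path, query, fragment = urlsplit(url)
    scheme = "https"                      # enforce the secure protocol
    netloc = netloc.lower()
    if not netloc.startswith("www."):
        netloc = "www." + netloc          # prefer the www variant (a choice, not a rule)
    path = path.lower()                   # consistent lowercase paths
    if path.endswith("/index.php"):
        path = path[: -len("index.php")]  # strip the default file name
    if not path.endswith("/"):
        path += "/"                       # consistent trailing slash
    return urlunsplit((scheme, netloc, path, query, fragment))

print(normalise("http://Example.com/AboutUs"))  # https://www.example.com/aboutus/
```

In practice WordPress and your server-level redirects enforce these rules; a script like this is only useful for auditing, e.g. checking that your internal links already use the normalised form.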

Setting your own URL normalisation standards and applying them in internal linking is key to ensuring search engines discover and crawl your WordPress content efficiently, preventing wasted crawl budget.

Here are actionable steps to optimise your crawl budget by preventing the crawling of low-value URLs, aiding the crawling of high-value URLs and improving site performance.

The most significant waste of crawl budget comes from asking search engines to crawl pages that offer no unique value:

  1. Utilize robots.txt: Use your robots.txt file to block entire sections of your site that provide no SEO value, such as admin login pages, internal search results, or test environments. This is a direct command to not crawl those paths.
    • Infinite Scroll & Session IDs: Block crawling of URLs generated by session IDs or infinite scroll mechanisms that duplicate content found on other pages.
    • Faceted Navigation: E-commerce sites often generate thousands of URLs from filters and sorting options (e.g., ?sort=price_asc, ?color=blue). These pages often display the same content in a different order. Use robots.txt to block crawling of URLs with parameters that don’t create unique, valuable pages.
  2. Maintain a Clean XML Sitemap: Your sitemap should only contain URLs that you want search engines to crawl and index. Ensure it doesn’t include blocked URLs, non-canonical URLs, or redirects. Keep it updated as you add or remove content.
  3. Manage Crawlable Links:
    • Fix Broken Links (404s): When a bot hits a 404 error, it’s a wasted request. Regularly use a site crawler to find and fix broken internal links.
    • Minimize Redirect Chains: Each redirect is an extra request. A long chain of redirects (e.g. Page A -> Page B -> Page C) wastes crawl budget. Update internal links to point directly to the final destination URL.
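As a sketch of what such robots.txt blocking might look like, the following hypothetical rules block the crawling of sort and colour filter parameters. The parameter names are assumptions; any pattern must be verified against your own URL structure before deployment:

```
User-agent: *
Disallow: /*?*sort=
Disallow: /*?*color=
```

Google supports the * wildcard (any sequence of characters) and the $ anchor (end of URL) in robots.txt paths, which is what makes parameter patterns like these possible. Test any such rule carefully before relying on it.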

Apart from preventing search engines from crawling low-value pages, you can also actively encourage them to place a greater importance on crawling your high-value pages by:

  1. Strengthen Internal Linking: Pages with many internal links are seen as more important. Ensure your key pages have a strong internal linking structure so bots can easily discover and reach them.
  2. Employ Canonical Tags: Consolidate duplicate or near-duplicate pages using canonical tags (rel="canonical") to point search engines to the primary version you want indexed. While this does not necessarily prevent the crawling of non-canonical pages, search engines will place a greater importance on crawling the canonical pages instead of the non-canonical ones.

Finally, if your expertise allows it, you can further improve how effectively your crawl budget is spent by investing in your website's performance and maintaining your server health. You'll need to use Google Search Console to understand whether these efforts positively reflect on your crawl budget management.

  1. Increase Site Speed: A faster website means Googlebot can make more requests in the same amount of time without overwhelming your server. Optimize images, leverage browser caching, and use a fast web host.
  2. Maintain Server Health: A high number of server errors (5xx codes) signals to Google that your site can’t handle the crawl rate, causing it to slow down or stop. Monitor your server health and fix errors promptly.
  3. Use the Crawl Stats report in Google Search Console: Get an understanding of how Google is interacting with your site through the Googlebot.
    • In Google Search Console, navigate to Settings > Crawling > Crawl stats.
    • This report shows you the total number of crawl requests, your average response time, and any host availability issues.
    • A sudden drop in crawl requests could indicate a problem, while a spike in response time suggests your server is slowing down. Reviewing this data can help you identify and address issues before they significantly impact your SEO.

Internal Linking

As far as crawling is concerned, the use of internal links and the presence of inbound external links from other websites can have both similar and completely different implications.

Although search engine bots, including Googlebot, crawl pages on any given website by following internal links from page to page, it is generally believed that the longer the chain of links a bot must follow to discover a page, the lower the chance it will reach those deeper pages. SEO best practice on this subject suggests that the most important website URLs, if not all website URLs, should be accessible within 3 clicks from the homepage.

A side note on SEO best practice for internal linking is the number of internal links pointing to any one URL. It is generally good practice to make all website URLs accessible through more than a single link. The reasoning here is that if a single link leads to a page and Googlebot, for known or unknown reasons, fails to follow it, that page will not be crawled. That is assuming the page is also not referenced in the XML sitemap.

Internal Linking Structure: Main Navigation links, footer links, content links

To create an internal linking structure that successfully serves both users and search engines, it's important to distinguish between the different types of links: the main navigation links, the footer links and the content links. No doubt, the main navigation must accommodate the most important links for your primary users, such as your customers, while the footer must accommodate the most important links for your other audiences, such as potential employees or partners, as well as search engines.

  1. For instance, you might prefer not to show your “Careers” or “Partners” page to your customers within the main navigation, but you still want to make it readily available anywhere on the website for potential job applicants or partners. Because the footer, just like the main navigation, is by default repeated across the website, these pages will be available to your secondary audience from any page on your website.
  2. As far as search engines are concerned, these are indeed important website pages, and you will most certainly want them regularly crawled for updates, indexed and ranked. That is why it's important not only to have them on the website, but to make them accessible in the footer, which is repeated on every page of the website and, by extension, makes it practically impossible for search engines to miss or ignore them.

Assuming you have a website of at least a hundred pages, naturally you won’t be able to accommodate all those links under either the main navigation or the footer, nor should you aim to do so. The remaining less-important links are typically found within the main body of the content of various pillar pages or, on occasion, they may be part of a sidebar.

Without diving too deep into the use of internal links within the body content or the sidebar, you may employ this type of internal link to supplement the core internal linking structure set by the tandem of the main navigation and the footer, specifically to ensure that the pages on your website are reachable through more than a single link.

Internal Linking Structure: Categories and Tags as internal links

Moving on to the next type of challenge: some pages, such as blog posts, will only be accessible from the main Blog page or its paginated pages. If you have hundreds or thousands of blog posts spread across dozens or hundreds of paginated pages, it may be harder for search engine bots to reach those posts, especially considering that by default they will only be reachable through a single link.

Of course, interlinking other blog posts within the body of each blog post will be helpful in this scenario, and you should seek to link to blog posts on relevant and related topics directly in the body text. However, this can be addressed in a scalable way through the use of blog categories and tags, which are available by default in WordPress.

By design, both blog categories and tags group all of your posts related to a subject, where any one blog post can be assigned one or more categories and multiple tags. Each category and tag has a dedicated page of its own listing all of the posts associated with that category or tag. In the specific context of using the internal linking structure to benefit the crawling of blog pages, the category and any one or multiple tags associated with a blog post can be listed alongside the main content of the blog post as internal links, thus providing an alternative way for crawlers to discover blog pages.

Internal Linking Structure: Custom taxonomies as internal links to custom post types

By default, WordPress content is organised under 2 post types: Pages and Posts, where Pages are generally used for the main website content, while Posts are used for individual blog entries and are usually listed in chronological order on the main Blog page. However, depending on the complexity of your website and your desire to keep your content organised both on the front-end and the back-end, WordPress can accommodate Custom Post Types. Examples of Custom Post Types include Products, Testimonials, Case Studies, Clients and anything else you can think of.

The main benefit of creating Custom Post Types, apart from the organisational aspect, is creating one or multiple page layouts and re-using them with every new page within that particular post type. For instance, you can set the page design for a Product page once, with the help of a developer or designer, and allow a Content Manager to use that design again and again to promptly add new Products by populating various content fields within that page layout.

So if Custom Post Types (e.g. Products, Clients) are alternatives to WordPress' default Pages and Posts, Custom Taxonomies are alternatives to WordPress Blog Categories and Tags. By default, you can continue to use Categories and Tags as taxonomies for your Custom Post Types (e.g. Products, Clients), but separately from the Blog's Categories and Tags. This way your Product Category pages list only products, while the Blog Category pages will continue to list only Blog Posts belonging to that category.

Custom Taxonomies assume you want to move past the default options of Categories and Tags and set your own values for categorising Custom Post Types. For instance, it would make sense for a “Clients” Custom Post Type to have a custom taxonomy titled “Industry” or “Client Industry”. Very much like in the case of Blog Posts and their associated Categories and Tags, every “Industry” or “Client Industry” value will have a dedicated page that holds references to all “Client” pages related to that industry.

Internal Linking Structure: Breadcrumbs (desktop & mobile)

The use of Breadcrumbs is yet another way to enhance internal linking for the benefit of both the user and search engines in a scalable way. There are several different types of breadcrumbs, including hierarchical (location) breadcrumbs, attribute-based (faceted) breadcrumbs, path-based (history) breadcrumbs or a breadcrumbs set-up using a combination of certain elements of these options.

The effective use of breadcrumbs on websites is a topic in its own right, with implications for users, for search engine bots and the way entire websites are crawled, and even for the way any given page is displayed in search engine results pages (SERPs). However, there is no silver bullet when it comes to selecting the right type of breadcrumbs for your website. You will have to consider such aspects as the nature of your website and how users interact with it, the complexity of the website hierarchy and the kind of internal linking gaps you're trying to fill.

For instance, if you have a content website with a relatively complex hierarchy that may be difficult for both users and bots to fully grasp, you may introduce hierarchical breadcrumbs, the kind of set-up that will fill the internal linking gaps of your website hierarchy. While on-page internal links for the most part lead to pages further down the hierarchical order or on the same level, the breadcrumb links in this particular case lead back up the hierarchical order.

The main implication of this breadcrumbs set-up is that when a search engine bot discovers and visits one of your website's deep pages, it now has a link back to that page's pillar page, then to that pillar page's own parent, and so on, all the way back to the homepage. In other words, search engines have the option to crawl the website in the opposite direction of its hierarchy, something which wouldn't necessarily have been possible without the hierarchical breadcrumbs.

nurdic.com > SEO consultant > How Search Engines Work

On the other hand, if you have an ecommerce store with a long string of filtering options for your products, you may employ an attribute-based breadcrumbs set-up. Users will then be able to visualise their journey as they apply filters, or “attributes”, to their search. This has the potential to positively reflect on user interaction: users can discard the last of any number of previously selected filters and navigate back to a page they have already visited if they are not satisfied with the product selection on the last page.

When it comes to the crawling of attribute-based breadcrumbs on ecommerce sites by search engine bots, the implications can be more important than those of hierarchical breadcrumbs on content sites. Attribute-based breadcrumbs, as you may expect, provide the means for search engine bots to crawl the website across filtering options through the breadcrumb links.

This may prove crucial for the search engine bots’ ability to discover and crawl the pages generated by product filters, particularly if the search engine bots are not able to crawl the links through the filtering feature. In this particular case the attribute-based breadcrumbs may become the only means for search engines to discover, crawl and subsequently index these Product Listing Pages on the site, that may otherwise remain inaccessible to them.

nurdic.com > men's > shoes > boots > Size 8

Internal Linking Structure: Filtering options for ecommerce websites

While attribute-based breadcrumbs can serve as an easy fix for crawlability on ecommerce websites, it's best to employ them as a supplement rather than a replacement for crawling through the filtering links themselves. If you have an ecommerce store that uses product filtering, there are certain minimal conditions and best practices to keep in mind in order to make your desired product listing pages available to the search engines:

  1. Your filtering links must lead to unique URLs, whether they’re static URLs, dynamic URLs with parameters (where the selected filter is represented by an attribute) or dynamic URLs disguised as static URLs. If upon clicking any of the product filtering links, the filtering happens directly on the page without leading to a unique URL, search engine bots will have no dedicated URL to crawl and subsequently index and rank in the search results. It may be important to clarify that URLs with the hash sign (#) are not treated as unique URLs and thus will not be crawled, indexed and subsequently ranked either.
  2. Secondly, search engine bots discover new URLs by parsing a page's HTML for standard <a href="…"> tags. This is the gold standard for discoverability. If your filtering links are generated by JavaScript, they risk not being crawled reliably. This is because Googlebot processes pages in two waves: it crawls the raw HTML first and only renders the JavaScript later. Any delay or error in this second rendering phase can cause JS-injected links to be missed entirely.
  3. Thirdly, it's important to establish that just because a link is crawlable, it does not follow that it will indeed be crawled by Googlebot. While this topic might not be as straightforward as some of the others and there may be numerous factors at play, the third consideration becomes getting a grip on which pages you actually want crawled and indexed. Depending on the number of filters, the options within them, and the number of URLs generated by the different combinations of filter options, the website may end up with a hefty number of URLs. Although this isn't necessarily an issue in itself, it becomes one when the products listed under these URLs aren't unique to any of the pages, because the same products show up again and again under all the different URLs created from the attributes assigned to them. Among other things, you'll have to consider a solution combining nofollow on filtering links, noindex on filter-based URLs and an effective use of canonical tags to tackle duplication across Product Listing Pages as part of your Crawl Budget Optimisation.
  4. Finally, while it's important to consider each of the previous points in order to make your filtering links accessible to search engine crawlers, doing so guarantees neither the crawlability nor the actual crawling of your pages. Search engine bots, including Googlebot, can encounter challenges in crawling your filtering links, especially in scenarios where the code that provides the filtering functionality is non-standard or complex. So, rigorous testing is advised at every stage of implementing any of the provided suggestions.
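To make the second point concrete, here is a minimal sketch of link discovery from raw HTML, in the spirit of a crawler's first wave. This is a simplified Python illustration, not how Googlebot is actually implemented; the URLs are placeholders:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href values from standard <a> tags found in raw HTML."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

html = """
<a href="/shoes/?color=blue">Blue</a>
<script>
  // A link injected later by JavaScript never appears in the raw HTML:
  // document.body.innerHTML += '<a href="/shoes/?color=red">Red</a>';
</script>
"""
extractor = LinkExtractor()
extractor.feed(html)
print(extractor.links)  # ['/shoes/?color=blue']
```

Only the plain <a href> link is discovered; anything that exists solely after JavaScript execution depends on the later rendering phase.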

You'll likely need a qualified developer to implement these basic recommendations for making your filtering crawlable.

Internal Linking Structure: Redirects

As previously outlined, at their core search engine crawlers follow links from page to page, and when any one URL changes, the links that previously pointed to it need to be updated to point to the new URL, otherwise they will break. Naturally, the internal links can simply be edited to point to the new URL. However, this may become time-consuming when structural website changes result in hundreds or thousands of URL changes, where each URL has multiple internal links from various website pages, in many cases hard-coded into the content.

This is where page redirects come in. A page redirect specifies that all users and bots trying to access a particular URL (called the source URL) must be redirected to a different URL (called the destination URL). Thus, by redirecting a URL all the internal links using that URL remain valid and do not need to be changed to point to the new destination URL directly.

The use of redirects, however, has multiple implications for website crawling, and although best practice would be to limit the number of redirects a website uses as much as possible, depending on the website's size and the frequency of structural changes, most websites do continue to rely on URL redirects to support important website changes.

If you absolutely must use redirects, there are only 2 key points to keep in mind:

  1. Try to prevent the occurrence of URL redirect chains, where one URL redirects to another URL that then redirects to yet another URL or chain of URLs. This issue may come up as an unintentional consequence of multiple structural website changes over time. When present, redirect chains will negatively reflect on the crawling in multiple ways, including:
    1. Long redirect chains may result in the search engine bots not following the chain all the way through to the final destination URL and as a result the destination URL will not be crawled.
    2. The loading of every URL as part of the URL chain consumes resources that would otherwise be spent somewhere else on the website, where they may be, in fact, spent more wisely.
  2. On rare occasions, manual errors in the use of redirects may lead to Redirect Loops, where a URL ultimately redirects back to itself, either directly (the source and destination URLs are identical) or through a chain. The consequences are similar: the desired URL will almost certainly not be crawled, and search engine bot resources will be wasted in vain.
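Both failure modes above can be sketched with a tiny resolver over an in-memory redirect map. This is a hypothetical Python illustration, not a real crawler; the URLs and the 5-hop limit are assumptions:

```python
def resolve(url, redirects, max_hops=5):
    """Follow a redirect map, giving up on long chains and loops.

    `redirects` maps a source URL to its destination URL; a URL absent
    from the map is a final destination. Returns (final_url, status).
    """
    seen = {url}
    hops = 0
    while url in redirects:
        url = redirects[url]
        hops += 1
        if url in seen:
            return url, "loop"       # redirect loop: the chain revisits a URL
        if hops > max_hops:
            return url, "gave up"    # chain too long: crawlers stop following
        seen.add(url)
    return url, "ok"

redirects = {"/a": "/b", "/b": "/c", "/x": "/x"}
print(resolve("/a", redirects))  # ('/c', 'ok') after a 2-hop chain
print(resolve("/x", redirects))  # ('/x', 'loop')
```

Real crawlers behave analogously: past a certain number of hops, Googlebot abandons the chain, and a loop is never resolved at all.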

Although not directly related to crawling, it is worth noting that the use of redirects can also affect other aspects of how search engines work, including the ranking of the URL, well after crawling has taken place.

External Inbound Links’ Influence over Crawling

Search engines allocate crawl resources based on a page’s perceived importance, and high-quality external inbound links are a primary signal of that importance. Links from authoritative domains tell search engines that your content is valuable, which encourages them to crawl your site more frequently.

The direct benefit of more frequent crawling is that any changes you make to your content will be discovered and reflected in the search engine results pages (SERPs) much faster. Conversely, a lack of authoritative inbound links can cause significant delays in how promptly your pages are updated in the index.

The attributes of an inbound external link also provide context. While the rel="nofollow" attribute once discouraged crawling, it now functions as a hint, alongside rel="sponsored" (for paid links) and rel="ugc" (for user-generated content). For crawling, Google treats all three attributes identically: they primarily signal not to pass authority, but they no longer prevent a link from being crawled.

While inbound external links are links on other websites leading to yours, outbound external links are links on your website leading to other websites. You must employ nofollow, sponsored and ugc according to their intended purpose on external links only, and almost under no circumstances should you use them on internal links.
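For illustration, here is how the three attributes might appear on outbound links; the URLs are placeholders:

```html
<!-- A paid placement: signals not to pass authority -->
<a href="https://advertiser.example/" rel="sponsored">Our sponsor</a>

<!-- A link added by a user, e.g. in a comment section -->
<a href="https://commenter.example/" rel="ugc">Commenter's site</a>

<!-- A link you don't vouch for -->
<a href="https://unvetted.example/" rel="nofollow">Unvetted source</a>
```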

Robots.txt

As previously outlined, the primary role of the robots.txt file is to specify which areas of the website search engine bots should not crawl, primarily to manage the crawl budget. In practice, there are various considerations at different levels to keep in mind in order to accomplish this:

  1. The primary intended purpose of the robots.txt file is to define the areas of the website, including specific URLs or URLs that follow a pattern (e.g. all URLs that contain a select word), subdirectories, or any specific file types, which the search engine bots must not attempt to crawl.
    1. Search engine bots operate under the default assumption that they are allowed to crawl any page or directory not blocked in the robots.txt file.
    2. Unlike the nofollow attribute used on internal links and the nofollow directive of the robots meta tag on any given page, which search engine bots, including Googlebot, can on occasion choose to ignore, Googlebot will always respect the robots.txt instructions, provided there are no errors that prevent it from understanding the file.
  2. The instructions can be set for any one, multiple or all of the search engine bots (User-Agents) attempting to crawl your website and the file can accommodate the use of different instructions for different user agents or bots.
  3. The file also accommodates the location of the XML sitemap or the XML sitemap index which is arguably the most appropriate way to ensure your XML sitemap is discovered by all search engines, including Google.

Not only is it important to acknowledge the role of the robots.txt file, but also to account for certain conditions that must be met in order for the file to work as expected. For WordPress, manage your robots.txt file directly in the General Settings of your Rank Math SEO plugin. This will enforce certain required standards and keep the file free from manual errors, including:

  1. The robots.txt file must be a UTF-8 encoded text file (which includes ASCII); otherwise it may not be understood by Google or other search engine bots.
  2. The file must be placed at the root of the website (i.e. domain.com/robots.txt).
    1. The file upload method may vary across hosting providers or depend on other factors, which may be difficult or somewhat time-consuming for non-technical users to master.
    2. There can only be one single robots.txt file per domain.
  3. The robots.txt file only applies to the domain or the subdomain it is placed at the root of.
    1. A robots.txt file placed at the root of the domain (i.e. domain.com/robots.txt) will not enforce its rules over the domain’s subdomains (i.e. subdomain.domain.com/robots.txt).
    2. Each subdomain is treated as separate and will require its own robots.txt.
  4. The robots.txt file applies to a single protocol. If your robots.txt file is placed at the root of your domain using the https protocol, but you happen to have pages on your website using the http protocol, the file will not enforce any crawling instructions over those http pages.

Nurdic.com’s robots.txt file contents:

# Ensures the ahrefs bots can crawl the website
User-agent: AhrefsSiteAudit
User-agent: AhrefsBot
Disallow: /wp-admin/
Allow: /

User-agent: *
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-admin/

Sitemap: https://nurdic.com/sitemap_index.xml

Whether through Rankmath’s General Settings or any other method, account for the following when editing your robots.txt file:

  1. The file is case-sensitive and the incorrect use of letter case may lead to invalid instructions.
  2. The file is made up of groups. The first group, as seen through the Nurdic.com robots.txt file example contains:
    1. A comment beginning with the # symbol. The comment is ignored by search engine bots but provides context to humans reviewing the file later: # Ensures the ahrefs bots can crawl the website
    2. The User-agent: expression followed by the name of the bot, in the example above, AhrefsSiteAudit.
      1. If you want to specify multiple bots, you must start a new line for every additional bot with the now familiar expression User-agent: followed by the name of the bot (e.g. AhrefsBot).
  3. On the following line(s), use Allow: and/or Disallow: to explicitly mark certain pages as available or unavailable to the listed bot(s) for crawling:
    1. Although the order of said instructions is not important for the Googlebot, the best practice is to use the most specific rule first to ensure maximal syntax compatibility with the other search engine bots.
      1. So, if there are any URL path exceptions to a more generic rule, they must come first: Disallow: /wp-admin/
      2. This is followed by the more generic rule – in the provided example – all URLs: Allow: /
  4. To separate groups you must use blank lines.
    1. So, if you have different rules for different bots or groups of bots, you must first list the bots and the rules for the first one as one group.
    2. Then, include a blank line to separate the instructions for the first group from the instructions for the second one.
    3. Below, list the bots followed by the instructions they’re meant to follow right underneath. This will ensure each set of instructions will apply only to the specified group of bots.
  5. The second group as seen through the same example contains:
    1. The now familiar expression, User-agent: to list a new bot or in this case, * to refer to all remaining bots apart from those specified above that follow their own dedicated instructions.
      1. There is no required order to follow when it comes to ordering groups based on any user agents or the use of * for all user agents. However, best practice would suggest placing the User-agent: * group last in order to make the file more understandable to humans.
    2. This is followed by the instructions for the second group of bots, in this case * or all search engine bots apart from those listed in the first group.
      1. In a now familiar fashion, the more specific instruction(s) or the exception to the rule comes first: Allow: /wp-admin/admin-ajax.php
      2. This is followed by the remaining more generic instruction(s): Disallow: /wp-admin/
  6. If you decide to include the location of the XML sitemap or XML sitemap index on the last line of the robots.txt file, which is highly advisable – you’ll need to separate the last group from the XML sitemap using a blank line, just like separating regular groups.
  7. Usually placed on the very last line of the file is the absolute URL to the XML sitemap or XML sitemap index. This is arguably the best method to ensure all search engine bots discover your XML sitemap:
    1. Sitemap: https://nurdic.com/sitemap_index.xml
  8. To ensure your robots.txt file works as intended, you must use a robots.txt tester. If your primary target as a search engine is Google Search, then Google Search Console Robots.txt Report will help you accomplish this.
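The group logic above can also be sanity-checked with a short script. The following is a sketch using Python’s standard-library robots.txt parser; note it follows the original spec’s simple prefix matching (no * or $ wildcards), which is sufficient for this example file:

```python
# Sanity-check the example robots.txt rules with Python's stdlib parser.
from urllib import robotparser

ROBOTS_TXT = """\
# Ensures the ahrefs bots can crawl the website
User-agent: AhrefsSiteAudit
User-agent: AhrefsBot
Disallow: /wp-admin/
Allow: /

User-agent: *
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-admin/
"""

parser = robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The Ahrefs bots may crawl everything except /wp-admin/.
print(parser.can_fetch("AhrefsBot", "https://nurdic.com/blog/"))      # True
print(parser.can_fetch("AhrefsBot", "https://nurdic.com/wp-admin/"))  # False

# Every other bot falls under the * group: /wp-admin/ is blocked,
# but the admin-ajax.php exception remains crawlable.
print(parser.can_fetch("Googlebot",
                       "https://nurdic.com/wp-admin/admin-ajax.php"))  # True
print(parser.can_fetch("Googlebot", "https://nurdic.com/wp-admin/"))   # False
```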

Additional Robots.txt Considerations

  1. Allow or Disallow entire subdirectories using: Disallow: /do-not-crawl/
  2. Allow or Disallow all URLs that contain a character or string of characters using: Disallow: /*?
    1. The / specifies all URLs the bot will be able to find belonging to the domain or subdomain and the particular protocol the robots.txt file is placed at the root of.
    2. The * in this context is a wildcard: Googlebot treats it as matching any sequence of characters, so the Disallow: rule is enforced over every URL containing the character or string of characters that comes immediately after it (i.e. ?).
    3. The use of ? in URLs is reserved for defining parameter-based URLs. All parameter-based URLs contain the ?, while no other URLs use it. This makes the ? unique to parameter-based URLs and, by extension, allows us to use it as a reference point to block all parameter-based URLs on the website from crawling.
  3. One other character often used in the robots.txt file is the $, employed to specify the end of the URL and most commonly used to create crawling instructions to Allow or Disallow all URLs based on file types, in the following way:
    1. Disallow: /*.pdf$
    2. Allow: /*.jpg$
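To illustrate how the * and $ patterns above are interpreted, here is a simplified, hypothetical matcher that translates a rule path into a regular expression. It sketches the matching logic only and is not Google’s actual implementation:

```python
# Simplified sketch of Google-style robots.txt wildcard matching:
# '*' matches any sequence of characters; '$' anchors the end of the URL.
import re

def rule_matches(rule_path: str, url_path: str) -> bool:
    """Return True if an Allow/Disallow rule path matches the URL path."""
    pattern = "".join(
        ".*" if ch == "*" else "$" if ch == "$" else re.escape(ch)
        for ch in rule_path
    )
    # Rules match from the start of the path; without '$' the end is open.
    return re.match(pattern, url_path) is not None

print(rule_matches("/*?", "/shoes?sort=price_asc"))     # True: parameter URL
print(rule_matches("/*?", "/shoes/red-trainers"))       # False: no '?'
print(rule_matches("/*.pdf$", "/guides/seo.pdf"))       # True: ends in .pdf
print(rule_matches("/*.pdf$", "/guides/seo.pdf.html"))  # False: '$' anchors end
```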

Practical Uses of robots.txt

Beyond blocking simple pages like a CMS login, robots.txt is most valuable for preventing crawl budget waste by blocking thousands of automatically-generated, low-value URLs.

An e-commerce site’s filters can create countless URL variations from different combinations. A prime example is sorting. A user sorting a category page by price might generate a URL with a parameter like ?sort=price_asc.

While the sorting order changes, the products on the page remain the same. These pages are near-duplicates that offer no unique value to search engines. By adding a simple Disallow rule to block these sorting parameters, you can save valuable crawl budget and focus search engine attention on your unique category and product pages.
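For instance, assuming the sorting parameter is named sort (a hypothetical name; check your own URLs before copying this), a rule along these lines would block every sorting variation for all bots:

```
User-agent: *
Disallow: /*?sort=
```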

robots.txt vs. the Live Search Engine Index

A robots.txt file only controls future crawling; it does not remove pages that are already in a search engine’s live index. If a URL is already indexed, adding a Disallow rule for it in robots.txt will not remove it from search results. In fact, this action prevents Google from recrawling the page to see any new instructions, effectively trapping the old version in the index.

To properly remove a URL from the live index, you must:

  1. Allow Crawling: Ensure the URL is not blocked in your robots.txt file.
  2. Add a noindex Tag: Place a noindex robots meta tag in the page’s HTML <head> section.

This allows the search engine bot to revisit the page, read the noindex directive, and then remove the page from the live index.
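A minimal sketch of step 2: the noindex robots meta tag sits in the page’s HTML head, where the bot can read it once crawling is allowed:

```html
<head>
  <!-- Instructs search engines to keep this page out of the index -->
  <meta name="robots" content="noindex">
</head>
```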

Robots.txt Security

The robots.txt file is distinct from on-page directives like the nofollow or noindex attribute of the robots meta tag (placed in the <head> section). A key difference is that while search engines like Google may treat nofollow as a hint they can choose to ignore, they do not generally ignore rules in the robots.txt file.

The robots.txt file instructs search engines which website pages to avoid, primarily to manage crawl budget. Crucially, the file is public and should never be used as a security measure. While reputable crawlers like Googlebot respect its rules, malicious bots can ignore them and use the file to discover the exact URLs you intend to hide.

Robots meta tags and X-Robots-Tag HTTP Header

The meta robots tag provides page-specific instructions for search engines. Unlike the site-wide robots.txt file, this tag is placed within the <head> section of a page’s HTML. For a search engine to see and obey a meta robots tag, the page must be crawlable. If a page is blocked by robots.txt, the bot will never visit it to read the tag’s instructions.

The primary directive that directly influences crawling behavior is the nofollow attribute of the robots meta tag.

  • nofollow: This instructs search engines not to follow any of the links on the page to discover new URLs. The crawler will still see the page itself, but it won’t use it as a pathway to other content.
  • follow: ✅ This is the default behavior. It tells search engines they are free to follow links on the page to discover other content. You do not need to add this tag, as it’s the assumed behavior if nofollow is not present.
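In HTML, these directives take the form of a single meta tag in the page’s head section; multiple directives can be combined in one content value:

```html
<!-- Crawl this page, but do not follow any links found on it -->
<meta name="robots" content="nofollow">

<!-- Combined: keep the page out of the index and do not follow its links -->
<meta name="robots" content="noindex, nofollow">
```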

It’s crucial to understand that other common directives like noindex, noarchive, and nosnippet do not control crawling. They are Indexing directives that instruct search engines on how to process or display a page in search results after it has already been crawled.

The X-Robots-Tag is an HTTP header that serves the exact same purpose as the meta robots tag. The key difference is that because it’s sent as part of the server’s response header for a URL, it can be applied to any type of file, not just HTML pages. This makes it the only way to control crawling and indexing for non-HTML files like PDFs, images, or documents, which don’t have an HTML <head> section to place a meta tag in.

When a bot requests a URL from your server, the server sends back the X-Robots-Tag along with the file. To tell crawlers not to follow any links within a document (e.g. a PDF): X-Robots-Tag: nofollow. Implementation of the X-Robots-Tag is done on the server side, most commonly within your site’s .htaccess file.
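As a sketch, on an Apache server with mod_headers enabled, an .htaccess rule along these lines would attach that header to every PDF the site serves:

```apache
# .htaccess sketch (assumes Apache with mod_headers enabled):
# attach X-Robots-Tag: nofollow to every served PDF
<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "nofollow"
</FilesMatch>
```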