Robots.txt: The deceptively important file all websites need

The robots.txt file tells search engine crawlers which parts of your website they can and cannot visit.

Although the major search engines support the robots.txt file, they may not all follow the rules in the same way.

Below we explain what a robots.txt file is and how you can use it.

What is a robots.txt file?

Every day, bots – also known as robots or spiders – visit your website. Search engines like Google, Yahoo, and Bing send these bots to your website so that your content can be crawled, indexed, and shown in search results.

Bots are a good thing, but there are times when you don’t want them wandering around your website crawling and indexing everything. This is where the robots.txt file comes in.

By adding certain instructions to a robots.txt file, you tell the bots to crawl only the pages you want crawled.

However, it is important to understand that not every bot obeys the rules you write in your robots.txt file. For example, Google ignores any crawl-delay rule you add to the file.

Do you need a robots.txt file?

No, a website does not require a robots.txt file.

When a bot visits a website that doesn’t have one, it simply crawls the website and indexes the pages as usual.

A robots.txt file is only needed if you want more control over what is crawled.

Some advantages of having one are:

  • Help prevent server overload
  • Prevent crawl budget from being wasted on pages you don’t want bots to visit
  • Keep certain folders or subdomains out of search engine crawls

Can a robots.txt file prevent content from being indexed?

No, you cannot prevent content from being indexed and displayed in search results with a robots.txt file.

Not all robots follow instructions in the same way, so some may still index content that you have blocked from crawling.

If external websites link to the content you want to keep out of search results, search engines may still index it anyway.

The only way to ensure your content isn’t indexed is to add a noindex meta tag to the page. This tag is placed in the HTML <head> of your page and looks like this:
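<meta name="robots" content="noindex">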

Counterintuitively, if you want search engines not to index a page, the page must still be allowed to be crawled in your robots.txt file; if the bots can’t crawl the page, they will never see the noindex tag.

Where is the robots.txt file located?

The robots.txt file always lives at the root of a website’s domain. As an example, you can find our own file at https://www.hubspot.com/robots.txt.

On most websites you can access the actual file and edit it via FTP, or through the file manager in your host’s cPanel.

On some CMS platforms you can find the file directly in your administration area. HubSpot, for example, makes it easy to customize your robots.txt file through your account.

If you’re using WordPress, you can access the robots.txt file in the public_html folder on your website.

the robots.txt file in the public_html folder on your WordPress website

By default, WordPress includes a robots.txt file on a clean install that contains:

User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/

The above instructs all bots to crawl all parts of the website except for anything in the /wp-admin/ or /wp-includes/ directories.

However, you may want to make a more robust file. Let’s show you how below.

Uses for a Robots.txt file

There can be many reasons for customizing your robots.txt file – from controlling the crawl budget to blocking sections of a website from being crawled and indexed. Now let’s examine some reasons for using a robots.txt file.

1. Block all crawlers

Blocking all crawlers from accessing your website isn’t something you want to do on an active website, but it’s a great option for a development website. Blocking the crawlers will prevent your pages from showing up in search engines, which is good if your pages aren’t ready to be viewed just yet.
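As a minimal example, a file like the one below asks every crawler to stay away from the entire site – the single slash after Disallow refers to the site root (we cover this rule in more detail further down):

User-agent: *
Disallow: /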

2. Prevent certain pages from being crawled

One of the most common and useful ways to use your robots.txt file is to restrict search engine bots from accessing parts of your website. This will allow you to maximize your crawl budget and prevent unwanted pages from ending up in search results.

It’s important to note that just because you told a bot not to crawl a page, it doesn’t mean it won’t be indexed. If you don’t want a page to appear in search results, you’ll need to add a noindex meta tag to the page.
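As a sketch, assuming a hypothetical /thank-you/ page you’d rather keep out of the crawl, the block would look like this (remember to also add the noindex tag to the page itself if you want it out of search results):

User-agent: *
Disallow: /thank-you/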

Example of Robots.txt file instructions

The robots.txt file consists of blocks of directives. Each block begins with a user-agent line, and the rules for that user-agent are placed below it.

When a search engine bot lands on your website, it looks for the user-agent block that applies to it and follows the rules in that block.

There are several directives that you can use in your file. Let’s break these down now.

1. User agent

With the user-agent directive, you can target specific bots or spiders. For example, if you only want to target Bing or Google, this is the directive you use to name them.

Although there are hundreds of user agents, the following are examples of some of the most common user agent options.

User-agent: Googlebot

User-agent: Googlebot-Image

User-agent: Googlebot-Mobile

User-agent: Googlebot-News

User-agent: Bingbot

User-agent: Baiduspider

User-agent: msnbot

User-agent: Slurp (Yahoo)

User-agent: Yandex

Be sure to type the user-agent names exactly as documented; some crawlers treat them as case-sensitive.

Wildcard user agent

The wildcard user agent is marked with an asterisk (*) and lets you apply a rule to every user agent at once. So if you want a rule to apply to every bot, use this user agent:

User-agent: *

Keep in mind that a crawler will only follow the block of rules that most specifically applies to it.
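For example, assuming hypothetical /private/ and /drafts/ directories, Googlebot would follow only the block addressed to it and ignore the wildcard block, while all other bots would follow the wildcard block:

User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow: /private/
Disallow: /drafts/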

2. Disallow

The Disallow directive instructs search engines not to crawl or access certain pages or directories on a website.

The following are some examples of how you can use the Disallow directive.

Block access to a specific folder

In this example, we’re instructing all bots not to crawl anything in the /portfolio directory on our website.

User-agent: *
Disallow: /portfolio

If we just wanted Bing not to crawl this directory, we’d add it like this instead:

User-agent: Bingbot
Disallow: /portfolio

Block PDF or other file types

If you don’t want your PDF or other file types to be crawled, the following instruction should help. Here we tell all bots that we don’t want our PDF files crawled; the $ at the end tells the search engine that it marks the end of the URL.

So if you have a PDF file at meinwebsite.com/site/myimportantinfo.pdf, the search engines will not access it.

User-agent: *
Disallow: /*.pdf$

For PowerPoint files, you can use:

User-agent: *
Disallow: /*.ppt$

A better option might be to create a folder for your PDF or other files, then disallow crawlers from crawling it and keep the entire directory out of search results with a noindex tag.
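As a rough sketch, assuming you move your documents into a hypothetical /pdfs/ folder, the rule would be:

User-agent: *
Disallow: /pdfs/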

Block access to the entire website

Especially useful if you have a development site or test folder, this directive tells all bots not to crawl your site at all. Make sure to remove this when you go live with your website or you will experience indexing issues.

User-agent: *
Disallow: /

The * (asterisk) shown above is a so-called “wildcard” expression: it means the rules that follow apply to all user agents. The single slash after Disallow tells those bots not to crawl any URL on the site.

3. Allow

The Allow directive lets you specify pages or directories that bots may access and crawl. It can act as an override for the Disallow directive shown above.

In the following example, we’re telling the Googlebot that we don’t want the Portfolio directory to be crawled, but we want a specific Portfolio item to be accessed and crawled:

User-agent: Googlebot
Disallow: /portfolio
Allow: /portfolio/crawlable-portfolio

4. Sitemap

Including the location of your sitemap in your file makes it easier for search engine crawlers to find and read your sitemap.

Submitting your sitemaps directly to each search engine’s webmaster tools eliminates the need to add them to your robots.txt file.

Sitemap: https://ihrewebsite.com/sitemap.xml

5. Crawl delay

The Crawl-delay directive tells a bot to slow down its crawling of your website so that it doesn’t overload your server. The following example asks Yandex to wait 10 seconds after each crawl action on the website.

User-agent: Yandex
Crawl-delay: 10

This is a directive to use with care. On a very large website, it can greatly reduce the number of URLs crawled each day, which would be counterproductive. However, it can be useful on smaller websites where bots visit a little too often.

NOTE: Crawl-delay is not supported by Google or Baidu. If you want to ask their crawlers to slow down crawling of your website, you need to do it through their respective webmaster tools.

What are regular expressions and wildcards?

Pattern matching is a more advanced way of controlling how a bot crawls your website, using special characters.

There are two pattern-matching characters recognized by both Bing and Google. They can be especially useful on ecommerce websites.

  • Asterisk (*): treated as a wildcard and can represent any sequence of characters
  • Dollar sign ($): marks the end of a URL

A good example of using the * wildcard is a scenario where you want to prevent search engines from crawling URLs that contain a question mark. The following code instructs all bots not to crawl any URL with a question mark in it.

User-agent: *
Disallow: /*?

How to create or edit a Robots.txt file

If you don’t have a robots.txt file on your server, you can easily add one using the following steps.

  1. Open your favorite text editor to create a new document. Common editors that may be present on your computer are Notepad, TextEdit or Microsoft Word.
  2. Add the instructions you want to add to the document.
  3. Save the file under the name “robots.txt”.
  4. Test your file as shown in the next section.
  5. Upload your .txt file to your server via FTP or in your cPanel. How you upload it depends on the type of website you have.

In WordPress, you can use plugins like Yoast SEO, All in One SEO, or Rank Math to generate and edit your file.

You can also use a robots.txt generator tool to prepare your file, which helps minimize errors.

How to test a Robots.txt file

Before going live with the robots.txt file code you created, you should run it through a tester to ensure that it is valid. This will help avoid problems with incorrect instructions that may have been added.

The robots.txt testing tool is only available in the old version of Google Search Console. If your website isn’t connected to Google Search Console, you’ll need to do that first.

Go to the Google support page and click the “Open robots.txt Tester” button. Select the property you want to test and you’ll be taken to a screen like the one below.

To test your new robots.txt code, simply delete the contents of the box, replace it with your new code, and click “Test”. If the response is “Allowed”, your code is valid and you can update your live file with the new code.

the robots.txt tester at Google support

Hopefully this post has made you less afraid to dig into your robots.txt file – doing so is a way to improve your rankings and boost your SEO efforts.

