27 June 2018
Sometimes, it can be incredibly useful to be able to block robots from crawling your site. In this blog, we’re going to explain why, and how you can use a handy robots.txt file to achieve it.
Many savvy web developers block robots from crawling parts of their websites that aren’t meant to be seen by users. Blocking is also particularly useful while a site is being built: it isn’t ready for indexing yet, so you or your web developer can block all bots until it’s polished and ready for viewing.
Limiting the number of pages needing to be assessed on your site can be helpful because it saves the bot time. It doesn’t have to view pages that aren’t helpful for ranking, meaning it can go deeper into the useful pages of your site. Bots crawl on a budget, and it’ll work in your favour to save them some time.
You might want to tell bots not to follow routes through your site for lots of different reasons. For example, if you’ve got pages that are for you and your staff only, you don’t need them showing up in search engine rankings as they’re not helpful for your customers. Or, you might want to keep certain images or files out of the rankings, or have some duplicate content you don’t want messing up your SEO. These are all great reasons to block certain pages from being crawled.
Not all websites need to block robots from crawling anything, so don’t worry if you don’t have a robots.txt file. If all your pages are helpful and you want to rank for them, let the bots be free to index it all.
A robots.txt file on your website advises bots which parts of your content they shouldn’t help themselves to for indexing. It’s a guideline that all friendly – and plenty of unfriendly – bots will see when they first encounter your site. They don’t have to obey it, but good indexing bots from Google and other search engines will.
Just make sure it sits at the root of your domain (e.g. yoursite.com/robots.txt) because that’s where bots will look first. If it’s housed anywhere else on your site, it’s probably going to be ignored by all bot traffic.
Specify different rules in your robots.txt file to control which bots can index different parts of your site. You can block them completely, though that’s a bad move for your SEO, or you can use the file to guide bots towards the most useful parts of your website. Alternatively, you can stop them from crawling pages you don’t want users to see.
Here’s a quick example of what a robots.txt file could look like:
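The simplest version addresses every bot and blocks nothing:

```
User-agent: *
Disallow:
```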
The * stands in place of a specific bot. Any bot that comes to your site will know that this instruction is for them. Having nothing specified after “Disallow:” lets them know all your site is to be crawled – nothing is off limits.
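To keep all bots out of a particular folder, such as a “/database/” directory, the file would read:

```
User-agent: *
Disallow: /database/
```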
This will ask that all bots not crawl anything that’s in your “/database/” folder. Of course, some could if they weren’t of a friendly nature, as these are only guidelines you’re setting out for them.
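Because Disallow rules match everything that begins with the given path, a single line covers a whole section of URLs:

```
User-agent: *
Disallow: /example/
```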
Here, you’re asking that bots don’t crawl any URLs beginning with a certain path. So, if you wanted to keep them out of files like “/example/,” here’s where you’d ask.
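An Allow line carves out an exception inside a blocked area. The “/database/public/” path below is just a hypothetical page you’d still want crawled:

```
User-agent: *
Disallow: /database/
Allow: /database/public/
```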
If there’s a page you would like to be crawled by helpful bots nestled among ones that you don’t want them crawling, you can use an “allow” exception to let them know there’s a page they can see in that group.
User-agent: spam bot
Disallow: /
Have you found you’ve been getting a lot of spam traffic coming through from a certain bot? If so, you can ask it to stop. In the example above, we’ve used “spam bot” as a stand-in, but you should use the name of the specific bot that’s bugging you and ask it not to look at anything.
Having the option to block robots from crawling your site can be handy in certain situations. If you’ve got pages you want to keep private, or ones that aren’t ready to be indexed and ranked, you can ask bots not to crawl them. It’s a guide that friendly bots will obey, but other malicious crawlers might push on regardless.
If you’d like to talk about getting some bot blocks in place on your website, let us know – we’d be happy to help. Just get in touch.