
Robots.txt Tutorial

A robots.txt file is a simple text file that gives instructions to search engine robots, also called spiders. Most search engines will honor those instructions. This tutorial will show you how to keep robots from indexing images, web pages, and directories.

To create a robots.txt file, open a text editor such as Notepad and, on a new blank page, type:

User-agent: *
Disallow: /software/
Disallow: /somePage.html

Here’s what those three lines mean:

User-agent: *
The “user-agent” line tells search engine robots whether the directives that follow apply to them. You can name individual agents, but that isn’t very practical because there are hundreds. The asterisk (*) is a wildcard, meaning the directives apply to ALL user agents. It's important to note that not all user agents will obey a robots.txt file, but most will.

Disallow: /software/
This tells the robots they should not index the content of the software directory.

Disallow: /somePage.html
This tells the robots they should not index the specific web page listed.
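If you'd like to sanity-check rules like these before uploading, Python's standard library includes a robots.txt parser. This is just an optional sketch; the paths are the ones from the example above:

```python
import urllib.robotparser

# Parse the example rules directly, without fetching anything over the network.
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /software/",
    "Disallow: /somePage.html",
])

# Anything under /software/ and the named page are off-limits...
print(rp.can_fetch("*", "/software/app.zip"))  # False
print(rp.can_fetch("*", "/somePage.html"))     # False
# ...but the rest of the site is still crawlable.
print(rp.can_fetch("*", "/index.html"))        # True
```

This checks what a well-behaved robot would conclude from your file; it can't, of course, make a badly-behaved robot obey it.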

There are a few other items you might want to use. Let’s have a look…

# This is a comment in a robots.txt file

You can make comments in your robots.txt file; a comment is preceded by a hash mark. Comments are useful for reminding yourself later why you made certain entries.

If a robot is hitting your website too often, you can disallow that specific robot. Not all will obey, but many do.

User-agent: Badbot
Disallow: /

Be absolutely sure you name the abusive robot explicitly. DO NOT use the wildcard with this disallow, because it tells the robot not to index anything! Note that "Badbot" is NOT an actual robot name; it's a placeholder for a real robot name. You'll have to check your site's stats to see how often robots are hitting it, and get the name of any abusive robot from there.
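You can also verify that an entry like this blocks only the named robot. A quick sketch using Python's standard-library parser, keeping the placeholder name "Badbot" from above:

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: Badbot",
    "Disallow: /",
])

# The named robot is shut out of the whole site...
print(rp.can_fetch("Badbot", "/anyPage.html"))     # False
# ...while any other robot is unaffected.
print(rp.can_fetch("Googlebot", "/anyPage.html"))  # True
```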

Partially Effective

The following directives are not honored by all spiders, so don’t count on them to be fully effective. You can disallow specific file types.

Disallow: /*.gif$
Disallow: /*.jpg$
Disallow: /*.htm$

With that, robots that support this syntax are prevented from indexing GIF and JPG images, as well as any web page with a .htm extension. The asterisk matches any sequence of characters in the path, and the dollar sign anchors the match to the end of the URL.

If you usually create your pages as .html, you can give any pages you don’t want indexed a .htm extension and keep some robots from indexing them with just one robots.txt entry. Otherwise, you can list each page you don’t want indexed individually.

If user agents are requesting files too quickly, you can instruct them how long to wait between requests.

User-agent: *
Crawl-delay: 10

In that example, robots are told to wait 10 seconds between requests. Be careful using this: many (if not all) robots only spend so much time per site, so if you set the delay too long, less of your content may get indexed.
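Python's standard-library parser also reads Crawl-delay (the crawl_delay method exists from Python 3.6 on), so you can confirm what a compliant robot would see. A minimal sketch:

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Crawl-delay: 10",
])

# A compliant robot would pause 10 seconds between requests.
print(rp.crawl_delay("*"))  # 10
```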

Some robots also obey the Sitemap directive. Listing your sitemap in your robots.txt file is a good way to make sure search engines find it.

Sitemap: http://www.i-webhost.com/sitemap_index.xml
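The same standard-library parser picks up Sitemap lines too (the site_maps method exists from Python 3.8 on). A sketch using the example URL from above:

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "Sitemap: http://www.i-webhost.com/sitemap_index.xml",
])

# Returns the list of sitemap URLs declared in the file.
print(rp.site_maps())  # ['http://www.i-webhost.com/sitemap_index.xml']
```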

Once you have your robots.txt file made, save it as "robots.txt" and upload it to the root directory of your website. The search engine robots should find it on their own. Note that it can take some time; they are not likely visiting your site every day.

Warning: This is an extremely important point. Do NOT list any pages you don't want other people to know about in your robots.txt file!

Why not, you ask?

Because anyone can enter:

http://www.yourDomain.com/robots.txt

...into their browser's address bar and view your robots.txt file.

If you list pages you don't want people to find, you're just advertising them to the robots.txt snoops. Some people do go around looking at robots.txt files trying to scavenge free products or to find what "secrets" they can uncover. No sense in making it easy for them.

The following warning comes straight from Google, but it likely applies to all search engines:

Warning: Don't use a robots.txt file as a means to hide your web pages from Google search results.

If other pages point to your page with descriptive text, Google could still index the URL without visiting the page. If you want to block your page from search results, use another method such as password protection or noindex.

If you've ever seen this in the search result pages:

"We would like to show you a description here but the site won't allow us."

...or similar wording depending on the search engine, that page was likely indexed from a link to it but was blocked by a robots.txt file. So, if your web page is blocked with a robots.txt file, its URL can still appear in search results, but the result will have no description. Image files, video files, PDFs, and other non-HTML files will be excluded entirely.

Using a robots.txt file is one layer of protection. See the Robots Meta Tag tutorial to add another layer of protection.

Because every web page needs a little color and eye candy, here's a picture of me after I got a haircut and shaved my beard. I clean up pretty good for an old guy, don't you think?

Guys, don't be jealous. Not everyone can be born with natural good looks.