Role of Robots.txt file in a Well Designed Site
What is a robots.txt file?
It is a plain text file with instructions that tell search engine robots which pages of your site they may crawl and index. It is the first file a well-behaved robot requests when it visits a site, and its name is always robots.txt. Each record in it has two fields: User-agent and Disallow.
User-agent names the robot to which the access policy applies, and Disallow lists the URLs on your site that this robot must not access.
This file can be used to ban all or a few robots from all or selected pages on the site. Using User-agent: * with Disallow: / bars all robots from every page on the site. Combining User-agent: MSNbot with an empty Disallow: and User-agent: * with Disallow: /concepts/new/ is an example where multiple records give different robots different policies: MSNbot may access everything, while all other robots are barred from /concepts/new/.
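The two policies described above can be checked with a short script. This is only a sketch: it uses the robots.txt parser from the Python standard library, and the helper name `allowed` and the robot name "OtherBot" are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Policy 1: every robot is barred from every page.
BAN_ALL = """\
User-agent: *
Disallow: /
"""

# Policy 2: MSNbot may go anywhere (empty Disallow),
# all other robots are barred from /concepts/new/.
MIXED = """\
User-agent: MSNbot
Disallow:

User-agent: *
Disallow: /concepts/new/
"""


def allowed(robots_txt: str, agent: str, path: str) -> bool:
    """Return True if `agent` may fetch `path` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    print(allowed(BAN_ALL, "OtherBot", "/index.html"))
    print(allowed(MIXED, "MSNbot", "/concepts/new/page.html"))
    print(allowed(MIXED, "OtherBot", "/concepts/new/page.html"))
```

Note that a record with an empty Disallow value does not ban anything; it grants that robot full access.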
Recommended use of the robots.txt file:
1. The file name is always written in lowercase: robots.txt.
2. A wildcard can be used in only one of the two fields, not both: * is valid in the User-agent field, but the original standard does not define wildcards in the Disallow field. Some robots, such as Googlebot, additionally support wildcard patterns, for example to match file extensions.
3. A website is not obliged to have this file. It is needed only where some pages on the site must be excluded from indexing by search engines.
4. A domain can have only one robots.txt file.
5. Where webmasters cannot create this file, a similar purpose can be served by adding a robots meta tag to individual pages. However, some robots ignore it, and it does not deliver the same functionality.
6. Each user agent should be specified on a separate line. Both the User-agent and Disallow fields can be used any number of times.
7. File names on Unix operating systems are case sensitive, so the paths in this file must match the case used on the server. Keeping file and directory names in lowercase, both on the server and in this file, avoids mismatches.
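The point about case sensitivity can be illustrated with a small sketch using the Python standard library parser. The directory name and the robot name "ExampleBot" are hypothetical, and other robots may match paths differently.

```python
from urllib.robotparser import RobotFileParser

# One record that bars every robot from a lowercase directory.
RULES = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# Path matching is case sensitive: only the exact lowercase prefix is blocked.
lower_blocked = not parser.can_fetch("ExampleBot", "/private/report.html")
upper_blocked = not parser.can_fetch("ExampleBot", "/Private/report.html")

if __name__ == "__main__":
    print("lowercase path blocked:", lower_blocked)
    print("capitalized path blocked:", upper_blocked)
```

With this parser, a rule written as /private/ does not cover /Private/, so the case of each path in the file has to match the URL as the server serves it.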
Uses of the robots.txt file
1. This file is primarily used to give robots specific instructions about which pages of the site to index. By convention, it is the entry point to the site for any robot, and any good search engine will respect this standard.
2. It also helps to keep out unwanted robots suspected of spamming, harvesting email addresses, or stripping images.
3. It is a convenient way to keep private directories on the site out of search indexes.
4. A site with no robots.txt file can have indexing problems, especially where 404 errors are redirected to another page.
5. This file is very helpful on multilingual sites for directing a robot to content in a specific language only.
6. This file also helps to prevent duplicate content from being indexed and to stop robots from deluging servers with rapid-fire requests.
Drawbacks of the robots.txt file
Careless handling of this file may allow robots to index pages that you wish to keep confidential, and robots may snoop around places you would rather keep private. Because the file is publicly readable, some snooping robots may learn from it where your confidential content is and go straight there using the direct address. This can be prevented.
The following precautions may be observed while designing a robots.txt file:
1. Put confidential files in a subfolder of the folder barred by the robots file, so that the file itself does not reveal their exact location. Another safeguard is to password-protect the files with confidential content.
2. Always put one URL in one Disallow line. You can have as many lines as required.
3. When only one file is to be disallowed, write the complete file name with its extension and without a trailing forward slash. Where a whole directory is to be disallowed, end the line with a forward slash; there is no need to list every file in the directory.
4. Keep the robots.txt file in the root directory of the site; otherwise robots may not find it.
5. Use lowercase letters only.
6. To optimize the site, first list the robots you want to keep out altogether, followed by the robots for which only specific content is to be banned.
7. Avoid adding a comment (text after '#') on the same line as a directive. Put comments on lines of their own.
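Several of these precautions can be combined in a small generator script. The following is a minimal sketch, not a standard tool: the function name `build_robots_txt`, the robot names "BadBot" and "GoodBot", and the paths are all invented for illustration. It lists fully banned robots first, writes one path per Disallow line, ends directories with a slash, and keeps comments on their own lines.

```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(banned_agents, restricted_paths):
    """Build robots.txt text from a list of fully banned robots and a
    list of paths that every other robot must stay away from."""
    lines = []
    # Robots to keep out altogether come first, one record each.
    for agent in banned_agents:
        lines += [
            "# Keep this robot out of the whole site",
            f"User-agent: {agent}",
            "Disallow: /",
            "",  # a blank line ends the record
        ]
    # Then the record for all remaining robots, one path per line.
    lines += ["# Rules for every other robot", "User-agent: *"]
    lines += [f"Disallow: {path}" for path in restricted_paths]
    return "\n".join(lines) + "\n"


# A directory ends with a slash; a single file is given in full without one.
text = build_robots_txt(["BadBot"], ["/private/", "/drafts/old.html"])

# Read the result back with the standard library parser as a sanity check.
checker = RobotFileParser()
checker.parse(text.splitlines())

if __name__ == "__main__":
    print(text)
    print(checker.can_fetch("GoodBot", "/drafts/new.html"))
```

The generated text should be saved as robots.txt in the root directory of the site.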
