Robots.txt is a plain text file, not an HTML page, that is placed on a site to tell search robots which pages the site owner would not like them to visit. It is not a binding mandate for search engines, but reputable search engines usually obey what they are asked not to do. Right from the start, it should be clear that robots.txt is nothing like a firewall or password protection that would prevent search engines from reaching the site. So, a word of warning: if the data stored is really sensitive, relying on robots.txt alone is not a good way to keep it from being indexed and displayed in search results.
Why Is It Essential for SEO?
The location of the robots.txt file is an important detail for SEO. It must sit in the main (root) directory, or the user agents (search engine robots) will not be able to find it. Instead of searching the whole site for a file named robots.txt, they look for it only in the main directory; if it is absent there, they assume the site has no robots.txt and index everything they find along the way. So, if robots.txt is not placed in the right location, the search engines will index the whole site.
How Do They Work?
Search engines send out tiny programs called 'robots' or 'spiders' to crawl the site and bring information back so that the site's pages can be indexed in the search results and found by web users. The robots.txt file instructs these programs not to crawl pages on the site that the user designates with a 'Disallow' command. One such instance is provided below:
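A minimal sketch, assuming a hypothetical page at /thankyou/ (the path is an illustration, not from the original article):

```
User-agent: *
Disallow: /thankyou/
```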
A directive like this would block all search engine robots from visiting the designated page on the website.
Notice that the 'Disallow' command is preceded by another command, the 'User-agent' line.
The "User-agent:" part specifies which robot the user wants to block, and could also read as follows:
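For instance, targeting only Google's crawler, whose standard user-agent name is 'Googlebot' (the blocked path is again a hypothetical example):

```
User-agent: Googlebot
Disallow: /thankyou/
```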
This command would block only Google's robot, while other robots would still have access to the page.
By using the "*" character instead, the user specifies that the commands below it apply to all robots. The robots.txt file itself would be located in the main directory of the site. For example:
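Using the example domain that appears later in this article, the file would live at:

```
http://www.yoursite.com/robots.txt
```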
Why Do Some Pages Need To Be Blocked?
There are certain reasons why a user might want to block a page with the aid of the robots.txt file. They are:
If the user has a page on the site that is a replica of another page, he/she would definitely not want it indexed, because the duplicate content could hurt the site's SEO.
The user might have a page on the site that should not be reachable unless a visitor takes a specific action. For instance, if there is a 'thank you' page where users get access to specific information in return for providing their email id, there is no need for that page to be findable through a Google search.
The third case covers protecting private files on the site, such as the cgi-bin directory, and preventing bandwidth from being used up by robots indexing image files.
In all of the above cases, the user needs to include a command in the robots.txt file that tells the search engine spiders not to access the page, not to index it in search results, and not to send visitors to it.
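Put together, a robots.txt file covering all three cases might look like the following sketch (every path here is a hypothetical illustration):

```
User-agent: *
# A duplicate of another page, blocked to avoid duplicate-content issues
Disallow: /duplicate-page.html
# A 'thank you' page that should only be reached after submitting an email id
Disallow: /thankyou/
# Private scripts directory
Disallow: /cgi-bin/
# Image folder, blocked to save bandwidth
Disallow: /images/
```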
Creation Of The Robots.txt File:
This can easily be done by setting up a free Google Webmaster Tools account and then creating a robots.txt file by selecting the 'Crawler access' option under 'Site configuration' in the menu. From there, the user can select 'Generate robots.txt' and set up a simple robots.txt file. For instance:
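A simple starting point, under the assumption that nothing is blocked yet, is a file that lets all robots crawl everything; an empty 'Disallow:' value blocks nothing:

```
User-agent: *
Disallow:
```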
Note that the 'http://www.yoursite.com' part of each URL is left off; only the path after the domain is entered. To block particular pages, the process is as follows:
The user would type the path of each page into the 'Directories and files' field in Google Webmaster Tools.
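For instance, to block a hypothetical 'thank you' page and the cgi-bin directory, the entries would be just the paths, with the domain removed:

```
/thankyou/
/cgi-bin/
```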
After adding these rules for all robots and clicking 'Add rule' each time, the user ends up with a robots.txt file that looks something like this:
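For instance, blocking a hypothetical 'thank you' page and the cgi-bin directory for all robots would produce:

```
User-agent: *
Disallow: /thankyou/
Disallow: /cgi-bin/
```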
Note that there is also a default 'Allow' command, which becomes useful if the user wants to make an exception: a robot can be allowed to access a specific page inside a directory that has otherwise been blocked with a 'Disallow' command.
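For instance, the following sketch (with hypothetical paths) blocks an entire directory but makes an exception for one page inside it; 'Allow' is honored by Google and most major search engines:

```
User-agent: *
Disallow: /private/
Allow: /private/welcome.html
```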
Installation Of The Robots.txt File:
Once the robots.txt file is ready, it can be uploaded to the main (www) directory of the website. For this, an FTP program such as FileZilla can be used. The second option is to hire a web professional and let them know which pages the user thinks should be blocked.
Lastly, it is important to update the robots.txt file whenever pages, files, or directories are added to the site that the user does not wish to be indexed by search engines or accessed through search results. This will help keep those pages out of the search listings and yield the best possible results from the site's search engine optimization.