
    Using robots.txt

    1. User-agent directive
    2. Using Disallow and Allow directives
    3. Using special characters "*" and "$"
    4. Sitemap directive
    5. Host directive
    6. Additional information
    7. Crawl-delay directive
    8. Clean-param directive
    9. What is a robots.txt file?
    10. How to create robots.txt
    11. Exceptions
    1. User-agent directive

      You can control the Yandex robot's access to your site using a robots.txt file, which must reside in the root directory of the site. The Yandex robot supports the specification at http://www.robotstxt.org/wc/norobots.html with the extensions described below.

      The Yandex robot works in sessions. In each session, the robot generates a pool of pages that it plans to download. The session starts with downloading the robots.txt file of the site. If the file is missing, or the response to the robot's request is anything other than HTTP code 200, the robot takes this as a sign that access is not restricted in any way. The robot checks the robots.txt file for entries starting with 'User-agent:', searching for the substrings 'Yandex' or '*' (case insensitive). If 'User-agent: Yandex' is found, the directives for 'User-agent: *' are disregarded. If both the 'User-agent: Yandex' and 'User-agent: *' entries are missing, the robot takes this as a sign that access is not restricted in any way.

      You can specify the following directives targeting specific Yandex robots:

      • 'YandexBot' — the main indexing robot;
      • 'YandexMedia' — the robot that indexes multimedia data;
      • 'YandexImages' — the Yandex.Images indexer;
      • 'YandexCatalog' — the Yandex.Catalog robot;
      • 'YandexDirect' — the robot that indexes pages of sites participating in the Yandex advertising network;
      • 'YandexBlogs' — the blog search robot that indexes post comments;
      • 'YandexNews' — the Yandex.News robot;
      • 'YandexPagechecker' — the robot that accesses a page when microformats are validated using the Microformats validator form;
      • 'YandexMetrika' — the Yandex.Metrica robot;
      • 'YandexMarket' — the Yandex.Market robot;
      • 'YandexCalendar' — the Yandex.Calendar robot.

      For all of these robots, the following rule applies: if directives for a specific robot are found, the 'User-agent: Yandex' and 'User-agent: *' directives are disregarded.

      Example:

      User-agent: * # this directive will not be used by Yandex robots
      Disallow: /cgi-bin

      User-agent: Yandex # this directive will be used by all Yandex robots
      Disallow: /*sid= # except for the main indexing one

      User-agent: YandexBot # will be used only by the main indexing robot
      Disallow: /*id=
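
      The block selection described above can be sketched in a few lines of Python. This is only an illustrative approximation (not Yandex's actual code): a block addressed to the robot's own name is preferred over 'User-agent: Yandex', which in turn is preferred over 'User-agent: *'.

      # A minimal sketch (not Yandex's implementation) of choosing
      # the most specific User-agent block from a robots.txt file.

      def split_into_blocks(robots_txt):
          """Group robots.txt lines into (user_agents, rules) blocks."""
          blocks, agents, rules = [], [], []
          for raw in robots_txt.splitlines():
              line = raw.split('#', 1)[0].strip()      # drop comments
              if not line:
                  continue
              field, _, value = line.partition(':')
              field, value = field.strip().lower(), value.strip()
              if field == 'user-agent':
                  if rules:                            # a new block starts
                      blocks.append((agents, rules))
                      agents, rules = [], []
                  agents.append(value.lower())
              else:
                  rules.append((field, value))
          if agents or rules:
              blocks.append((agents, rules))
          return blocks

      def select_block(blocks, robot_name='yandexbot'):
          """Prefer the robot's own name, then 'yandex', then '*'."""
          for wanted in (robot_name, 'yandex', '*'):
              for agents, rules in blocks:
                  if wanted in agents:
                      return rules
          return []                                    # no matching block: no restrictions

      robots_txt = """
      User-agent: *
      Disallow: /cgi-bin

      User-agent: Yandex
      Disallow: /*sid=

      User-agent: YandexBot
      Disallow: /*id=
      """
      print(select_block(split_into_blocks(robots_txt)))   # [('disallow', '/*id=')]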
    2. Using Disallow and Allow directives

      Use the 'Disallow' directive to restrict the robot's access to specific parts of the site or to the entire site. Examples:

      User-agent: Yandex
      Disallow: / # blocks access to the entire site

      User-agent: Yandex
      Disallow: /cgi-bin # blocks access to the pages
                         # with paths starting with '/cgi-bin'

      Note:

      Empty line breaks are not allowed between the 'User-agent' and 'Disallow' ('Allow') directives, or between different 'Disallow' ('Allow') directives.

      At the same time, the standard recommends inserting an empty line break before each 'User-agent' directive.

      The '#' character is used for comments. Everything following this character, up to the first line break, is disregarded.

      Use the 'Allow' directive to allow the robot access to specific parts of the site or to the entire site. Examples:

      User-agent: Yandex
      Allow: /cgi-bin
      Disallow: /
      # disallows downloading anything except pages
      # with paths that begin with '/cgi-bin'

      Using directives jointly

      If several directives match a particular page of the site, the first one that appears in the selected User-agent block applies. Examples:

      User-agent: Yandex
      Allow: /cgi-bin
      Disallow: /
      # disallows downloading anything except pages
      # with paths that begin with '/cgi-bin'

      User-agent: Yandex
      Disallow: /
      Allow: /cgi-bin
      # disallows downloading anything at all from the site

      Using Disallow and Allow directives without parameters

      If there are no parameters specified for a directive, this is interpreted as follows:

      User-agent: Yandex
      Disallow: # same as Allow: /

      User-agent: Yandex
      Allow: # same as Disallow: /
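
      To illustrate the 'first matching directive wins' rule described above, here is a rough Python sketch. It assumes plain prefix matching only (the '*' and '$' extensions are covered in the next section) and the empty-value convention shown just above.

      # A minimal sketch of the "first matching directive wins" rule.
      # Only plain prefix matching is used here; '*' and '$' are ignored.

      def normalize(rules):
          """Apply the empty-value convention: 'Disallow:' == 'Allow: /',
          'Allow:' == 'Disallow: /'."""
          out = []
          for kind, prefix in rules:
              if prefix == '':
                  kind = 'allow' if kind == 'disallow' else 'disallow'
                  prefix = '/'
              out.append((kind, prefix))
          return out

      def is_allowed(rules, path):
          """Return True if path is allowed; the first matching rule decides."""
          for kind, prefix in normalize(rules):
              if path.startswith(prefix):
                  return kind == 'allow'
          return True                          # nothing matched: not restricted

      # 'Allow: /cgi-bin' before 'Disallow: /' -> only '/cgi-bin...' is allowed
      rules_a = [('allow', '/cgi-bin'), ('disallow', '/')]
      # 'Disallow: /' before 'Allow: /cgi-bin' -> everything is disallowed
      rules_b = [('disallow', '/'), ('allow', '/cgi-bin')]

      print(is_allowed(rules_a, '/cgi-bin/a.cgi'))   # True
      print(is_allowed(rules_a, '/index.html'))      # False
      print(is_allowed(rules_b, '/cgi-bin/a.cgi'))   # False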
    3. Using special characters "*" and "$"

      You can use the special characters '*' and '$' in the paths of Allow and Disallow directives, thereby specifying certain regular expressions. The '*' special character stands for any (including empty) character sequence. Examples:

      User-agent: Yandex
      Disallow: /cgi-bin/*.aspx # disallows access to '/cgi-bin/example.aspx'
                                # and '/cgi-bin/private/test.aspx'
      Disallow: /*private # disallows not only '/private',
                          # but also '/cgi-bin/private'

      '$' special character

      By default '*' is appended to the end of each rule contained in robots.txt, for example:

      User-agent: Yandex
      Disallow: /cgi-bin* # blocks access to the pages
                          # with paths that begin with '/cgi-bin'
      Disallow: /cgi-bin # the same

      To cancel the '*' appended by default to the end of a rule, use the '$' special character, for example:

      User-agent: Yandex
      Disallow: /example$ # disallows '/example',
                          # but does not disallow '/example.html'

      User-agent: Yandex
      Disallow: /example # disallows both '/example'
                         # and '/example.html'

      User-agent: Yandex
      Disallow: /example$ # only disallows '/example'
      Disallow: /example*$ # similar to 'Disallow: /example',
                           # disallows both '/example.html' and '/example'
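
      The behaviour of '*' and '$' can be imitated by translating each path pattern into a regular expression: '*' becomes '.*', a trailing '$' anchors the pattern, and an implicit '*' is appended otherwise. The following Python sketch is an approximation of these rules, not Yandex's implementation.

      import re

      # A rough sketch of how a '*'/'$'-style robots.txt pattern can be
      # turned into a regular expression.

      def pattern_to_regex(pattern):
          anchored = pattern.endswith('$')
          if anchored:
              pattern = pattern[:-1]            # strip the trailing '$'
          parts = [re.escape(p) for p in pattern.split('*')]
          regex = '.*'.join(parts)              # '*' matches any character sequence
          # without '$', an implicit '*' is appended to the end of the rule
          return re.compile(regex + ('$' if anchored else '.*'))

      def matches(pattern, path):
          return pattern_to_regex(pattern).match(path) is not None

      print(matches('/cgi-bin/*.aspx', '/cgi-bin/private/test.aspx'))  # True
      print(matches('/example$', '/example'))                          # True
      print(matches('/example$', '/example.html'))                     # False
      print(matches('/example', '/example.html'))                      # True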
    4. Sitemap directive

      If you use a Sitemap XML file to describe the URLs on your site that are available for indexing and would like to share this information with our indexing robot, specify the location of the file in the 'Sitemap' directive of your robots.txt (list all of the files if you have more than one):

      User-agent: Yandex
      Allow: /
      Sitemap: http://mysite.ru/site_structure/my_sitemaps1.xml
      Sitemap: http://mysite.ru/site_structure/my_sitemaps2.xml

      or

      User-agent: Yandex
      Allow: /

      User-agent: *
      Disallow: /

      Sitemap: http://mysite.ru/site_structure/my_sitemaps1.xml
      Sitemap: http://mysite.ru/site_structure/my_sitemaps2.xml

      Our indexing robot will remember the location of your Sitemap file(s) and process their contents each time it visits your site.
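
      As a small illustration, the Sitemap URLs (there may be several) can be collected from a robots.txt file with a few lines of Python; the sketch below is hypothetical and the URLs are just the placeholders used above.

      # Sketch: collect all Sitemap URLs listed in a robots.txt file.
      # The Sitemap directive is not tied to a particular User-agent block.

      def extract_sitemaps(robots_txt):
          sitemaps = []
          for raw in robots_txt.splitlines():
              line = raw.split('#', 1)[0].strip()
              field, _, value = line.partition(':')   # split at the first ':' only
              if field.strip().lower() == 'sitemap' and value.strip():
                  sitemaps.append(value.strip())
          return sitemaps

      example = """
      User-agent: Yandex
      Allow: /
      Sitemap: http://mysite.ru/site_structure/my_sitemaps1.xml
      Sitemap: http://mysite.ru/site_structure/my_sitemaps2.xml
      """
      print(extract_sitemaps(example))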

    5. Host directive

      If your site has mirrors, a special mirroring robot will locate them and generate a group of mirrors for your site. Only the main mirror will participate in the search. You can specify the main mirror for all the mirrors in the robots.txt file using the 'Host' directive, giving the name of the main mirror as the directive's parameter. The 'Host' directive does not guarantee that the specified main mirror will be selected, but the decision-making algorithm takes it into account with high priority. Example:

      # Let's assume that www.main-mirror.com is the main mirror of the site. Then
      # robots.txt for all the sites from the mirror group will look as follows:
      User-Agent: *
      Disallow: /forum
      Disallow: /cgi-bin
      Host: www.main-mirror.com

      THIS IS IMPORTANT: To achieve compatibility with robots that somewhat deviate from standard behaviour when processing robots.txt, the 'Host' directive must be added to the group that starts from the 'User-Agent' entry, right after the 'Disallow'('Allow') directive(s). The 'Host' directive takes as an argument a domain name with port number (80 by default), separated by a colon.

      # Example of a well-formed robots.txt, during parsing
      # of which the Host directive will be taken into account
      User-Agent: *
      Disallow:
      Host: www.myhost.ru

      However, the Host directive is an intersectional one, so it will be used by the robot regardless of its location in robots.txt.

      THIS IS IMPORTANT: Only one Host directive is allowed in robots.txt. If several directives are specified, only one of them will be used.

      Example:

      Host: myhost.ru # used

      User-agent: *
      Disallow: /cgi-bin

      User-agent: Yandex
      Disallow: /cgi-bin
      Host: www.myhost.ru # not used

      THIS IS IMPORTANT: the parameter of the Host directive must contain one well-formed host name (i.e. the one compliant with RFC 952 and not an IP address) and a valid port number. Badly formed 'Host:' lines will be ignored.

      # Examples of Host directives that will be ignored
      Host: www.myhost-.ru
      Host: www.-myhost.ru
      Host: www.myhost.ru:100000
      Host: www.my_host.ru
      Host: .my-host.ru:8000
      Host: my-host.ru.
      Host: my..host.ru
      Host: www.myhost.ru/
      Host: www.myhost.ru:8080/
      Host: http://www.myhost.ru
      Host: 213.180.194.129
      Host: www.firsthost.ru,www.secondhost.ru
      Host: www.firsthost.ru www.secondhost.ru

      Examples of using Host directive:

      # if domain.myhost.ru is the main mirror for
      # www.domain.myhost.ru, then the correct usage of
      # the Host directive is as follows:
      User-Agent: *
      Disallow:
      Host: domain.myhost.ru

      # if domain.myhost.ru is the main mirror for
      # www.domain.myhost.ru, then incorrect usage of
      # the Host directive would be as follows:
      User-Agent: *
      Disallow:
      Host: myhost.ru
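
      The constraints listed above (a single well-formed host name, no scheme or trailing slash, no IP address, a sensible port number) can be approximated with a check like the following Python sketch. It only mirrors the rules described in this section and is not Yandex's actual validator.

      import re

      # Approximate check of a Host directive value against the constraints
      # described in this section (a sketch, not Yandex's validator).

      LABEL = re.compile(r'^[A-Za-z0-9]([A-Za-z0-9-]*[A-Za-z0-9])?$')

      def host_value_is_valid(value):
          if '://' in value or '/' in value or ' ' in value or ',' in value:
              return False                       # no scheme, path, or host lists
          host, _, port = value.partition(':')
          if port and not (port.isdigit() and 1 <= int(port) <= 65535):
              return False                       # port must be a sane number
          labels = host.split('.')
          if any(not LABEL.match(label) for label in labels):
              return False                       # empty, underscored, or hyphen-edged labels
          if all(label.isdigit() for label in labels):
              return False                       # looks like an IP address
          return True

      for value in ['www.myhost.ru', 'myhost.ru:8080',          # valid
                    'www.my_host.ru', 'my..host.ru', 'www.myhost.ru/',
                    'http://www.myhost.ru', '213.180.194.129',
                    'www.myhost.ru:100000']:                     # invalid
          print(value, host_value_is_valid(value))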
    6. Additional information

      The Yandex robot does not support robots.txt directives that are not mentioned in this document.

      Please take into account that the result of using the robots.txt format extensions may differ from the result obtained without them, i.e.:

      User-agent: Yandex
      Allow: /
      Disallow: /
      # when extensions are not used, this disallows everything because 'Allow: /' is ignored,
      # while, when the extensions are supported, everything is allowed

      User-agent: Yandex
      Disallow: /private*html
      # when extensions are not used, this disallows '/private*html',
      # and when extensions are supported, this also disallows '/private*html',
      # '/private/test.html', '/private/html/test.aspx', etc.

      User-agent: Yandex
      Disallow: /private$
      # when extensions are not used, this disallows '/private$', '/private$test', etc.,
      # and when extensions are supported, this only disallows '/private'

      User-agent: *
      Disallow: /
      User-agent: Yandex
      Allow: /
      # when extensions are not supported, then, because there is no line break,
      # 'User-agent: Yandex' would be ignored and
      # the result would be 'Disallow: /', but the Yandex robot
      # identifies entries by the 'User-agent:' substring,
      # and the result for the Yandex robot in this particular case is 'Allow: /'

      User-agent: *
      Disallow: /
      # comment1...
      # comment2...
      # comment3...
      User-agent: Yandex
      Allow: /
      # similar to the previous example

      Examples of using the extended robots.txt format:

      User-agent: Yandex
      Allow: /archive
      Disallow: /
      # allows everything in '/archive' and disallows all the rest

      User-agent: Yandex
      Allow: /obsolete/private/*.html$ # allows html files
                                       # with paths of '/obsolete/private/...'
      Disallow: /*.php$ # disallows all '*.php' on this site
      Disallow: /*/private/ # disallows all subpaths containing
                            # '/private/', but the Allow directive above overrides
                            # part of this restriction
      Disallow: /*/old/*.zip$ # disallows all '*.zip' files whose paths contain
                              # '/old/'

      User-agent: Yandex
      Disallow: /add.php?*user=
      # disallows all 'add.php?' scripts with the 'user' parameter

      When creating a robots.txt file, remember that there is a limit on the size of the file the robot can process. Files that are too big (over 32 KB) are interpreted as unrestricting, i.e. treated as equivalent to the following:

      User-agent: Yandex
      Disallow:

      A robots.txt file that the robot was unable to download (for example, due to incorrect HTTP headers) or that returns a 404 error is also regarded as unrestricting.

      To validate your robots.txt file, you can use a special online analyzer. See the description of the robots.txt analyzer.
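
      A rough Python sketch of how a client could apply the two rules above (the 32 KB size limit and treating an unavailable file as unrestricting); fetching with urllib is just one possible approach, and the example URL is a placeholder.

      import urllib.request
      import urllib.error

      # Sketch: fetch robots.txt and fall back to "no restrictions" when the
      # file is too large, missing, or cannot be downloaded.

      SIZE_LIMIT = 32 * 1024          # 32 KB

      def fetch_robots_rules(url):
          """Return the robots.txt text, or '' (unrestricted) on any problem."""
          try:
              with urllib.request.urlopen(url, timeout=10) as response:
                  if response.status != 200:
                      return ''                      # treated as unrestricting
                  body = response.read(SIZE_LIMIT + 1)
                  if len(body) > SIZE_LIMIT:
                      return ''                      # oversized file: unrestricting
                  return body.decode('utf-8', errors='replace')
          except (urllib.error.URLError, OSError):
              return ''                              # download failed: unrestricting

      # Example (hypothetical host):
      # rules = fetch_robots_rules('http://www.example.com/robots.txt')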

    7. Crawl-delay directive

      If the server is overloaded and does not have time to process download requests, use the Crawl-delay directive. It lets you specify the minimum interval (in seconds) that the search robot should wait after downloading one page before starting to download the next. To achieve compatibility with robots that somewhat deviate from the standard behaviour when processing robots.txt, the Crawl-delay directive must be added to the group that starts with the 'User-Agent' entry, right after the 'Disallow' ('Allow') directive(s).

      The Yandex search robot supports fractional values for Crawl-delay, e.g. 0.5. This does not mean that the search robot will access your site every half second, but it gives the robot more freedom and may speed up site processing.

      Examples:

      User-agent: Yandex
      Crawl-delay: 2 # specifies a delay of 2 seconds

      User-agent: *
      Disallow: /search
      Crawl-delay: 4.5 # specifies a delay of 4.5 seconds
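
      As an illustration of what honouring a fractional Crawl-delay looks like on the client side, here is a hypothetical Python sketch of a polite download loop; the delay value and URLs are placeholders.

      import time
      import urllib.request

      # Sketch of a polite crawler loop that waits Crawl-delay seconds
      # between requests; the delay value and URLs are placeholders.

      CRAWL_DELAY = 4.5   # seconds, e.g. taken from 'Crawl-delay: 4.5'

      def polite_fetch(urls, delay=CRAWL_DELAY):
          last_request = 0.0
          for url in urls:
              wait = delay - (time.monotonic() - last_request)
              if wait > 0:
                  time.sleep(wait)        # keep at least `delay` between requests
              last_request = time.monotonic()
              with urllib.request.urlopen(url, timeout=10) as response:
                  yield url, response.read()

      # for url, body in polite_fetch(['http://example.com/a', 'http://example.com/b']):
      #     print(url, len(body))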
      
      
    8. Clean-param directive

      If your site's page addresses contain dynamic parameters that do not affect the content (e.g. identifiers of sessions, users, referrers, etc.), you can describe them using the 'Clean-param' directive. Using this information, the Yandex robot will avoid repeatedly reloading duplicated content. This makes the robot's processing of your site more efficient and reduces the server load.

      For example, your site has the following pages:

      www.site.ru/some_dir/get_book.pl?ref=site_1&book_id=123
      www.site.ru/some_dir/get_book.pl?ref=site_2&book_id=123
      www.site.ru/some_dir/get_book.pl?ref=site_3&book_id=123

      Here the 'ref=' parameter is only used to track the resource from which the request was sent and does not change the content; the same book, 'book_id=123', will be displayed at all three addresses. Then, if you specify the following in robots.txt:

      Clean-param: ref /some_dir/get_book.pl

      like this:

      User-agent: Yandex
      Disallow:
      Clean-param: ref /some_dir/get_book.pl

      The Yandex robot will converge all the page addresses into one:

      www.site.ru/some_dir/get_book.pl?ref=site_1&book_id=123

      If a parameterless page is available on the site,

      www.site.ru/some_dir/get_book.pl?book_id=123

      then, after the robot indexes it, the other addresses will be consolidated into it. Other pages of your site will be crawled more often, because there will be no need to crawl the following pages:

      www.site.ru/some_dir/get_book.pl?ref=site_2&book_id=123
      www.site.ru/some_dir/get_book.pl?ref=site_3&book_id=123

      Syntax for using the directive:

      Clean-param: p0[&p1&p2&..&pn] [path]

      In the first field you list the parameters that must be disregarded, delimited with '&'. In the second field, specify the path prefix for the pages to which the rule must be applied.

      THIS IS IMPORTANT: the Clean-param directive is an intersectional one, so it will be used by the robot regardless of its location in robots.txt. If several directives are specified, all of them will be taken into account by the robot.

      Note:

      A prefix may contain a regular expression in a format similar to the one used in robots.txt, with some restrictions: the only characters allowed are A-Za-z0-9.-/*_, and '*' is interpreted in the same way as in robots.txt. A '*' is always implicitly appended to the end of the prefix, i.e.:

      Clean-param: s /forum/showthread.php

      means that the 's' parameter will be disregarded for all URLs that begin with /forum/showthread.php. The second field is optional. If it is omitted, the rule is applied to all pages of the site. Everything is case sensitive. A rule cannot exceed 500 characters in length. For example:

      Clean-param: abc /forum/showthread.php
      Clean-param: sid&sort /forumt/*.php
      Clean-param: someTrash&otherTrash

      Additional examples:

      # for addresses of the following type:
      www.site1.ru/forum/showthread.php?s=681498b9648949605&t=8243
      www.site1.ru/forum/showthread.php?s=1e71c4427317a117a&t=8243
      # robots.txt will contain:
      User-agent: Yandex
      Disallow:
      Clean-param: s /forum/showthread.php

      # for addresses of the following type:
      www.site2.ru/index.php?page=1&sort=3a&sid=2564126ebdec301c607e5df
      www.site2.ru/index.php?page=1&sort=3a&sid=974017dcd170d6c4a5d76ae
      # robots.txt will contain:
      User-agent: Yandex
      Disallow:
      Clean-param: sid /index.php

      # if there is more than one such parameter:
      www.site1.ru/forum_old/showthread.php?s=681498605&t=8243&ref=1311
      www.site1.ru/forum_new/showthread.php?s=1e71c417a&t=8243&ref=9896
      # robots.txt will contain:
      User-agent: Yandex
      Disallow:
      Clean-param: s&ref /forum*/showthread.php

      # if a parameter is used in more than one script:
      www.site1.ru/forum/showthread.php?s=681498b9648949605&t=8243
      www.site1.ru/forum/index.php?s=1e71c4427317a117a&t=8243
      # robots.txt will contain:
      User-agent: Yandex
      Disallow:
      Clean-param: s /forum/index.php
      Clean-param: s /forum/showthread.php
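
      The grouping effect of Clean-param can be imitated with a short Python sketch that strips the listed parameters from URLs whose paths match the prefix. This follows the syntax described above, treating the prefix as a literal prefix with '*' wildcards; it is an illustration, not Yandex's implementation.

      import re
      from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

      # Sketch: drop the parameters named in a Clean-param rule from URLs
      # whose path matches the rule's prefix.

      def clean_param(rule, url):
          """rule is the value of a Clean-param line, e.g. 's&ref /forum*/showthread.php'."""
          params, _, prefix = rule.partition(' ')
          names = set(params.split('&'))
          prefix = prefix.strip() or '/'                     # missing prefix: whole site
          prefix_re = re.compile('.*'.join(map(re.escape, prefix.split('*'))) + '.*')
          parts = urlsplit(url)
          if not prefix_re.match(parts.path):
              return url                                     # rule does not apply
          kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
                  if k not in names]
          return urlunsplit(parts._replace(query=urlencode(kept)))

      rule = 's&ref /forum*/showthread.php'
      print(clean_param(rule, 'http://www.site1.ru/forum_old/showthread.php?s=681498605&t=8243&ref=1311'))
      # -> http://www.site1.ru/forum_old/showthread.php?t=8243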
    9. What is a robots.txt file?

      Robots.txt is a text file that resides on your site and is intended for search engine robots. In this file, the webmaster can specify indexing parameters for the site, either for all robots at once or for each search engine individually.

    10. How to create robots.txt

      Using any text editor (for example, Notepad or WordPad), create a file named robots.txt and fill it in according to the rules presented below. Then place the file in the root directory of your site.

      To make sure that your robots.txt file will be processed correctly, use the robots.txt file analyzer.

    11. Exceptions

      A number of Yandex robots download web documents for purposes other than indexing. To avoid being unintentionally blocked by site owners, these robots are not subject to the generic robots.txt restrictions (User-agent: *). Some robots.txt restrictions may also be ignored on certain sites if an agreement has been reached between Yandex and the site owners.

      Important: if one of these robots downloads a webpage not normally accessible by the Yandex indexing robot, this webpage will never be indexed and will not appear in our search results.

      List of Yandex robots not subject to standard robots.txt restrictions:

      • YandexDirect downloads ad landing pages to check their availability and content. This is compulsory for placing ads in Yandex search results and on YAN partner sites;
      • YandexCalendar regularly downloads calendar files requested by users, even if they are located in directories that are blocked from indexing.

      To prevent this behaviour, you can restrict these robots' access to some or all of your site using robots.txt disallow directives, for example:

      User-agent: YandexDirect
      Disallow: /

      User-agent: YandexCalendar
      Disallow: /*.ics$