A Robots Processing Instruction for XML Documents

Draft 2 December 1999, comments to wunder@infoseek.com

Walter Underwood
Infoseek Corporation
6 October 1999

The robots processing instruction ("robots PI") is a simple mechanism for telling visiting Web robots whether a document should be indexed and whether links in the document should be followed. In HTML documents, the Robots META tag (Koster 1996, Raggett et al 1998 Section B.4.1) serves the same purpose.

The robots PI differs from the Standard for Robot Exclusion (Koster 1994, Kollar et al 1996) in that the instructions to the robot are in the document itself instead of in a "/robots.txt" file. An author often does not have permission to change the /robots.txt file stored at the root of the web server, but can always change their own document.

The XML 1.0 specification (Bray et al 1998) says, "Processing instructions (PIs) allow documents to contain instructions for applications." The robots PI is an instruction to robot applications.

"Following links" in XML requires recognizing links in the document. The format for links in XML is a work in process at this writing. See the XLink working draft to track the progress. Eventually, there should be an XLink recommendation. See the W3C's XML Activity for the status of the XLink specification.

Where to put the robots processing instruction

The robots PI should occur once before any indexable text or followable links. It should be in the internal subset (not in an external DTD or parameter entity). Since robots may be non-validating, a robots PI in the external subset might not be seen by the robot.

Here is an example, with a made-up document type:


<?xml version="1.0"?>
<?robots index="no" follow="yes"?>
<!-- we want robots to find the articles, but not index this summary -->
<headlines>
<title>Technology Headlines</title>
<story>
<storytitle xlink:type="simple"
   xlink:href="tech/xml_number_1.xml">XML Wins Again!</storytitle>
</story>
</headlines>

Syntax of the robots processing instruction

The robots processing instruction has a format similar to an XML start-tag, with two "attributes". The attribute names are index and follow. The value of each attribute can be yes or no. Both attributes must be present, and the index attribute must come first.

Specifying index="yes" instructs an indexing robot that it is OK to index the page. Specifying follow="yes" instructs an indexing robot that it is OK to follow links on the page. If a document does not contain a robots PI, the robot may both index the document and follow the links.

If the only robots PI in a document has illegal syntax, the robot should report or log the URL and the syntax error, then index and follow. If a document contains more than one robots PI, the first one with legal syntax should be obeyed.
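The decision rules above can be sketched as a small function. This is only an illustration, not part of the proposal: the name resolve_directives, its arguments, and the report callback are hypothetical, and the function assumes that each robots PI has already been checked for syntax and is passed in as either an (index, follow) pair of booleans or None for illegal syntax.

```python
def resolve_directives(url, parsed_pis, report=print):
    """Decide what a robot may do with one document.

    parsed_pis holds one entry per robots PI, in document order. Each entry
    is an (index, follow) tuple of booleans, or None when that PI had
    illegal syntax. (This input shape is an assumption of the sketch.)
    """
    for position, flags in enumerate(parsed_pis, start=1):
        if flags is not None:
            # The first robots PI with legal syntax is obeyed.
            return flags
        # Illegal syntax: report the URL and the error, then keep looking.
        report(f"{url}: robots PI #{position} has illegal syntax")
    # No legal robots PI at all: the robot may index and follow.
    return (True, True)
```

With this shape, a document with no robots PI and a document whose only robots PI is malformed both resolve to "index and follow"; the difference is that the second case produces a report.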

Using the EBNF conventions from the XML 1.0 specification:


RobotsPI   ::= '<?robots' S IndexSpec S FollowSpec '?>'
IndexSpec  ::= 'index="yes"'
             | 'index="no"'
FollowSpec ::= 'follow="yes"'
             | 'follow="no"'
/* definition of white space from XML 1.0 grammar */
S          ::= (#x20 | #x9 | #xD | #xA)+

There are exactly four legal forms for the robots PI, ignoring differences in white space. They are:


<?robots index="yes" follow="yes"?>
<?robots index="no" follow="yes"?>
<?robots index="yes" follow="no"?>
<?robots index="no" follow="no"?>
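As an illustration, the grammar can be transcribed almost directly into a regular expression. The following Python sketch is not part of the proposal; the names S, ROBOTS_PI, and parse_robots_pi are hypothetical, and the function assumes it is handed the complete text of one processing instruction, from '<?' through '?>'.

```python
import re

# S from the XML 1.0 grammar: one or more of space, tab, CR, LF.
S = r"[\x20\x09\x0D\x0A]+"

# RobotsPI ::= '<?robots' S IndexSpec S FollowSpec '?>'
ROBOTS_PI = re.compile(
    r"<\?robots" + S + r'index="(yes|no)"' + S + r'follow="(yes|no)"\?>'
)


def parse_robots_pi(text):
    """Return (index, follow) as booleans, or None if text is not a legal robots PI."""
    match = ROBOTS_PI.fullmatch(text)
    if match is None:
        return None
    return match.group(1) == "yes", match.group(2) == "yes"
```

Because the grammar is this strict, the sketch rejects single-quoted values, a reversed attribute order, a missing attribute, and white space before the closing '?>'.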

XML-Conformant HTML

A document that uses an XML-conformant HTML DTD may contain both a robots PI and a Robots META tag. If both are present, they must indicate the same combination of index and follow actions. A robot should obey the directive that matches the parser in use (robots PI for XML, Robots META tag for HTML).
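An author or a checking tool might verify that the two directives agree. The sketch below is a hypothetical helper, not part of the proposal. It assumes the Robots META content terms described in the Robots META tag guide (Koster 1996): index, noindex, follow, nofollow, all, and none, separated by commas and compared without regard to case. It also assumes that a term left unstated defaults to permission.

```python
def meta_flags(content):
    """Translate a Robots META content value into (index, follow) booleans."""
    index, follow = True, True  # assumed defaults when a term is not stated
    for term in content.lower().split(","):
        term = term.strip()
        if term in ("noindex", "none"):
            index = False
        if term in ("nofollow", "none"):
            follow = False
    return index, follow


def directives_agree(pi_flags, meta_content):
    """True when the robots PI flags and the Robots META tag say the same thing."""
    return pi_flags == meta_flags(meta_content)
```

For example, a robots PI with index="no" follow="yes" agrees with a META content of "noindex,follow" but not with "none".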

Sample Implementation

To encourage use of the robots PI, sample Java code is available (RobotsHandlerBase.java). This class is written to be used with SAX parsers. It implements a processingInstruction() handler which parses a robots PI and sets flags for follow and index. shouldIndex() and shouldFollow() accessors are provided. SAX handler classes may extend RobotsHandlerBase to take advantage of this functionality. See the RobotsHandlerBase API documentation for details.
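For readers without the Java class at hand, the same idea can be sketched with the SAX interface in Python's standard library. This is not RobotsHandlerBase and makes no claim about its API; the class name RobotsHandler, the errors list, and the read_directives helper are hypothetical. The sketch assumes the parser passes the PI data to the handler without the white space that follows the target.

```python
import re
import xml.sax

# Data part of a legal robots PI, i.e. the text after the "robots" target.
_DATA = re.compile(r'index="(yes|no)"[\x20\x09\x0D\x0A]+follow="(yes|no)"')


class RobotsHandler(xml.sax.ContentHandler):
    """SAX handler that records the first legal robots PI in a document."""

    def __init__(self):
        super().__init__()
        self._index = True     # defaults apply when no legal robots PI is seen
        self._follow = True
        self._decided = False  # True once a legal robots PI has been obeyed
        self.errors = []       # data of robots PIs with illegal syntax

    def processingInstruction(self, target, data):
        if target != "robots" or self._decided:
            return
        match = _DATA.fullmatch(data)
        if match is None:
            self.errors.append(data)  # a robot would log this with the URL
            return
        self._index = match.group(1) == "yes"
        self._follow = match.group(2) == "yes"
        self._decided = True

    def should_index(self):
        return self._index

    def should_follow(self):
        return self._follow


def read_directives(xml_bytes):
    """Parse a document held in memory and return (index, follow)."""
    handler = RobotsHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.should_index(), handler.should_follow()
```

A handler that also indexes text or collects links could subclass RobotsHandler and consult should_index() and should_follow() in its own callbacks.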

Security considerations

The robots PI is not an access control mechanism, and does not make documents inaccessible to spiders.

Rationale

The Robots META tag has been used successfully since 1996. The robots PI is designed to provide exactly the same functionality to XML document authors. It does not go beyond the functionality of the Robots META tag for two reasons: first, the Robots META tag has proven adequate; second, expanding the semantics would probably require design changes to robots (for example, a robot's database might include an index bit and a follow bit for each document).

References

Bray et al, 1998
Extensible Markup Language (XML) 1.0, eds. Tim Bray, Jean Paoli, and C. M. Sperberg-McQueen. 10 February 1998. Available on-line at http://www.w3.org/TR/REC-xml
Raggett et al, 1998
HTML 4.0 Specification, eds. Dave Raggett, Arnaud Le Hors, and Ian Jacobs, 24 April 1998. Available on-line at http://www.w3.org/TR/REC-html40/appendix/notes.html#h-B.4.1.2
Kollar et al, 1996
Robot Exclusion Standard Revisited, Charles P. Kollar, John R. R. Leavitt, and Michael Mauldin, June 1996. Available on-line at http://www.kollar.com/robots.html
Koster, 1994
A Standard for Robot Exclusion, Martijn Koster, circa June 1994. Available on-line at http://info.webcrawler.com/mak/projects/robots/norobots.html
Koster, 1996
HTML Author's Guide to the Robots META tag, Martijn Koster, site maintainer. Available on-line at http://info.webcrawler.com/mak/projects/robots/meta-user.html
Megginson et al, 1998
SAX 1.0: The Simple API for XML, David Megginson and members of the xml-dev mailing list. See web site at http://www.megginson.com/SAX/

Walter Underwood/wunder@infoseek.com