RDF Sitemaps

RDF sitemaps using HTML link types and Dublin Core

Author: Dan Brickley
Date: 26th January 1999
Version: first draft
Latest Version: http://purl.org/net/rdf/papers/sitemap/
Status: This is very much thinking-out-loud work in progress. I wanted to see whether sitemaps using DC + HTML links looked plausible, so I tried to write it up plausibly... comments etc to daniel.brickley@bristol.ac.uk
--Dan


Summary

This note proposes a minimalistic approach to the representation of Web sitemaps using the W3C Resource Description Framework (RDF). Because RDF can create data structures that draw upon multiple vocabularies, a minimal foundation for Web sitemap representation can be extended simply by drawing upon further RDF vocabularies. This document presents a general approach and also proposes a common sitemaps vocabulary consisting of the typed links enumerated in the HTML 4.0 specification, combined with the Dublin Core properties for discovery-oriented resource description.

This note does not propose any new vocabulary, architectural or protocol components to support Web sitemaps. The term 'sitemap' is used here to refer to hierarchically-oriented representations of Web content such as might appear in a 'table of contents' for a Web site, a bookmarking system in a browser or in a 'tree control'.


Contents

  1. Introduction
  2. RDF Background
  3. HTML Linking
  4. RDF Sitemaps
  5. Appendix A: An RDF Schema for the HTML link types
  6. References

Introduction

RDF Background

The Resource Description Framework provides, as might be expected, a framework for describing resources. The model is simple: RDF inhabits a world consisting of 'resources'. These are things which might be assigned one or more Uniform Resource Identifiers (URIs). A subset of resources are known as RDF 'properties', and can be used to describe the characteristics or attributes of other Web resources. RDF statements have three components: they describe a (URI named) property of some (URI named) web resource, and give the value of that property either as another (URI named) resource or as a literal string such as "72" or "Web Consortium". Some properties can be used in many contexts; others make sense only when applied to some sub-category of resources or when the value of the property takes some constrained form. The RDF Model and Syntax specification describes the abstract RDF data model in greater detail than can be given here, and provides an XML-based syntax for representing RDF statements in textual form. RDF vocabularies are defined according to the RDF Schema specification, which provides basic machinery for defining new properties, categories of resources ('classes') and constraints on the sensible or consistent use to which these properties and classes can be put. RDF itself defines only a very minimalist set of properties and classes, sufficient to describe the RDF model and schema concepts themselves.
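
As an illustrative sketch only, the three-part statement described above can be modelled as a small record. Python is used purely for exposition; the names Statement and properties_of are hypothetical and not part of any RDF specification, and the absolute URI given for the 'next' value is an assumption.

```python
from typing import NamedTuple

class Statement(NamedTuple):
    """One RDF statement: a property of a resource, and its value."""
    prop: str                        # URI naming the property
    resource: str                    # URI naming the resource being described
    value: str                       # another resource's URI, or a literal string
    value_is_literal: bool = False   # True when `value` is a literal such as "72"

# Example statements, using URIs that appear in the examples of this note.
statements = [
    Statement("http://purl.org/dc/elements/1.0#Title",
              "http://www.w3.org/TR/REC-html40/types.html",
              "Basic HTML Data Types", value_is_literal=True),
    Statement("http://purl.org/net/rdf/papers/sitemap#next",
              "http://www.w3.org/TR/REC-html40/types.html",
              "http://www.w3.org/TR/REC-html40/struct/global.html"),
]

def properties_of(resource, stmts):
    """Return {property URI: [values]} for every statement about `resource`."""
    described = {}
    for s in stmts:
        if s.resource == resource:
            described.setdefault(s.prop, []).append(s.value)
    return described
```

Calling properties_of with the URI of 'types.html' gathers both the literal-valued Title and the resource-valued 'next' link, which is all a simple sitemap display needs.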

HTML Linking

The HTML 4 specification [http://www.w3.org/TR/REC-html40/] defines a core set of "link types" for describing the relationship between web documents [http://www.w3.org/TR/REC-html40/types.html#h-6.12]. These link types are reproduced below:

[*** to do: fix dud hyperlinks ]

Alternate
Designates substitute versions for the document in which the link occurs. When used together with the lang attribute, it implies a translated version of the document. When used together with the media attribute, it implies a version designed for a different medium (or media).
Stylesheet
Refers to an external style sheet. See the section on external style sheets for details. This is used together with the link type "Alternate" for user-selectable alternate style sheets.
Start
Refers to the first document in a collection of documents. This link type tells search engines which document is considered by the author to be the starting point of the collection.
Next
Refers to the next document in a linear sequence of documents. User agents may choose to preload the "next" document, to reduce the perceived load time.
Prev
Refers to the previous document in an ordered series of documents. Some user agents also support the synonym "Previous".
Contents
Refers to a document serving as a table of contents. Some user agents also support the synonym ToC (from "Table of Contents").
Index
Refers to a document providing an index for the current document.
Glossary
Refers to a document providing a glossary of terms that pertain to the current document.
Copyright
Refers to a copyright statement for the current document.
Chapter
Refers to a document serving as a chapter in a collection of documents.
Section
Refers to a document serving as a section in a collection of documents.
Subsection
Refers to a document serving as a subsection in a collection of documents.
Appendix
Refers to a document serving as an appendix in a collection of documents.
Help
Refers to a document offering help (more information, links to other sources information, etc.)
Bookmark
Refers to a bookmark. A bookmark is a link to a key entry point within an extended document. The title attribute may be used, for example, to label the bookmark. Note that several bookmarks may be defined in each document.

These are intended for use with the LINK element in HTML 4.0 documents. The HTML specification notes that "Authors may wish to define additional link types not described in this specification. If they do so, they should use a profile to cite the conventions used to define the link types". The HTML specification does not define a syntax for profiles defining new types of link, but does ensure that profiles are identified using URIs.

This proposal does not require that HTML profiles are expressed using RDF. It does however require that the core HTML link types introduced at [http://www.w3.org/TR/REC-html40/types.html#h-6.12] have a representation within the RDF model, and for this they need to be addressable using URI references. One approach to assigning URIs to the HTML link types would be to produce a URI for each link type by combining the URI for the HTML 4.0 specification with the name of each link type. An alternative approach, adopted here, is to use an entirely separate set of URIs.

An additional complication is that the HTML link type names are case insensitive, whereas RDF identifiers are case sensitive. For this reason, any RDF representation of these types of link must choose one particular form (ie. 'Next', 'NEXT' and 'next' are all equivalent link types in HTML 4.0, but in RDF would be modelled as separate properties). The convention in RDF is to use initial lower case letters to name properties, where properties are the 'arc labels' or 'link types' which connect web resources within an RDF graph.

The most natural representation of HTML link types in the RDF model is as properties, ie. as typed links between web resources. Consequently this note adopts the convention of 'initial lower case', and represents the HTML 4.0 link type 'Prev' as an RDF property whose URI is 'http://purl.org/net/rdf/papers/sitemap#prev'. We use the URI of this document as the basis for the RDF URIs.
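
The naming convention above could be applied mechanically along the following lines. This is a minimal sketch, not part of the proposal: the function name link_type_to_property is hypothetical, and folding the synonyms 'Previous' and 'ToC' onto 'prev' and 'contents' is an assumption made here for illustration (HTML 4.0 only says that some user agents support them).

```python
SITEMAP_NS = "http://purl.org/net/rdf/papers/sitemap#"

# The fifteen HTML 4.0 link types, in the lower-case form adopted by this note.
LINK_TYPES = {
    "alternate", "stylesheet", "start", "next", "prev", "contents",
    "index", "glossary", "copyright", "chapter", "section",
    "subsection", "appendix", "help", "bookmark",
}

# ASSUMPTION: synonyms are folded onto the canonical name. This is a choice
# made for this sketch only, not something the note or HTML 4.0 requires.
SYNONYMS = {"previous": "prev", "toc": "contents"}

def link_type_to_property(rel):
    """Map one HTML link type name to an RDF property URI, or None if unknown."""
    name = rel.strip().lower()           # HTML link type names are case insensitive
    name = SYNONYMS.get(name, name)
    return SITEMAP_NS + name if name in LINK_TYPES else None
```

Under these assumptions 'Prev', 'PREV' and 'prev' all yield the single property URI given above, while a link type outside the HTML 4.0 set yields None and is left for some other vocabulary to describe.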

Extensibility: Relationship to other vocabularies

The XML Link working group are investigating richer typed linking in an XML context. A working group within the Dublin Core initiative are creating a specialised set of relationship types for use with the DC:Relation property. There are many other possible sources of relationship types that could be used in creating Web sitemaps. This document uses a combination of the HTML 4.0 linking vocabulary and the unqualified Dublin Core properties since they represent stable and well known 'best practice' specifications for Web content. Other types of relationship can easily be included within this approach to sitemaps, so long as an RDF representation of those relationships is available. RDF Schema provides a minimalist mechanism for expressing relationships between properties: rdfs:subPropertyOf. This allows us to say, for example, that rs:copyright is an rdfs:subPropertyOf dc:Rights, ie. the former is a more specialised property whose presence can be taken to imply the presence of the latter. This might be used to indicate 'specialisation' relationships between the simple HTML-derived vocabulary items used here and other useful sets of relationship types.
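
A rough sketch of how an application might act on such a subPropertyOf declaration follows. It is illustrative only: the function name with_implied and the example.org URIs in the usage note are hypothetical, the single rs:copyright to dc:Rights mapping is the example from the paragraph above rather than a published schema, and triples are written as (property, subject, object).

```python
RS = "http://purl.org/net/rdf/papers/sitemap#"
DC = "http://purl.org/dc/elements/1.0#"

# property URI -> the more general property it specialises (illustrative only)
SUB_PROPERTY_OF = {RS + "copyright": DC + "Rights"}

def with_implied(triples, sub_property_of=SUB_PROPERTY_OF):
    """Return the given triples plus those implied by following subPropertyOf upwards."""
    result = set(triples)
    for prop, subject, value in triples:
        seen = {prop}
        parent = sub_property_of.get(prop)
        while parent is not None and parent not in seen:   # guard against cycles
            result.add((parent, subject, value))
            seen.add(parent)
            parent = sub_property_of.get(parent)
    return result
```

Given a single rs:copyright statement about, say, http://example.org/doc.html, the function returns that statement together with a dc:Rights statement carrying the same subject and value, so that software which only understands the Dublin Core can still make use of it.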

[todo: ... also WebDAV? classic hypertext refs? ....]

RDF Sitemaps

Appendix A (below) provides an RDF Schema for the link types (properties) introduced in HTML 4.0. An RDF representation of the Dublin Core is defined elsewhere [http://purl.org/dc/elements/1.0/] [http://www.w3.org/TR/WD-rdf-schema#dublincore].

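The following types of RDF relationship (properties) are defined, based on the HTML link types:

alternate, stylesheet, start, next, prev, contents, index, glossary, copyright, chapter, section, subsection, appendix, help, bookmark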

Note that there is some overlap between the coverage of the Dublin Core and the HTML link types (eg. certain uses of DC:Relation and HTML's 'Alternate' overlap). Although the full set of HTML link types is represented here for completeness, it is likely that only a subset of these will be used in an RDF sitemaps context.

Usage

By defining an RDF representation of the HTML link types, it is possible to create an RDF graph representing the structure of a site either through parsing HTML LINK elements or by using RDF's XML-based serialisation syntax, or some combination of both. For example, the following use of LINK in the HTML document http://www.w3.org/TR/REC-html40/types.html can be mapped into a set of RDF statements. In the following examples we use the abbreviation 'DC' to refer to the Dublin Core RDF vocabulary and 'RS' to refer to the RDF Sitemap vocabulary.

(HTML from the http://www.w3.org/TR/REC-html40/types.html page)

	<LINK rel="previous" href="charset.html">
	<LINK rel="next" href="struct/global.html">
	<LINK rel="contents" href="cover.html#toc">

...is equivalent to the following RDF statements:
<RDF    xmlns="http://w3.org/TR/1999/PR-rdf-syntax-19990105#"
        xmlns:rs="http://purl.org/net/rdf/papers/sitemap#"
        xmlns:dc="http://purl.org/dc/elements/1.0#">

 <Description about="http://www.w3.org/TR/REC-html40/types.html">
        <rs:previous resource="charset.html" />
        <rs:next resource="struct/global.html" />
        <rs:contents resource="cover.html#toc" />
 </Description>
</RDF>
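
The mapping from LINK elements to statements could be automated along these lines. This is a minimal sketch in Python using the standard library's html.parser; the class and function names are hypothetical. It simply lower-cases each rel value and appends it to the sitemap namespace, as the example above does, and leaves href values unresolved.

```python
from html.parser import HTMLParser

SITEMAP_NS = "http://purl.org/net/rdf/papers/sitemap#"

class LinkCollector(HTMLParser):
    """Collect (property, subject, object) triples from LINK elements."""

    def __init__(self, page_uri):
        super().__init__()
        self.page_uri = page_uri    # URI of the document containing the links
        self.triples = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":           # HTMLParser lower-cases tag names
            return
        attributes = dict(attrs)
        rel, href = attributes.get("rel"), attributes.get("href")
        if not rel or not href:
            return
        # A rel attribute may hold several space-separated link types.
        for link_type in rel.lower().split():
            self.triples.append((SITEMAP_NS + link_type, self.page_uri, href))

def links_to_triples(page_uri, html_text):
    """Parse html_text and return one triple per link type on each LINK element."""
    collector = LinkCollector(page_uri)
    collector.feed(html_text)
    collector.close()
    return collector.triples
```

Fed the three LINK elements shown above, together with the URI of 'types.html' as the subject, this yields three triples whose properties are rs:previous, rs:next and rs:contents.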


The relationship information based on the HTML vocabulary might be augmented by additional metadata drawing on Dublin Core properties such as 'Creator' and 'Description'. We can also take advantage of the more sophisticated (and verbose) RDF syntax to express a number of statements about the structure of a website using a single block of RDF.

The following RDF shows how a sitemap might document the relationships that exist between documents alongside properties such as 'Title' for presentation in some form of user interface (eg. graphical map). Here we describe the fact that 'types.html' has a 'previous' document 'charset.html', whose Title and preceding document are also described.

<rdf:RDF    xmlns:rdf="http://w3.org/TR/1999/PR-rdf-syntax-19990105#"
        xmlns:rs="http://purl.org/net/rdf/papers/sitemap#"
        xmlns:dc="http://purl.org/dc/elements/1.0#">

 <rdf:Description rdf:about="http://www.w3.org/TR/REC-html40/types.html">

        <dc:Title>Basic HTML Data Types</dc:Title>

        <rs:previous>
                <rdf:Description rdf:about="charset.html">
                        <dc:Title>HTML Document Representation</dc:Title>
                        <rs:previous rdf:resource="struct/global.html" />
                </rdf:Description>
        </rs:previous>

        <rs:next rdf:resource="struct/global.html" />

        <rs:contents rdf:resource="cover.html#toc" />

 </rdf:Description>
</rdf:RDF>



These statements can be represented in RDF triple form as follows:

triple("http://purl.org/dc/elements/1.0#Title","http://www.w3.org/TR/REC-html40/types.html","Basic HTML Data Types").
triple("http://purl.org/dc/elements/1.0#Title","charset.html","HTML Document Representation").
triple("http://purl.org/net/rdf/papers/sitemap#previous","charset.html","struct/global.html").
triple("http://purl.org/net/rdf/papers/sitemap#previous","http://www.w3.org/TR/REC-html40/types.html","charset.html").
triple("http://purl.org/net/rdf/papers/sitemap#next","http://www.w3.org/TR/REC-html40/types.html","struct/global.html").
triple("http://purl.org/net/rdf/papers/sitemap#contents","http://www.w3.org/TR/REC-html40/types.html","cover.html#toc").

(Note that relative URIs should be resolved before storing in an RDF database.)
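
Resolution of relative URIs can be sketched with the Python standard library's urllib.parse.urljoin. The function name resolve_triple and the object_is_literal flag are hypothetical; the base URI is assumed to be that of the document being described, and literal values such as titles must be left untouched.

```python
from urllib.parse import urljoin

def resolve_triple(triple, base_uri, object_is_literal=False):
    """Resolve a (property, subject, object) triple's relative URIs against base_uri."""
    prop, subject, obj = triple
    subject = urljoin(base_uri, subject)
    if not object_is_literal:        # literals such as titles are not URIs
        obj = urljoin(base_uri, obj)
    return (prop, subject, obj)

BASE = "http://www.w3.org/TR/REC-html40/types.html"

resolved = resolve_triple(
    ("http://purl.org/net/rdf/papers/sitemap#previous", "charset.html", "struct/global.html"),
    BASE,
)
# resolved[1] == "http://www.w3.org/TR/REC-html40/charset.html"
# resolved[2] == "http://www.w3.org/TR/REC-html40/struct/global.html"
```

Subjects and objects that are already absolute pass through urljoin unchanged, so the same function can be applied uniformly to every resource-valued triple.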

This document does not make any recommendations about the way in which sitemaps should be represented in concrete form on a Web site. This will vary depending on context, document format and workflow issues. In many cases, HTML 4.0 alone could be adequate, while others might prefer to manage all information about document relationships in a more centralised manner. The vocabulary proposals made here are equally appropriate in both contexts (see [DublinCore] for guidelines on expressing the Dublin Core within the constraints of the HTML metadata 'META' tags).



[more examples of hierarchy here]

Appendix A: An RDF Schema for the HTML link types

The following RDF describes the HTML link types as a set of RDF properties. The formal identifiers for each property are lower case (eg. 'next'), and are accompanied by language-tagged human readable labels and textual comments abstracted from the HTML 4.0 definitions. A machine-processable version of this RDF is also embedded in the source of this document (which will consequently fail to validate against the HTML DTD).


<RDF 	xml:lang="en"
	xmlns="http://w3.org/TR/1999/PR-rdf-syntax-19990105#"
	xmlns:rdfs="http://www.w3.org/TR/WD-rdf-schema#"	>


<Property 
ID="alternate" 
rdfs:label="Alternate"
rdfs:comment=
"Designates substitute versions for the document in which the link occurs."/>

<Property ID="stylesheet"
rdfs:label="Stylesheet" 
rdfs:comment="Refers to an external style sheet."/>

<Property ID="start"
rdfs:label="Start" 
rdfs:comment=
"Refers to the first document in a collection of documents.
This property tells search engines which document is considered by the
author to be the starting point of the collection."/>

<Property ID="next"
rdfs:label="Next"
rdfs:comment=
"Refers to the next document in a linear sequence of documents."/>

<Property ID="prev"
rdfs:label="Prev" 
rdfs:comment= 
"Refers to the previous document in an ordered series of documents."/>

<Property ID="contents"
rdfs:label="Contents" 
rdfs:comment="Refers to a document serving as a table of contents."/>


<Property ID="index"
rdfs:label="Index" 
rdfs:comment=
"Refers to a document providing an index for the current document."/>

<Property ID="glossary"
rdfs:label="Glossary"
rdfs:comment=
"Refers to a document providing a glossary of terms that pertain to
the current document."/>

<Property ID="copyright"
rdfs:label="Copyright"
rdfs:comment="Refers to a copyright statement for the current document."/>

<Property ID="chapter"
rdfs:label="Chapter"
rdfs:comment="Refers to a document serving as a chapter in a collection of
documents."/>

<Property ID="section"
rdfs:label="Section"
rdfs:comment="Refers to a document serving as a section in a collection of
documents."/>

<Property ID="subsection"
rdfs:label="Subsection"
rdfs:comment="Refers to a document serving as a subsection in a collection of documents."/>

<Property ID="appendix"
rdfs:label="Appendix"
rdfs:comment=
"Refers to a document serving as an appendix in a collection of documents."/>  


<Property ID="help" 
rdfs:label="Help"
rdfs:comment=
"Refers to a document offering help (more information, links to other
sources information, etc.)"/>

<Property ID="bookmark"
rdfs:label="Bookmark"
rdfs:comment="Refers to a bookmark. A bookmark is a link to a key entry point
within an extended document. Note that several bookmarks may be defined in
each document."/>

</RDF>
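
As a rough illustration of how software might consume this schema, the sketch below reads Property elements with Python's xml.etree.ElementTree and builds each property's URI from its ID. The function name schema_properties is hypothetical, and forming the URI as base URI plus '#' plus ID is an assumption, chosen to be consistent with the property URI given earlier for 'prev'.

```python
import xml.etree.ElementTree as ET

# Namespace URIs exactly as written in the schema above.
RDF_NS = "http://w3.org/TR/1999/PR-rdf-syntax-19990105#"
RDFS_NS = "http://www.w3.org/TR/WD-rdf-schema#"

def schema_properties(rdf_xml, base_uri):
    """Return {property URI: label} for each Property element in rdf_xml."""
    root = ET.fromstring(rdf_xml)
    found = {}
    for prop in root.findall("{%s}Property" % RDF_NS):
        uri = base_uri + "#" + prop.get("ID")               # ID is an unqualified attribute
        found[uri] = prop.get("{%s}label" % RDFS_NS)        # rdfs:label is namespace-qualified
    return found
```

Applied to the schema above with the base URI 'http://purl.org/net/rdf/papers/sitemap', this would give one entry per link type, each mapping the property's URI to its human readable label.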

References

[...later...]
this page maintained by: Dan Brickley