ILRT Research Report Number: 1017
Publication Date: 2003/06/04
Last Modified : 2003/06/11 15:10
Author: Dave Beckett
This report reviews the process that the RDF Core Working Group undertook in revising the transfer syntax for RDF defined in the RDF Model and Syntax W3C Recommendation, and the problems that are now clear, especially when the syntax is compared with the revised RDF model and new abstract syntax. The syntax looks out of date, in particular in its use of XML QNames, which gives unconstrained syntax terms in the XML and causes problems with newer XML technology such as XSLT, DTDs, W3C XML Schema and other XML-constraining languages. In order to deliver a modern RDF syntax, this report reviews the requirements for RDF in two aspects: as a canonical transfer syntax and as a syntax for end users, targeted at HTML. It evaluates previous RDF syntax proposals against these requirements and analyses the pros and cons of XML and non-XML syntaxes. The conclusion is a summary of syntax approaches for future standardisation activity.
This report reviews the process of revising the transfer syntax for RDF as defined in the RDF Model and Syntax W3C Recommendation (M&S) of February 1999. This syntax was designed by the RDF working group to meet a variety of goals, including being embeddable in HTML (not XHTML) in order to describe web pages, having a frame-style syntax, and using XML QNames to shorten the long URIs that RDF uses for its terms. The XML Namespace specification was developed in parallel with RDF, and RDF was one of the first W3C specifications to use it. Figure 1 shows some RDF/XML that captures the sentence ``Ora Lassila is the creator of the resource http://www.w3.org/Home/Lassila''.
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:s="http://description.org/schema/">
  <rdf:Description about="http://www.w3.org/Home/Lassila">
    <s:Creator>Ora Lassila</s:Creator>
  </rdf:Description>
</rdf:RDF>
The rdf:RDF XML element encloses the scope of the RDF/XML. The inner rdf:Description element is the ``frame-style'' block of properties, all about the resource with URI http://www.w3.org/Home/Lassila. Here the element s:Creator represents the property with the value ``Ora Lassila''. This element encodes the URI reference formed by concatenating the namespace name (URI) for ``s'', which in this case is http://description.org/schema/, with the local name of the element (Creator).
When a property has a URI value, an attribute is used on the empty property element with the URI as the attribute value. A property value can also have an XML language, given with an xml:lang attribute, and can have XML content when the parseType="Literal" attribute is used on the property element.
In order to allow embedding in HTML such that simple web clients could ignore it, the RDF/XML had to have no visible XML element content. The syntax allowed this through several abbreviations, including writing properties with literal content as XML attributes in what was called the Basic Abbreviated Syntax form.
There were several other abbreviations, both to make the resulting RDF/XML more compact and to allow the omission of description blocks. Several common RDF vocabulary terms had special support, such as the rdf:type property and the reification vocabulary. RDF containers (ordered, unordered or alternative lists of resources) had an abbreviation that provided easy generation of the container membership properties.
Above the statement level there were generators for distributed description in three ways. The aboutEach and aboutEachPrefix attributes allowed statements to be made about multiple resources in a container (the former) or about all resources with a URI of a certain prefix (the latter). The bagID attribute allowed descriptions of the collection of statements given in one of the frame-style descriptions using RDF reification.
The RDF/XML syntax was defined by an extended BNF in a formal grammar along with descriptive text in several sections of the document. The use of namespaced elements and attributes meant that using a DTD to define it was not possible and this was before modern XML schema language standardisation work was started so there was no W3C XML Schema, Relax or Relax NG etc. available.
The RDF specification and syntax were picked up by several groups working with metadata on the web, and the RDF/XML syntax was used to transfer their RDF content. These users included the Dublin Core community, DAML+OIL, CC/PP, PRISM, RSS 1.0 and Adobe's XMP. These groups, along with other implementors, gave feedback to the W3C via the RDF comments and interest group lists on issues for RDF and RDF Schema that needed answering. The W3C Semantic Web Activity began in February 2001 and started a new working group for RDF, the RDF Core working group, to deal with these.
RDF Core's charter was to:
The single specification (of RDF, excluding RDF Schema) was found to have problems with mixing the model, syntax and semantics. A clear split was required and RDF Core dealt with this by creating an RDF Concepts and Abstract Syntax document, where the RDF graph was defined with no concrete syntax, and an RDF Semantics document that used it to define the model theoretic semantics of RDF. The remaining introductory and explanatory material was covered by the RDF Primer, and the RDF/XML syntax material by a separate RDF/XML Syntax Specification (Revised) document. The development of the latter document is the primary subject of this report.
The original RDF formal model in M&S gave three different descriptions of RDF: as 3-tuples (triples), as a graph of directed labelled arcs and as 2-ary (binary) predicates. The XML syntax hinted at a fourth frame/slots-style formulation. RDF Core adopted the expression of the RDF abstract syntax in terms of sets of triples and used that as the single method to connect the separate documents in revising and clarifying RDF. The model was also clarified in certain key places; M&S allowed nodes to appear in the graph without URIs, but this was never discussed in detail. The updated specifications named them ``blank nodes'' and allowed syntax-specific identifiers to be used in order to preserve their identity in serialisations.
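The set-of-triples view can be sketched in a few lines of Python. This is only an illustration of the idea, not part of any RDF Core specification: the class names, the make_graph helper and the s:Name property are hypothetical, and the blank node label is a serialisation-local identifier as described above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class URIRef:
    value: str


@dataclass(frozen=True)
class BlankNode:
    label: str  # syntax-specific identifier, only meaningful within one serialisation


@dataclass(frozen=True)
class Literal:
    value: str
    language: Optional[str] = None


def make_graph(triples):
    """Model an RDF graph as a plain set of (subject, predicate, object) tuples."""
    return frozenset(triples)


page = URIRef("http://www.w3.org/Home/Lassila")
creator = URIRef("http://description.org/schema/Creator")
name = URIRef("http://description.org/schema/Name")  # hypothetical property
person = BlankNode("a")

graph = make_graph([
    (page, creator, person),
    (person, name, Literal("Ora Lassila")),
    (page, creator, person),  # repeated triple: a set keeps only one copy
])
```

Because the graph is a set, writing the same triple twice changes nothing, which is one reason the triple form is convenient as a common reference point between documents.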
The specification of RDF/XML in Section 6 of M&S via an extended BNF grammar had several problems:
Comments on the RDF documents were recorded on the RDF Issue List as well as taken from discussion on the RDF interest group and other lists. The main issues from these fora are as follows:
The XML specification has changed since 1.0 in 1998 with XML (Second Edition) (2000) and there has been further work in building upon and formalising the XML core document. There are many XML technologies that have been developed; the main ones related to core XML that have widespread development are XML Base and XML Infoset. XML Base allows a document to specify a URI for resolving relative URIs against. XML Infoset, which itself uses XML Base, defines one way of expressing XML without depending on certain XML details. Both of these have been used to define other important specifications such as W3C XML Schema, XPath and XQuery. The XML and XML Namespaces specifications are also being modified to version 1.1 for enhanced internationalisation (I18N) support by using newer Unicode versions and updated whitespace rules. The Character Model for the World Wide Web explains current best practice in how web specifications should provide interoperable text manipulation. The W3C Technical Architecture Group (TAG) work on key web technology such as URIs, resources and namespaces also touched on many issues that RDF and RDF/XML depended on. The existing RDF/XML definition needed to be updated in light of these considerations.
The new RDF/XML definition was required to be as precise as possible and to provide a complete and deterministic mapping from XML to the RDF graph, such that it was clear that an isomorphic graph was always generated from the given XML. As the group decided issues on the mapping, these were codified as test cases written in a form suitable for checking by software (discussed below).
In order to abstract away from the XML detail, the revised RDF/XML document was specified in terms of XML Infoset Information Items (Infoitems), which provide the required formality and support XML Base and Namespaces above the core XML. Infoitems provide a model of the XML still in terms of elements, attributes and character data. This is not totally appropriate for defining the mapping from RDF/XML to the RDF graph, which deals with URI-references rather than XML QNames. An intermediate syntax data model form was designed that mapped one-to-one from the Infoitems to syntax data model Events (the name implies no time-based processing; it was used in order to distinguish it from Infoitems, RDF triple objects and RDF graph nodes).
The syntax data model also defined new Events for terms that would be used in the mapping to the RDF Concepts such as for URI references, blank nodes, RDF literals (with language) and Typed Literals (with datatype).
The transformation to syntax data model events turned the QNames in the Infoitems into RDF URI References in the Events, and turned attribute values that were used as URI strings into RDF URI References relative to the current base URI. It also dealt with some corner cases, such as allowing both the non-namespaced about attribute and the namespaced rdf:about (transforming the former to the latter), as well as turning xml:lang on XML literals into language-coded RDF Literals.
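The two transformations described here, QName to URI reference and relative URI string to absolute URI reference, can be sketched as follows. This is a minimal illustration and not the specification's algorithm: the function names are hypothetical, the base URI http://www.w3.org/Home/ is assumed for the example, and Python's urljoin is used as an approximation of the URI resolution rules.

```python
from urllib.parse import urljoin


def expand_qname(qname, namespaces):
    """Concatenate the namespace name bound to the prefix with the local name."""
    prefix, _, local = qname.partition(":")
    return namespaces[prefix] + local


def resolve(uri_string, base_uri):
    """Resolve a possibly relative URI string against the in-scope base URI."""
    return urljoin(base_uri, uri_string)


namespaces = {"s": "http://description.org/schema/"}

property_uri = expand_qname("s:Creator", namespaces)
subject_uri = resolve("Lassila", "http://www.w3.org/Home/")  # assumed base URI
```

After this step the rest of the mapping never sees a QName or a relative URI, only full URI references.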
In order to write down the mapping in test cases, the input form was RDF/XML; however, the output form had to match the RDF graph very precisely, in triples, and it was clear that RDF/XML did not match the triples abstract syntax closely.
Therefore RDF Core defined a text-based, simple and canonical RDF test case language, N-Triples, in the RDF Test Cases working draft. This format allowed RDF graphs to be written down clearly and was used not just in RDF/XML to RDF graph mappings but also in other graph-to-graph entailment tests in the RDF Semantics specification. N-Triples also enabled discussions of the abstract syntax separate from XML issues. This format was not intended as a new user syntax (discussed further in Section 8).
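To give a feel for how little machinery such a format needs, the following sketch writes one triple per line in an N-Triples-like form. It is an illustration only: the (kind, value) term representation and function names are hypothetical, and the escaping rules shown (backslash escapes, with characters outside printable US-ASCII written as \uXXXX or \UXXXXXXXX) are an assumption of this sketch that should be checked against the RDF Test Cases document before use.

```python
def escape(text):
    """Escape a literal's text for an ASCII-only, one-triple-per-line format."""
    out = []
    for ch in text:
        code = ord(ch)
        if ch == "\\":
            out.append("\\\\")
        elif ch == '"':
            out.append('\\"')
        elif ch == "\n":
            out.append("\\n")
        elif ch == "\r":
            out.append("\\r")
        elif ch == "\t":
            out.append("\\t")
        elif code < 0x20 or code > 0x7E:
            # Assumed escape form for anything outside printable US-ASCII.
            out.append("\\u%04X" % code if code <= 0xFFFF else "\\U%08X" % code)
        else:
            out.append(ch)
    return "".join(out)


def term(kind, value):
    """Write one triple part; kind is 'uri', 'bnode' or 'literal'."""
    if kind == "uri":
        return "<%s>" % value
    if kind == "bnode":
        return "_:%s" % value
    if kind == "literal":
        return '"%s"' % escape(value)
    raise ValueError("unknown term kind: %r" % kind)


def ntriples_line(subject, predicate, obj):
    return "%s %s %s ." % (term(*subject), term(*predicate), term(*obj))


line = ntriples_line(
    ("uri", "http://www.w3.org/Home/Lassila"),
    ("uri", "http://description.org/schema/Creator"),
    ("literal", "Ora Lassila"),
)
```

There are no abbreviations and no nesting, so each line can be compared directly with one triple of the graph.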
The new working draft defined the mapping from RDF/XML to the abstract RDF Graph (using N-Triples to write down that abstraction) along with test cases to pin down the issue resolution precisely in testable form. A user-friendly explanation of the syntax was added to the working draft to explain all the abbreviated forms, beginning with an introduction of the property / node element ``striping'' so that this alternation was more explicitly pointed out to readers.
RDF Core considered some particularly problematic parts of the syntax and, after due consideration and consultation with the community, removed some from the language. The items that were deleted were the distributed referents aboutEach and aboutEachPrefix. These were felt to operate at the wrong level, had complex interactions, were sporadically implemented and were little used. Removing them made the syntax data model more clearly about mapping to triples rather than dealing with these somewhat higher-level constructs.
Updating and formalising the RDF model as well as resolving issues caused additional syntax to be added. The additions were:
rdf:nodeID attribute to specify a blank node identifier as the subject or object of a triple instead of a URI (fixing problem 11 in Section 3.2).
rdf:datatype attribute to give the URI of the datatype of a typed literal (which could be a W3C XML Schema datatype).
rdf:parseType="Collection" attribute for specifying a closed collection of nodes as the object of a property. This was required and used by the Web Ontology Language (OWL), which started to be developed by the Web Ontology Working Group (WebOnt) in August 2001.
The design of the last in terms of triples was copied from the DAML+OIL syntax extension over the original RDF/XML and provides a lisp-style list of resources that could be more easily reasoned over than the RDF container style approach. This design from DAML+OIL was not able to describe a collection of datatyped literals which was another problem raised during the revising.
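The lisp-style list structure can be sketched as the triples a parser might generate for a collection. This is an illustrative sketch rather than the specification's algorithm: the function name and the blank node labels (_:c0, _:c1, ...) are hypothetical, while rdf:first, rdf:rest and rdf:nil are the terms from the RDF namespace.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def collection_triples(items):
    """Return (head, triples) for a closed list of node identifiers.

    Each list cell is a blank node with an rdf:first arc to the item and an
    rdf:rest arc to the next cell; the last cell's rdf:rest is rdf:nil.
    """
    if not items:
        return RDF + "nil", []
    cells = ["_:c%d" % i for i in range(len(items))]
    triples = []
    for i, item in enumerate(items):
        rest = cells[i + 1] if i + 1 < len(items) else RDF + "nil"
        triples.append((cells[i], RDF + "first", item))
        triples.append((cells[i], RDF + "rest", rest))
    return cells[0], triples


head, triples = collection_triples([
    "http://example.org/a",
    "http://example.org/b",
])
```

Two items already need four triples, and the terminating rdf:nil is what makes the collection closed, in contrast to the open-ended container membership properties.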
The charter restriction against making a new syntax limited what the group could do in terms of changing or paring down the syntax beyond what was strictly required to fix problems or reply to issues. Work that was ruled out included creating an XML syntax that closely mapped the abstract syntax and could represent all legal RDF graphs, deleting some of the many abbreviation forms, and updating to use newly available XML technologies, such as defining the syntax in terms of an XML schema (not possible because the syntax terms are unconstrained). This suggests requirements on a future syntax: that it can represent all RDF graphs, and that it provides a canonical RDF graph serialisation. Here canonical means no abbreviated or alternative forms for individual triples, not whole-graph canonicalisation.
XML best practice and style have also evolved since RDF/XML was created, when XML was one year old and XML Namespaces was one month old, so the design may no longer seem modern.
Subsequent to the RDF recommendation in 1999, there has been
some consideration of embedding RDF in other XML formats. The
Scalable Vector Graphics (SVG) recommendation contains a
metadata tag that can contain any metadata, with RDF given
as the example.
The main RDF embedding issue that arises is for HTML, and in
particular with XHTML. Embedding in either format without
validation is straightforward but validation of HTML is generally
seen as the current best practice. Adding new elements to XHTML is
expected to be done by using XHTML Modularization and validating
via DTDs or W3C XML Schema. Adding an unconstrained RDF/XML syntax
this way is rather tricky for a particular subset of RDF/XML or
impossible for arbitrary RDF/XML. RDF Core recommended not
embedding but using a
<link> tag in the
<head> of the document to give the URI of what could
potentially be another RDF/XML document.
Another approach that has been tried is embedding RDF/XML in HTML comments, as used by the Creative Commons (CC) metadata and Movable Type's TrackBack RDF approaches. It is not clear if this is portable, since the content of HTML or XML comments is not generally seen as being in the document content and cannot be distinguished from other commented material. However, the existing applications of these do not use HTML or even XML-level techniques to find the embedded RDF but rely on regular expression matches on the surrounding HTML.
There are two general classes of syntax that have been found needed from the existing development of RDF/XML and discussion with other communities:
These have such different targets that they may not be met by a single syntax since the former tends to suggest minimal use of user-friendly forms and the latter would have ``syntactic sugar'' to enable both common and complex RDF triple structures to be written concisely. A single syntax may work poorly at both jobs and remain inappropriate for both - not much of an improvement over the current state. It is not clear that there is even one good end user syntax, rather than many for different communities.
The requirements for a future syntax come from the problem reports on the existing syntax, experience from issues that emerged during the revision of RDF/XML, comments on the new syntax working drafts and also recorded issues on RDF Core's postponed issue list. These were mostly postponed because they were out of scope of the group's charter. The following sections contain the requirements grouped into approximate categories.
These requirements come from the lessons learnt from the current syntax and feedback and must be satisfied. The reported problems enumerated in Section 3.2 are given where associated with a requirement. Any new RDF syntax:
These came as advice from the XML community and W3C XML working groups on how to modernise the XML to current best practice and make it easier to work with using other XML technologies and tools. An RDF syntax expressed in XML should:
xsi:type. (Problem 7)
Any RDF syntax intended for hand-production by end users should provide:
These are not immediate requirements, but the lack of an easy way to make these modifications to RDF/XML discouraged RDF Core from making changes such as these to the RDF model. A syntax for an extended RDF model would provide support for:
The parts of RDF/XML that made embedding in non-validated HTML possible are also those that make up the excessive number of alternate forms (for example, all property attributes of RDF/XML could be removed and the syntax would be able to represent all the same graphs as at present). This means that a design for embedding in this way would clash with a minimal design. However, in this case, a design for embedding in XHTML would require DTD or W3C XML Schema validation via XHTML Modularization, so the same approach as RDF/XML would not be possible. These problems are discussed in more detail in Section 8.
There have been several proposals for new syntaxes for RDF, aimed as canonical transfer syntaxes, end-user syntaxes or a combination of the two. These have included proposals to add functionality to or remove it from RDF/XML or HTML to make embedding RDF more convenient, entirely new XML syntaxes, the use of existing XML technologies to define a transfer encoding, and also non-XML proposals aimed at making things easier to write. DTD-based approaches using the existing RDF/XML have been possible, but only when the terms in the application are limited in scope.
It is clear that RDF/XML has already too many options in the
ways to encode RDF graphs (although some people have proposed
more). So a true subset of RDF/XML could be used as a recommended
form. This is the approach used by Adobe's XMP which encodes a profile of
RDF/XML inside several formats (TIFF, JPG, PNG, HTML, PDF and
others) to describe the content. Seven items were removed or
changed from RDF/XML: rdf:RDF was made required and several forms,
including rdf:parseType="Literal", were forbidden. This smaller
profile has been called ``RDF/XML-7'' and has been successfully
deployed inside several Adobe file formats. Berners-Lee considered
another subset of RDF/XML but
without the node/property element striping. This led to a rather
complex set of additions in order to declare the current subject of
XML has a linking technology, XLink, and a way to point to parts of XML documents (XPointer) that could be used to encode a graph similar to RDF. This was recognised early on in the design of these technologies: whereas RDF has links built in, XML has linking added outside the core. Daniel described a mapping from XLinked-XML to RDF. More recently, as part of its ongoing work, the W3C TAG has been considering the kind of document that might live behind an XML namespace URI. This document could potentially link to several other resources such as style sheets, schemas and RDF descriptions. The current best proposal, RDDL by Borden and Bray, is based on XLinks inside HTML. It provides a mapping to RDF/XML using XSLT; however, it is not a general approach for RDF, being targeted at the simpler problem of a catalogue of resources.
Berners-Lee's Notation 3 (N3) (2000-) is ``an academic exercise in language designed for a human-readable and scribblable[sic] language''. The N3 language and its primary implementation CWM describe a research language that includes functionality outside the RDF model. The syntax defines a text format using a BNF-like grammar that uses a lot of punctuation to abbreviate the RDF. Each RDF triple can be given as a set of three terms explicitly, or abbreviated in a variety of forms using a mechanism that operates like XML QNames in RDF/XML. Declarations are allowed starting with @, such as @prefix to attach a namespace URI to a short prefix. This is similar to how Cascading Style Sheets (CSS) escapes from its text-based grammar to add higher level directives.
RDF Core designed N-Triples (described in RDF Test Cases) as a true subset of N3, with no abbreviated forms allowed. This restriction and simplicity meant that existing N3 tools could read N-Triples documents and being a simple format to understand, it was quick to read, write and implement in dealing with test case descriptions. The advantages and disadvantages of non-XML formats are discussed below in Section 8.
A more recent proposal for a new XML format was Bray's RPV, ``designed to be entirely unambiguous and highly human-readable''. It took a strong resource-centred approach, describing a particular resource with the properties and values parts of the RDF triple very clearly written, using a small number of elements and attributes. It was restricted in the triples that could be written in the graph, for example providing no blank node or datatyped literal support, and it invented a new base URI mechanism, parallel to XML Base but applying to individual triple parts.
As already introduced, N-Triples and N3 are existing RDF syntaxes that have been deployed successfully as a test case language and a format that is very compact and powerful. Designing a new syntax and not using XML has costs as well as benefits in terms of perceived simplicity that need to be drawn out. XML is generally required by W3C policy for serialisations except where it is excruciatingly painful.
A text format will typically have a MIME type of text/something, such as text/plain. If it is sent without an encoding, the receiving software is required to treat it as US-ASCII. This means text formats lose one of XML's big wins: built-in Unicode. The CSS language is one widely used text web format which had to solve this flaw, and in CSS2 it gained an @charset directive to allow the charset to be specified. A similar directive could be added to N-Triples or N3. However, N3 was changed recently from being a US-ASCII format to being UTF-8 encoded so that some native encoding of characters is possible, albeit with a restriction to what might be a non-preferred encoding.
Although a text based format might be superficially seen as easy to read and write, it does mean writing new tools that deal with the lexical analysis, grammar (and if used, Unicode decoding and encoding). These are the aspects that are implemented and made available by many well-tested XML tools and APIs that abstract from the detail and can be assumed to be available.
However, these formats do give (in the least abbreviated form, N-Triples) a very clear description of the RDF triples and can make the long URIs disappear from user view, when the XML QName-style abbreviations are used. (Both RDF Core and WebOnt use N-Triples with QName-style abbreviations in their documents). This gives the advantages of improved clarity and reduced verbosity that aid comprehension.
New syntaxes written in XML also have a cost, in terms of choosing which XML abstraction to build upon. The revised RDF/XML syntax used the XML Infoset, which is the (direct) basis of W3C XML Schema's PSVI and others. Earlier XML technology was designed on SGML, DTDs and DOM, and more recently there are new data models such as the ongoing design of the XQuery 1.0 and XPath 2.0 Data Model.
The SOAP Encoding (Section 3 of its specification) allows the encoding of directed labelled graphs, although it is not yet clear if all RDF graphs could be transferred via this method (apart from embedding RDF/XML in a naive form). In particular it may be that there is no way to encode blank nodes or RDF datatypes; whether this is possible is still an ongoing research issue.
A new syntax should be closely based on the RDF graph via the terminology in RDF Concepts and Abstract Syntax so that it is complete, and also take into account the requirements given earlier (Section 6). In particular the critical requirements (Section 6.1) will be met if it closely aligns with the abstract syntax.
A new XML syntax that looks like the abstract syntax will tend to seem like an XML-ized version of N-Triples, if it is minimal. This is sufficient but does not meet the additional XML requirements (Section 6.2) that suggest using some more modern XML design ideas, e.g. QNames. At present RDF/XML uses QNames only as the element and attribute names; however, newer XML work such as W3C XML Schema uses and allows them as attribute values to identify concepts that are identified by a (namespace name, local name) pair. RDF does not use such identifiers, so QNames could only be made to define or refer to URI-references, blank node identifiers or literals. This suggests continuing the RDF/XML approach of concatenating the (namespace name, local name) to give a URI. However, QNames used in this fashion cannot encode all URI-references, so they cannot be the sole way to encode identifiers for RDF graphs, and thus there must be a way to give any URI. This tends to suggest having both QName-style and longer URI-style approaches. However, allowing QNames in element content (or attribute values) causes problems such as invisibility from XML processors, XML Namespace scoping and XML Canonicalization. Mixing QNames with URIs in similar fields can cause interoperability problems since the syntax of both is very similar: ex:prop is both a syntactically legal QName and a syntactically legal URI with URI scheme ex.
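Both difficulties in this paragraph can be demonstrated with a short sketch. It is illustrative only: the function names are hypothetical, the example.org URIs are invented for the example, and the NCNAME pattern is a simplified ASCII-only approximation of the XML name rules, not the full production.

```python
import re
from urllib.parse import urlsplit

# Simplified, ASCII-only approximation of an XML NCName (an assumption of
# this sketch; the real production allows many more characters).
NCNAME = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*\Z")


def looks_like_qname(text):
    """True if text has the shape prefix:local with both parts name-like."""
    prefix, sep, local = text.partition(":")
    return bool(sep and NCNAME.match(prefix) and NCNAME.match(local))


def uri_scheme(text):
    """Return the URI scheme a URI parser would see in the same string."""
    return urlsplit(text).scheme


def split_for_qname(uri):
    """Split a URI into (namespace name, local name), or None if impossible.

    The local name must be a name-like suffix, so a URI ending in, say,
    digits after the last '/' has no usable split.
    """
    for i in range(1, len(uri)):
        if NCNAME.match(uri[i:]):
            return uri[:i], uri[i:]
    return None


ambiguous = "ex:prop"
splittable = split_for_qname("http://description.org/schema/Creator")
unsplittable = split_for_qname("http://example.org/item/123")
```

The string ex:prop passes both tests, so a field that accepts either form cannot tell which was meant, and the second URI shows why a QName-only syntax cannot name every resource.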
XML entities are another alternative for abbreviating URIs into shorter forms, but they are tied very closely to DTDs and cannot be validated with the current W3C XML Schema.
In terms of minimising the vocabulary used for an XML syntax, this means that the elements and attributes must be fixed, with the varying parts of the triples either in element or attribute content (CDATA, or defined by another W3C XML Schema datatype). Given the requirement to encode all RDF, this means that the distinction between URIs, blank nodes and literals needs to be made by additional elements or attributes. The additional element for each part of the triple will tend to give a very verbose appearance as shown in Figure 9, although the <literal> element could be omitted here with the loss of regularity.
<triple>
  <node><uri>http://www.w3.org/Home/Lassila</uri></node>
  <node><qname>s:Creator</qname></node>
  <node><literal>Ora Lassila</literal></node>
</triple>
This syntax does not seem very ``modern'' although it is minimal in its use of elements. In addition, it might be better to replace the node element with subject, predicate and object, in particular to enforce current RDF model requirements on where URIs, blank nodes and literals can be used.
The main alternative to an all-element approach is to use XML attributes to indicate the triple part such as that shown in Figure 9.
<triple>
  <subject uri="http://www.w3.org/Home/Lassila" />
  <predicate ref="s:Creator" />
  <object>Ora Lassila</object>
</triple>
This looks more similar to the kind of syntax seen in W3C XML Schema, although the attribute names might be slightly different. This is where introducing the xsi:type attribute, commonly used to indicate that the content is a W3C XML Schema datatype, would fit in well. QNames, URIs and blank nodes would all be needed, which requires both defining and referring attributes for all of them.
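As a rough check that the attribute-style outline is easy to process with standard XML tools, the following sketch reads one such triple using Python's xml.etree.ElementTree. It assumes the element and attribute names of the outline above (triple, subject, predicate, object, uri, ref), which are strawman names and not a defined syntax; the parse_triple function and the (kind, value) result form are hypothetical.

```python
import xml.etree.ElementTree as ET

FIGURE = """
<triple>
  <subject uri="http://www.w3.org/Home/Lassila" />
  <predicate ref="s:Creator" />
  <object>Ora Lassila</object>
</triple>
"""


def parse_triple(xml_text, namespaces):
    """Turn one strawman <triple> element into three (kind, value) pairs."""
    triple = ET.fromstring(xml_text.strip())

    def term(element):
        if "uri" in element.attrib:
            return ("uri", element.attrib["uri"])
        if "ref" in element.attrib:
            # The QName sits in an attribute value, so the XML parser does
            # not expand it; the prefix map has to be supplied separately.
            prefix, _, local = element.attrib["ref"].partition(":")
            return ("uri", namespaces[prefix] + local)
        return ("literal", element.text or "")

    return tuple(term(triple.find(name))
                 for name in ("subject", "predicate", "object"))


parsed = parse_triple(FIGURE, {"s": "http://description.org/schema/"})
```

Note that the prefix map is passed in by hand: this is the ``invisibility from XML processors'' problem with QNames in attribute values mentioned earlier.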
The main syntax shortcuts that are very common and could be added are the rdf:type property and the collection and container forms. An additional type element could replace the predicate and object elements with a single <type ref="a:Class" />, although that would remove the clear triple view. Containers and collections are patterns that respectively generate properties or more complex sets of nodes. These might benefit from support, particularly the latter, which is very long to write out longhand, so an additional collection element with contained nodes could be added in a form something like that shown in Figure 9 (also using an xsi:type attribute).
<collection>
  <node uri="http://example.org/resource" />
  <node ref="ex:anotherResource" />
  <node xsi:type="xsd:decimal">10</node>
</collection>
A new text-based syntax should probably be something very similar to the above outline XML design, with influence from N-Triples and N3 given that they have been found relatively easy to explain (at least in the most regular triple form). The latter has an excess of punctuation for either a minimal or user-friendly language so would have to be cut down dramatically, but the most commonly used ideas given above have analogues in N3 (QNames, prefixes, datatypes, collections). As already discussed, careful updates for internationalisation support, such as declaring the charset and enabling the use of local characters in URIs and literals, would have to be designed, enabling as much flexibility as in XML.
If such a text syntax and an XML one were being designed, it would be a great benefit if they were of a similar level of complexity and, preferably, provided as far as possible equivalent mappings to the same model. This has been successfully achieved with the RELAX NG XML schema language and its text equivalent, RELAX NG Compact.
The other main requirement of embedding in XHTML would clearly
be most suitable for an XML syntax. In XHTML, given that the above
outline uses element content, it would probably not work in the
simplest of web clients that ignored the XML detail (however modern
XML browsers will ignore
<head> content). As
previously mentioned, XHTML Modularization would be required for
such embedded content, which also requires the application to
perform schema validation of either sort, unusual for a web client.
A recent strawman proposal from the XHTML working group, also being
tracked by the W3C TAG was to use the existing HTML
<meta> tag and give it an overloaded definition
suitable for directly transporting RDF content. In order to keep
validation, this would be in a future version of XHTML and not of
The revision of the RDF/XML syntax into a more precise specification, based strongly on XML technology, has allowed existing and new implementors to better update and write their applications. In the new form, applications can be machine checked for correctness directly against the test cases in the specification. This has allowed several new and verifiably complete and conformant implementations of the RDF/XML to RDF graph mapping to be available before the revision work was fully completed.
The RDF/XML syntax as revised is a more workable and durable format for transferring and writing RDF. It is now both more complete, handling more of the RDF model through some critical additions, and cleaner, after removing what turned out to be the underused and most unclear parts.
There were several issues that existed before revising RDF/XML and there are some that still remain. This report has discussed future work on new XML and textual syntaxes, and the approaches and compromises that addressing some of them would require. It has shown that it is not trivial to make a clearly better syntax, that one syntax will not suit all purposes, and that there are both benefits and costs in pursuing multiple ways to write the same thing, especially when they are written with different audiences in mind.
Thanks to Bijan Parsia and Jan Grant for their encouragement and support.
This report describes work done under the Semantic Web Advanced Development Europe (SWAD-Europe) project <http://www.w3.org/2001/sw/Europe/>, funded by the EU IST-7 programme IST-2001-34732.