Author: Libby Miller, ILRT
Last updated: 1999-10-01
Latest version: http://ilrt.org/discovery/2000/09/rudolf/recommender.html
This paper describes an early implementation of an XML/RDF-based metadata service environment, focussing on the convergence of cataloguing, bookmarking, recommendation and filtering applications facilitated by XML and RDF. The recommendation system stores user recommendations and annotation data in RDF, using the RDF syntax and the Dublin Core vocabulary, and is itself configured using RDF/XML. The display of the recommendations is left open, although they may be presented with an RDF display tool, which opens up the possibility of aggregating diverse recommendations, remote or local. The paper then discusses several different models of metadata aggregation with respect to the incentives contributors have to create high-quality metadata, and the evaluation strategies of the aggregation sites.
Keywords: RDF; annotations; recommender; aggregation; metadata
We all have metadata about the web stored in bookmarks in our browsers. This paper describes a simple way in which the bookmarking of webpages could be changed from being a private to a public event, allowing the construction of rich databases of metadata about the web in RDF. Section 2 of the paper serves as a brief introduction to aggregating RDF. Section 3 describes a very simple recommender system, which uses RDF to store recommendations data. Here we show that it is a simple matter to aggregate recommendations from different sources, including from RDF serialized in XML. Section 4 outlines the advantages of combining metadata from small distributed groups, and compares this model of metadata aggregation with the editorial model used by the Open Directory and similar projects and with the evaluative model of the collaborative filtering community. Section 5 concludes.
The Resource Description Framework (RDF) is a W3C Recommendation for data interchange and modeling amongst metadata applications on the Web. RDF provides a simple data model based around the notion of directed labeled graphs. Each 'graph' of RDF information consists of a set of nodes (representing Web 'resources') and arcs, representing properties (i.e. relationships and attributes). An RDF statement (or 'triple') consists of a resource, a property, and a value, which may be either another resource or a simple textual value. An example is: http://www.bized.ac.uk has the title "Biz/ed".
In RDF each element of a triple has an absolute identifier. RDF therefore uses shared vocabularies to describe the triples, one of which is the Dublin Core vocabulary. Using Dublin Core, the above triple would become:
RDF has powerful implications for the aggregation of data for two reasons. Firstly, because it uses unique identifiers for resources, RDF has the property that graphs can be overlaid on each other so that data in one RDF database can seamlessly augment the data in another. An example might be if one triple was this:
and another was this:
a query for all the Dublin Core descriptions of http://www.bized.ac.uk would produce both literals:
"good for business resources"
"good for economics resources"
This means that the contents of an RDF database can be augmented dynamically as more information arrives about a resource. This is only possible because of the unique identifiers for the resource (http://www.bized.ac.uk) and the property (http://purl.org/dc/elements/1.0/Description). Only because [http://www.bized.ac.uk] is the globally unique identifier of a resource can we be sure that both comments refer to the same resource. Similarly, only because the arc property is also unique can we be certain (if we pay attention to the vocabulary) what aspect or property of the resource [http://www.bized.ac.uk] it refers to.
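The overlay behaviour described here is easy to demonstrate. The following sketch (Python, illustrative only and not part of the system described in this paper) models each RDF graph as a set of triples, so that aggregation is simply set union:

```python
# Illustrative sketch: RDF graphs modeled as sets of
# (subject, predicate, object) triples; overlaying graphs is set union.
DC_DESC = "http://purl.org/dc/elements/1.0/Description"

graph_a = {("http://www.bized.ac.uk", DC_DESC, "good for business resources")}
graph_b = {("http://www.bized.ac.uk", DC_DESC, "good for economics resources")}

merged = graph_a | graph_b  # aggregating the two databases

# Query: every Dublin Core description of the resource.
descriptions = sorted(o for (s, p, o) in merged
                      if s == "http://www.bized.ac.uk" and p == DC_DESC)
print(descriptions)
# ['good for business resources', 'good for economics resources']
```

Because both the resource and the property are globally unique identifiers, the union cannot conflate statements about different things, which is precisely the point made above.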
The second implication for the aggregation of data arises because RDF data is machine processable, can be output by machine, and is also serializable in XML. Because of this, RDF data from remote and local sources can be easily and seamlessly integrated and displayed.
RDF therefore provides a way of aggregating diverse data about uniquely identifiable things. One obvious application of this is for the aggregation of metadata about webpages, although it is by no means limited to this use. One major advantage of using this system is that metadata from diverse sources could easily be integrated. So for example data from different recommendation systems or from very different places could be joined together. The Open Directory, for example, can be accessed as XML/RDF. RSS channels can be integrated, and are easy to build mechanically. Search results can also be built into this mechanism. This aspect of integrating diverse sources of information using RDF is used by the Aurora project in Mozilla.
The next section describes how we have implemented a simple recommender system using RDF.
There are many annotation and shared bookmark sites available, but they are often based around creating bookmarks for oneself and making them public, or keeping them private but making them accessible to you from different locations instead of only from a single browser. In contrast, the system described below is designed to be used by a small group of individuals to share their recommendations about webpages, functioning a little like a mailing list. Because the system uses RDF, recommendations from different groups can easily be aggregated together.
The main issue with the input is the form of the RDF to be stored. We have used the Dublin Core vocabulary and RDF syntax to describe the resources, so as to ensure interoperability as far as possible. To describe the people who recommend webpages we have used an interim vocabulary, but the hope is that some more widely accepted vocabulary for describing individuals will become available.
A given recommendation is described in RDF as:
or for clarity, in XML:
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.0/"
         xmlns:r="http://rdf.desire.org/vocab/recommend.rdf#">
  <rdf:Description>
    <rdf:subject rdf:resource="http://www.bized.ac.uk" />
    <rdf:predicate rdf:resource="http://purl.org/dc/elements/1.0/Title" />
    <rdf:object>Biz/ed</rdf:object>
    <rdf:type rdf:resource="http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement"/>
    <r:attributedTo rdf:resource="http://www.ilrt.bris.ac.uk/12345" />
  </rdf:Description>
</rdf:RDF>
Additional nodes can be added for dc:Description and other Dublin Core properties.
This annotation of a webpage is not treated as an aspect of the webpage itself, but as a separate and well-defined comment about the webpage. A new node is created for the comment, with fields indicating that it is a reification (a statement about a statement), and an attributedTo field to specify who made the recommendation.
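As a sketch of this data model (Python, illustrative only; the statement identifier "stmt1" is a hypothetical placeholder, since in practice the system allocates the new node itself), a single recommendation reifies one Dublin Core statement and attributes it to a person:

```python
# Illustrative sketch of the reified recommendation; "stmt1" is a
# hypothetical statement identifier standing in for the allocated node.
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
R_NS = "http://rdf.desire.org/vocab/recommend.rdf#"

def reify(stmt_id, subject, predicate, obj, person):
    """Return the five triples describing one recommendation: a reified
    statement plus an attributedTo arc naming the recommender."""
    return [
        (stmt_id, RDF_NS + "subject", subject),
        (stmt_id, RDF_NS + "predicate", predicate),
        (stmt_id, RDF_NS + "object", obj),
        (stmt_id, RDF_NS + "type", RDF_NS + "Statement"),
        (stmt_id, R_NS + "attributedTo", person),
    ]

triples = reify("stmt1",
                "http://www.bized.ac.uk",
                "http://purl.org/dc/elements/1.0/Title",
                "Biz/ed",
                "http://www.ilrt.bris.ac.uk/12345")
```

Because the comment lives on its own node rather than on the webpage, aggregating many such recommendations never loses track of who said what.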
Similarly, we describe a person in RDF as:
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.0/"
         xmlns:r="http://rdf.desire.org/vocab/recommend.rdf#">
  <rdf:Description rdf:about="http://www.ilrt.bris.ac.uk/12345">
    <r:name>Libby Miller</r:name>
    <r:homepage rdf:resource="http://recommendations/groups/economics/12345.html" />
    <r:affiliation rdf:resource="http://recommendations/groups/economics" />
  </rdf:Description>
</rdf:RDF>
Note that the person making the recommendation has a unique identifier in the model, so that who made the recommendation is identifiable even if recommendations are aggregated. r:homepage could provide a description of the person, their interests and their qualifications. r:affiliation could refer to the webpage of the group to which the person making the recommendation is affiliated, perhaps containing a group recommendation policy or a statement of interests.
The input script accesses a password-protected servlet which writes the data as RDF triples into a SQL database. The data is stored in the database as triples, for example:
subject:    http://www.bized.ac.uk
predicate:  http://purl.org/dc/elements/1.0/Title
object:     "Biz/ed"
This is equivalent to Eric Miller's "naive triple store".
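A naive triple store amounts to a single three-column table. As a minimal sketch (with sqlite3 standing in for the SQL database used behind the servlet):

```python
import sqlite3

# Sketch of the "naive triple store": one table with subject, predicate
# and object columns. sqlite3 stands in for the servlet's SQL database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
db.execute("INSERT INTO triples VALUES (?, ?, ?)",
           ("http://www.bized.ac.uk",
            "http://purl.org/dc/elements/1.0/Title",
            "Biz/ed"))

# Look up the title of a resource.
row = db.execute(
    "SELECT object FROM triples WHERE subject = ? AND predicate = ?",
    ("http://www.bized.ac.uk",
     "http://purl.org/dc/elements/1.0/Title")).fetchone()
print(row[0])  # Biz/ed
```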
To access the data from the database we have had to develop an interim API for querying RDF databases. The two principal methods are:
if we have
One class accesses the database, and all servlet queries are directed through this class. Servlets are used to display the data as XML/RDF.
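The interim API's actual method signatures are not reproduced here. As a hypothetical sketch of the usual shape such a query interface takes, a single pattern-matching method (with None acting as a wildcard in any position) is enough to support the per-person and per-subject servlet queries described next:

```python
# Hypothetical sketch (the paper's actual method names are not shown):
# one pattern-matching query over a list of triples; None is a wildcard.
def match(triples, s=None, p=None, o=None):
    return [(ts, tp, to) for (ts, tp, to) in triples
            if (s is None or ts == s)
            and (p is None or tp == p)
            and (o is None or to == o)]

store = [("http://www.bized.ac.uk",
          "http://purl.org/dc/elements/1.0/Title",
          "Biz/ed")]

# All statements about a resource, regardless of property:
print(match(store, s="http://www.bized.ac.uk"))
```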
These are the core elements of the data output. Servlets can be constructed using individual queries to the database, for example for all the recommendations from a given person. Different views into the database can be produced by different queries, for example the URLs can be displayed by subject or by recommender, or according to the affiliation of the person making the recommendation.
The most interesting use of the model is aggregating the data from servlets such as this to produce combined data from different sources. To do this we used an RDF display tool which presents data from diverse sources as folders that can be opened and closed.
The display servlet is configured using RDF/XML files which provide the details of the relational database to use and the URIs of the RDF data sources that each individual using the system is interested in. The RDF is gathered into an in-memory RDF database for display in two distinct ways. RDF can be asserted directly from the local SQL triple store into the in-memory graph by querying the SQL database. Alternatively the RDF can be parsed from its serialized XML form into triples which are then entered into the in-memory database. We use the SiRPAC parser, which in turn uses the AElfred XML parser, to achieve this. In this case the serialized RDF may be the output of a Perl script, a servlet or other machine output, or it may be hand-coded. This architecture is similar to that adopted by the Mozilla RDF back-end architecture.
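SiRPAC and AElfred are Java components; as a stand-in sketch of the parse step, the following uses Python's standard XML parser to turn a very simple serialized rdf:Description into triples for an in-memory store (it handles only literal- and resource-valued properties, nothing like full RDF/XML):

```python
import xml.etree.ElementTree as ET

# Stand-in sketch for the SiRPAC/AElfred step: parse a simple serialized
# rdf:Description into (subject, predicate, object) triples.
RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"

xml_doc = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                      xmlns:r="http://rdf.desire.org/vocab/recommend.rdf#">
  <rdf:Description rdf:about="http://www.ilrt.bris.ac.uk/12345">
    <r:name>Libby Miller</r:name>
  </rdf:Description>
</rdf:RDF>"""

triples = []
for desc in ET.fromstring(xml_doc).findall(RDF + "Description"):
    subject = desc.get(RDF + "about")
    for prop in desc:
        # a property element holds either a resource reference or a literal
        obj = prop.get(RDF + "resource") or prop.text
        triples.append((subject, prop.tag, obj))
print(triples)
```

Once the serialized form is reduced to triples like these, they can be entered into the same in-memory database as the triples asserted from the SQL store.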
There are decisions to be made about the display of this data. If the display of the RDF is fixed then it is possible to hard-code the display of the data into the database. So for example, if we know that we want to display the recommendations by person, we can add the following RDF into the database:
http://www.ilrt.bris.ac.uk/12345 r:child http://www.bized.ac.uk
so that for the purposes of display, the person ID [http://www.ilrt.bris.ac.uk/12345] has as its child the resource [http://www.bized.ac.uk]. In this case, we can traverse the graph for display purposes and output results according to these display tags.
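A sketch of this traversal (Python, illustrative only; the second URL is invented purely so the folder has more than one entry): the display code queries the r:child arcs of a node and renders the results as entries in that node's folder:

```python
# Sketch of the display traversal over display-only r:child arcs.
# The second child URL is invented for illustration.
R_CHILD = "http://rdf.desire.org/vocab/recommend.rdf#child"

triples = [
    ("http://www.ilrt.bris.ac.uk/12345", R_CHILD, "http://www.bized.ac.uk"),
    ("http://www.ilrt.bris.ac.uk/12345", R_CHILD, "http://www.economist.com"),
]

def children(node):
    """Resources attached to `node` by a display-only r:child arc."""
    return [o for (s, p, o) in triples if s == node and p == R_CHILD]

# Render the person's folder: each child is one indented entry.
for url in children("http://www.ilrt.bris.ac.uk/12345"):
    print("  " + url)
```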
For a single view we require one container property and one non-container resource attached to the resources we are interested in displaying. If we wish to allow different views into the database, we could use different RDF properties and resources to represent each view. However, this makes for a large number of extra triples in the database. An alternative is to construct a view using a servlet, attaching display properties to the resources returned by querying the database for the resources we are interested in. Using this technique we can construct many different views of the database without entering the additional display triples into the database itself. Mozilla uses a sophisticated way of specifying this display data using XUL files.
This extremely simple API produces powerful results. Data from different sources can be combined, but information about the recommendation (such as the person who recommended it) is not lost, because of reification.
If there were many individuals using many recommendation systems which had the ability to output their data using XML/RDF, then aggregation services could create views into multiple databases simultaneously. In the next section we argue that the quality of the metadata generated in this way would be high. We compare this model of aggregating recommendations with alternative models of generation of metadata.
The recommender system described above is a way of generating metadata about the web. We have demonstrated that metadata generated in this way could be combined to produce an aggregate database. If there were many such instances of recommender systems outputting XML/RDF, there would be great potential for the construction of an open directory of web resources, similar to Netscape's Open Directory. However the model underlying such an aggregate recommendations database would not be the same as that used by the Open Directory. This section argues that an aggregate recommendations database would be a better source of metadata about the web than projects such as the Open Directory.
The Open Directory and similar projects are hierarchical, subject-based structures of human-created metadata about webpages. The Open Directory was the first of many similar projects relying on the voluntary contribution of metadata rather than paid contribution. The subject hierarchy is split into sections created and controlled by individuals with a special interest in the subject. Anyone can apply to become an editor, but in recent months there has been stronger evaluation of who can become an editor. Other human-based catalogues also use or part-use the Open Directory's editor-based control, for example the academic subject gateways and About.com.
A different model of aggregating metadata about the web is the collaborative filtering model, used by sites such as E-pinions, memepool, Muse.com and Six Degrees. Here, instead of the assignment of editorial control by subject, individuals' preferences are matched algorithmically by comparing their preference profile to those of other individuals or to preferences in aggregate. If people who like "Java by Example" also tend to like "The Perl cookbook", and I like "Java by Example", then there is some probability that I will also like "The Perl cookbook", and a great deal of effort has gone into discovering what this conditional probability is.
A related technique is to allow individuals to pick profiles of preferences for themselves. They find individuals with preferences or opinions similar or sympathetic to their own, and use that person as a guide to resources on the web.
These are both interesting and successful ways of aggregating metadata. In the next few paragraphs we examine each model briefly in terms of the structural determinants of the quality of the metadata produced by that method. In particular, we look at the incentives created by the model of collection for individuals to create high-quality metadata, and the evaluation and review processes of each model. Consider that human-created metadata is an attempt to enable more accurate and productive searches of the web than automatically-harvested search engines can provide. To fulfil this promise, human-created metadata must be of high quality, since low-quality metadata is already created automatically by harvesting search engines, which also have the advantage that they create much larger indexes than human-created catalogues can.
The Open Directory's quality control depends on the quality of the editors. Until recently there were few controls on who could become an editor, but now the staff look for people with experience in the subject they are interested in, and are able to pick and choose between candidates. Because editors work voluntarily, editors have the interest and commitment to want to edit a section of the directory. However, because editorial control of a section usually rests in the hands of one person, the editorial policy can also lead to editorial bias. A further difficulty is the creation of cross-references between different subjects, because although resources can be and are catalogued in more than one category, it is difficult to create the social mechanisms to ensure that the correct cross-references are made.
In contrast the collaborative filtering model leads to a diversity of opinion on any given subject, rather than a monolithic view. In addition, the collection of data has a strong evaluative component, because users of the service must evaluate the metadata that contributors provide in order to find the service useful. However, incentives to contribute to a collaborative filtering effort may be low, because there is not the control and potential prestige of editing a section of a directory.
In these two models there is a degree of tension between the incentive to contribute and the evaluation of the metadata. Using aggregate recommendations databases might resolve some of this tension, if we think of recommendations groups as being like collaborative bookmarking systems. There is a high incentive for individuals to create their own list of bookmarks because of the bookmarks' usefulness as a miniature catalogue of the web. For the same reason there would be an incentive for groups of individuals to share their bookmarks with each other: because it is useful to share information in this way, provided the interests of the members of the group are close enough to each other. This is why so much of the web is devoted to like-minded individuals sharing their favorite URLs, whether through mailing lists, lists of links, or other methods. A simple, easy-to-use recommendations system would facilitate this urge to share, while small groups of individuals sharing their recommendations could facilitate the clear definition of groups and resources, rather than a very large site classifying data as it arrives.
If a group of individuals finds a recommendations system useful to that group, for example as a way of storing and sharing bookmarks, then they have an incentive to contribute high quality metadata to that set of recommendations. So for example, a group of individuals who are interested in teaching resources for economics and business might form a recommendations group, and use the system as a useful archive of websites relevant to these subjects.
In this case sanctions are present for members of the group who wish to recommend resources that deviate from the policy of the group. As a last resort, transgressors can be dropped from the group, but ordinarily it will be in their interest to conform to the policy of the group for social reasons and because the recommendations become less useful to them if they are full of spam. A single recommender system is largely self-regulating and collectively controls the kinds of resources that are submitted to the system, provided the group is relatively small.
In terms of evaluation, the Open Directory model is a little like making one's own definitive list of bookmarks; the collaborative filtering model is like having your bookmarks list evaluated by a very large number of people, while the recommender system is a matter of sharing your bookmarks with a small number of like-minded people. Recommender systems in aggregate can be regarded as collaborative filtering with editorial policies and an extra layer of assurance about the quality of the metadata. The aggregation of recommendations could also improve the links between different subjects in a hierarchical catalogue, because graphs automatically overlay each other, so that resources catalogued under different subjects would have different subjects associated with them. Aggregating recommendations databases means that there need not be just one perspective on each subject.
This paper has argued that a recommendations system that outputs XML/RDF could be used to create an aggregate catalogue of metadata about the web from multiple recommendations groups. We have argued that this would lead to metadata that was group-edited and therefore more reliable than data submitted to a collaborative filtering site or an open directory. We have also argued that in small groups it is in the interest of individuals to use bookmarks, and therefore the incentive to contribute will be high.
The model implies that aggregation services would be very important for selecting, classifying, and providing a user interface to recommendations groups, and could form focal points within the web for accessing this potentially huge amount of free metadata. The free Open Directory data has been used by several portals as a place to search, enhancing the value of the data by providing a convenient interface to it and being enhanced by the availability of the data. It is becoming increasingly clear that the major sites on the web are those providing user-orientated networked services with an HTML interface. The aggregation of recommendations could mean a vast quantity of high-quality data is created for free. We have shown a way in which this could be very positive for the web, creating new swathes of high-quality human-created metadata about the web in an accessible form, provided that open standards and vocabularies are maintained.
I would like to thank Dan Brickley, Martin Poulter, Damian Steer and Jasper Tredgold for their comments on this paper.