Rule-based metadata crosswalks using RDF

A case study using classification scheme mapping

Author: Dan Brickley <daniel.brickley@bristol.ac.uk>
Version: 1.0 (April 16th 1999)

Latest version: http://purl.org/net/rdf/papers/classmap/

Abstract

This document describes a simple worked demonstration showing how RDF combined with an inference engine can be used to provide integrated views of data organised using differing metadata vocabularies. We take as a common example the problem of integrating divergent but mappable subject-classification schemes. On the basis of this, we propose an extension of our approach as a possible formalism within which metadata harmonisation and 'crosswalk' agreements might be articulated.

STATUS NOTE

Warning! -- I believe some of the technical details of these examples are broken (badly cut'n'pasted from elsewhere etc.). The mechanics of this have been demonstrated to work, but the specific examples here may contain bugs.
--dan


Background: A Common Problem

(Or... "Whatever happened to the Warwick Framework")

The data used here has been provided by two "Subject-based Internet gateways" [1]. The problem addressed in this demo is that such gateways, while providing well organised repositories of internet resource descriptions, invariably use different classification systems. In this case, we have a combination of data from Biz/ed (Business and Economics [2]) and SOSIG (Social Science resources [3]). Biz/ed uses a subset of the Dewey Decimal Classification, while SOSIG uses a subset of the UDC scheme. The examples that follow explore the possibilities created by having a machine-understandable representation of both the classification data and of the relationships between different classification systems.

RDF presents us with a simple graph-based formalism for representing such problems. Within RDF, differing vocabularies are represented as defining a number of 'resources' within a 'namespace'. To make everything unambiguous, we use URI identifiers to name each concept. In this paper, we use 'UDC' and 'DDC' as short human-friendly abbreviations for longer URI namespaces (eg. urn:dewey: or http://org.dewey/ or whatever namespace should finally be used for such purposes by the owners of these vocabularies). The key point is that, by adopting the same logical model to represent diverse metadata element sets, we can then express logical rules about the inter-relationships and mappings which describe how data can be transformed into multiple, overlapping RDF representations.
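
To make the abbreviation convention concrete, here is a minimal Python sketch of expanding a short name into a full URI. The namespace strings are placeholders only: 'urn:dewey:' is one of the candidates mentioned above, 'urn:udc:' is invented by analogy, and the function name expand is ours.

```python
# Placeholder namespaces; the real ones are for the vocabulary owners to decide.
PREFIXES = {
    "DDC": "urn:dewey:",  # candidate mentioned in the text
    "UDC": "urn:udc:",    # hypothetical, chosen by analogy
}

def expand(abbreviated):
    """Turn a short name such as 'DDC:658.8' into a full URI string."""
    prefix, _, local = abbreviated.partition(":")
    if prefix not in PREFIXES:
        raise KeyError("unknown namespace abbreviation: " + prefix)
    return PREFIXES[prefix] + local

print(expand("DDC:658.8"))
```

Because only the lookup table changes when a namespace is finally agreed, rules and data written with the abbreviations need not be rewritten.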

Classification Mapping: A Case Study

The remainder of this paper focusses on the classification scheme mapping application that we built in 1998. Our final remarks consider the extension of this technique to well known problems facing metadata initiatives such as the Dublin Core, Instructional Management Systems (IMS) and INDECS.

Software and example data from this case study will be made available from the RDF-DEV web site. Earlier versions of this work used the CORAL inference engine; we are now using the SiLRI system, and plan to make an integrated downloadable demonstrator package available so that the ideas proposed here can be easily tested elsewhere.

Note: these examples currently use a non-XML representation of RDF data graphs; the SiLRI engine can however load XML/RDF files using the SiRPAC parser plus a set of rules mapping URI-qualified properties onto shorter local names. An interactive RDF-based demonstrator using this dataset is in preparation.

Data and Rules

Data

There are three files for this demo: sosbizdata.simple, aboutschemes.krl and rules.krl. The first two contain facts about web resources and the schemes used to classify them; the third contains rules for inferencing based on the facts contained in the data files. Queries against this knowledge base are expressed in an additional file. For simplicity, we use abbreviations such as 'UDC:658.8' instead of giving full URIs for the classification scheme entries.

The main data file contains 10,685 assertions from the two resource description databases, taking the form:

(excerpt from sosbiz.simple):
	subject("http://asem.inter.net.th/","UDC:327").
	subject("http://asem2.fco.gov.uk/","UDC:327").
	subject("http://asiadata.pacificnet.com/","UDC:312").
	subject("http://web.mit.edu/sloan/www/","DDC:378").
	subject("http://web.mit.edu/sloan/www/","DDC:658").
	title("http://wwwgateway.ciesin.org/","CIESIN Information Network").
	title("http://wwwhost.cc.utexas.edu/ftp/pub/grg/gcraft/contents.html","Geographer's Craft Project ").
	title("http://wwwhost.cc.utexas.edu/world/instruction/index.html","World Lecture Hall").

These constitute a subset of the properties used in the two databases of resource descriptions. SOSIG descriptions ascribe 'subject' properties to web resources of the form "UDC:nnn", while Biz/ed descriptions use 'DDC:nnn'. The file aboutschemes.krl provides textual labels for these classifications (using the 'label' property), and additionally contains a number of statements about how these various classifications relate to one another. For example:

(excerpt from aboutschemes.krl):

	label("DDC:658.802","Marketing Management").
	label("DDC:658.8","Marketing").
	label("UDC:658","Business and Industrial Management").
	label("DDC:658.87","Retail Marketing").
	bt("DDC:658.802","UDC:658.8").
	bt("DDC:658.87","UDC:658.8").

This example tells us that the DDC concept "DDC:658.802" has a textual label "Marketing Management" and has as a broader term (bt) the UDC concept "UDC:658.8" (Marketing).

There are over 250 such statements, including 'bt' (broader term), 'nt' (narrower term), and 'synonym' relations between classification schemes. This mapping data derives from a short study undertaken in the DESIRE project [4].
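
Since every fact has the same shape, predicate("first","second")., the data files are easy to load into other tools. The following is a minimal Python sketch rather than part of the demo software: it assumes one binary fact per line with no escaped double quotes, and the name parse_facts is ours.

```python
import re

# One fact per line: predicate("first","second").
# Assumption: binary facts only, no escaped double quotes inside arguments.
FACT = re.compile(r'^\s*(\w+)\(\s*"([^"]*)"\s*,\s*"([^"]*)"\s*\)\.\s*$')

def parse_facts(text):
    """Return a set of (predicate, first, second) tuples found in text."""
    facts = set()
    for line in text.splitlines():
        match = FACT.match(line)
        if match:  # lines that are not facts (blank lines etc.) are skipped
            facts.add(match.groups())
    return facts

sample = '''
    label("DDC:658.802","Marketing Management").
    label("DDC:658.8","Marketing").
    bt("DDC:658.802","UDC:658.8").
    bt("DDC:658.87","UDC:658.8").
'''

facts = parse_facts(sample)
broader = sorted(f for f in facts if f[0] == "bt")
print(broader)
```

Holding the facts as plain tuples keeps the representation close to the RDF view of the data: each tuple is one arc of the graph.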

Rules

In the raw data above, we see statements which tell us facts such as:

The Dewey concept DDC:658.802 (Marketing Management) has as a broader-term the UDC concept UDC:658.8 (Marketing).

This suggests (to a human reader), that all resources which are classified as falling into the Dewey 'Marketing Management' category should also be considered implicitly members of the UDC 'Marketing' category. For web applications to be able to take advantage of this information, we need to make explicit the inferencing rules which allow us to draw such conclusions.

In this application we have used the F-Logic rules language; there are many similar such frameworks we could adopt. The crucial issue is the ability to map Web metadata into such a framework in a common manner.

This is done in the file schemerules.krl:


// "A resource is about a concept if a resource has a subject
// which is that concept..."

FORALL Resource, Concept 
   about(Resource,Concept) <- subject(Resource,Concept).

// "...or is about a synonym of that concept, "

FORALL Resource, Concept,X 
  about(Resource,Concept) :- subject(Resource,X), synonym(X,Concept).
FORALL Resource, Concept,X 
   about(Resource,Concept) :- subject(Resource,X), synonym(Concept,X).


// "or is about a concept that is a broader term than that concept"

// bt(X,Concept) reads "X has broader term Concept", as in aboutschemes.krl.

FORALL Resource, Concept,X 
  about(Resource,Concept) :- subject(Resource,X), bt(X,Concept).

The ability to express rules in this way opens up a great many possibilities. Software which understands declarative rules of this sort can answer queries which require knowledge of the facts implied by these rules.
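
For readers without an inference engine to hand, the effect of these rules can be approximated in a few lines of Python. This is a sketch under stated assumptions, not the SiLRI implementation: facts are held as sets of pairs, bt(A,B) is read as "A has broader term B" (the reading given for the data file), only one step of the hierarchy is followed, and the example.org resource is made up for illustration.

```python
def infer_about(subject, synonym, bt):
    """Derive about(Resource, Concept) pairs from subject, synonym and bt.

    subject: set of (resource, concept) pairs
    synonym: set of (concept, concept) pairs, usable in either direction
    bt:      set of (narrower, broader) pairs
    """
    about = set(subject)                   # explicit subjects
    for resource, x in subject:
        for a, b in synonym:               # synonyms, in either direction
            if a == x:
                about.add((resource, b))
            if b == x:
                about.add((resource, a))
        for narrower, broader in bt:       # one step up the hierarchy
            if narrower == x:
                about.add((resource, broader))
    return about

subject = {
    ("http://www.stir.ac.uk/marketing/", "DDC:658.8"),
    ("http://example.org/retail/", "DDC:658.87"),  # hypothetical resource
}
synonym = {("DDC:658.8", "UDC:658.8")}
bt = {("DDC:658.802", "UDC:658.8"), ("DDC:658.87", "UDC:658.8")}

about = infer_about(subject, synonym, bt)

# Everything that is about UDC:658.8, whether stated or implied.
hits = sorted(u for u, concept in about if concept == "UDC:658.8")
print(hits)
```

Neither resource carries a UDC subject in the input, yet both are returned, which is the behaviour the rules are intended to produce.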

This has immediate application in the creation of more intuitive user interfaces: unimportant differences (such as the choice of classification system) can be hidden from end-users. A user who expressed an information requirement using one scheme (eg. Dewey) could have the query satisfied by a resource which was never explicitly classified using that scheme. Since the web contains a great number of well organised repositories of information, but has no global classification system, such facilities would be extremely useful.

Just as this works for classification schemes, it should equally apply to traditional metadata elements such as 'title', 'creator', 'creator name' and so forth. We can be sure of this, since the underlying logical framework, RDF, makes no semantic distinction between these differing applications of RDF.

Worked Example

User's view
  1. User wants to find resources on Marketing
  2. User expresses this (via some graphical interface) as a query for the URL of web pages that are "about" the subject identified as "658.8" in the UDC classification system.
  3. Search agent consults the RDF database, with a query such as:
    FORALL U <- about (U, "UDC:658.8").
    (meaning "Find all the resources U where U has an "about" property with value UDC:658.8").
  4. The database looks for resources which are either explicitly stated to have that property, or which have an implied "--about-->UDC:658.8" property, when rules such as those above are taken into account.
  5. User gets list of resources matching their logical request, rather than just matching resources classified in the same scheme as the request.

    What happens

    Here is a transcript of the results of sending such a query to an instance of the inference engine loaded with the rules and data described above.
    
    
    	Query:
    	=====
    	FORALL U <- about (U, "UDC:658.8").
    
    	Results:
    	=======
    	U = "http://193.132.29.192/"
    	U = "http://cwis.kub.nl/~FEW/few/BE/marketin/journal1.htm"
    	U = "http://medoc.ipl.co.uk/mw0001/"
    	U = "http://www.adage.com/"
    	U = "http://www.aig.org/"
    	U = "http://www.aig.org/aig/home.html"
    	U = "http://www.ama.org/"
    	U = "http://www.ama.org/pubs/educator/index.html"
    	U = "http://www.ama.org/pubs/jm/index.html"
    	U = "http://www.ama.org/pubs/jminfo/index.html"
    	U = "http://www.ama.org/pubs/jmr/"
    	U = "http://www.ama.org/pubs/jppm/"
    	U = "http://www.ama.org/pubs/mm/pub1.htm"
    	U = "http://www.ama.org/pubs/smt/pub7.htm"
    	U = "http://www.amso.co.uk/index.htm"
    	U = "http://www.apnet.com/www/catalog/mk.htm"
    	U = "http://www.bowker-saur.co.uk/"
    	U = "http://www.businesseurope.com/index.html"
    	U = "http://www.chamber.co.uk/"
    	U = "http://www.cim.co.uk/"
    	U = "http://www.corporate-id.com/html/index.htm"
    	U = "http://www.dmu.ac.uk/Schools/Business/marketing.html"
    	U = "http://www.dmworld.com/"
    	U = "http://www.dti.gov.uk/ots/emic/"
    	U = "http://www.enterprisezone.org.uk"
    	U = "http://www.eubusiness.com/"
    	U = "http://www.euromonitor.com/"
    	U = "http://www.imc.org.uk/imc/coursewa/mba/mm-index.htm"
    	U = "http://www.industry.net/"
    	U = "http://www.iocom.be/pilot/cybermarketing/home.html"
    	U = "http://www.ipa.co.uk/"
    	U = "http://www.keynote.co.uk/"
    	U = "http://www.kingston.ac.uk/~bs_s570/kings1.htm"
    	U = "http://www.kingston.ac.uk/~market1/"
    	U = "http://www.lancs.ac.uk/users/mansch/manageme/depts/mktg.htm"
    	U = "http://www.lbs.ac.uk/marketin/wp_tt_95.htm"
    	U = "http://www.mailbase.ac.uk/lists/cti-acc-marketing/"
    	U = "http://www.mailbase.ac.uk/lists/retail/"
    	U = "http://www.man-bus.mmu.ac.uk/retmktg/index.html"
    	U = "http://www.marketingcouncil.org/"
    	U = "http://www.marketinguk.co.uk/"
    	U = "http://www.mcb.co.uk/cgi-bin/journal1/ejm"
    	U = "http://www.mcb.co.uk/cgi-bin/journal1/imr"
    	U = "http://www.mcb.co.uk/cgi-bin/journal1/jbim"
    	U = "http://www.mcb.co.uk/cgi-bin/journal1/jmpams"
    	U = "http://www.mcb.co.uk/jbim.htm"
    	U = "http://www.mcb.co.uk/liblink/ejm/jourhome.htm"
    	U = "http://www.mcb.co.uk/liblink/imr/jourhome.htm"
    	U = "http://www.mcb.co.uk/liblink/jmpams/jourhome.htm"
    	U = "http://www.mcb.co.uk/mlf/"
    	U = "http://www.mcb.co.uk/portfolio/mip/jourinfo.htm"
    	U = "http://www.mce.be/"
    	U = "http://www.milfac.co.uk/bisindex.html"
    	U = "http://www.monash.edu.au/marketing/mktjourn.html"
    	U = "http://www.n2h2.com/KOVACS/S0007s.html"
    	U = "http://www.scenario-planning.com/"
    	U = "http://www.stir.ac.uk/marketing/"
    	U = "http://www.stir.ac.uk/marketing/irs/eaercd/eaercd.htm"
    	U = "http://www.strath.ac.uk/Departments/Marketing/Index.html"
    	U = "http://www.templeton.ox.ac.uk/www/insts/oxirm/oxirm.htm"
    	U = "http://www.trainingforum.com/MRT/index.html"
    	U = "http://www2000.ogsm.vanderbilt.edu/"
    	(62 matches)
    
    
    Logical versus physical...

    Note that these are logical matches, and not simply the result of text-string matching against the raw 'physical' dataset.

    The query above finds 62 cases in which a resource is considered to be 'about' Marketing; the data itself only tells us that 42 resources fall into the UDC:658.8 category. This can be checked by doing a query such as "FORALL U <- subject (U, "UDC:658.8")", since our rules do not imply the presence of 'subject' except where there is a fact in the raw data to that effect. The remaining 20 resources in the result set above are not explicitly classified as UDC:658.8.

    Example Inference

    How did the system know that http://www.stir.ac.uk/marketing/ was about "Marketing (UDC:658.8)" ?
    The website http://www.stir.ac.uk/marketing/ is physically represented in the raw data as follows:

    
    subject("http://www.stir.ac.uk/marketing/","DDC:378").
    subject("http://www.stir.ac.uk/marketing/","DDC:658.8").
    title("http://www.stir.ac.uk/marketing/","University of Stirling, Department of Marketing").
    

    This information is a subset of the properties ascribed to that resource by a human cataloguer at the Business and Economics web catalogue. The problem here is a mismatch between the vocabulary used to describe the resource and the vocabulary used to locate it. The cataloguer said, in effect, "this resource has a subject of DDC:658.8", while the user query is expressed as a request for resources that are about the subject UDC:658.8.

    We know from our metadata about the schemes themselves that Dewey and UDC have very similar, in fact synonymous, categories for marketing. This can be expressed in machine-readable format:

    label("DDC:658.8","Marketing").
    label("UDC:658.8","Marketing").
    synonym("DDC:658.8","UDC:658.8").
    

    So... at this stage we have the following relevant facts:

    	subject("http://www.stir.ac.uk/marketing/","DDC:658.8").
    	synonym("DDC:658.8","UDC:658.8").
    

    We also have the following relevant logical rules:

    FORALL Resource, Concept,X 
      about(Resource,Concept) :- subject(Resource,X), synonym(X,Concept).
    
    FORALL Resource, Concept,X 
       about(Resource,Concept) :- subject(Resource,X), synonym(Concept,X).
    

    So... we know that the Resource "http://www.stir.ac.uk/marketing/" has a subject of "DDC:658.8", and we know that "DDC:658.8" has a synonym "UDC:658.8". This lets us conclude that "http://www.stir.ac.uk/marketing/" is about "UDC:658.8":

    
    FORALL Resource, Concept,X 
      about(Resource,Concept) :- subject(Resource,X), synonym(X,Concept).
    
      ie. we can conclude:
    
    	about("http://www.stir.ac.uk/marketing/","UDC:658.8")
    
    	because we know that:
    	
    	subject("http://www.stir.ac.uk/marketing/","DDC:658.8")
    	AND 
    	synonym("DDC:658.8","UDC:658.8")
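
The same derivation can be reproduced mechanically. The following Python sketch returns the facts that justify a given 'about' conclusion; it covers only the explicit-subject and synonym rules, holds facts as sets of pairs, and uses a helper name (explain_about) of our own choosing.

```python
def explain_about(resource, concept, subject, synonym):
    """Return the list of facts that justify about(resource, concept), or None."""
    if (resource, concept) in subject:
        return [("subject", resource, concept)]
    for res, x in sorted(subject):
        if res != resource:
            continue
        if (x, concept) in synonym:
            return [("subject", resource, x), ("synonym", x, concept)]
        if (concept, x) in synonym:
            return [("subject", resource, x), ("synonym", concept, x)]
    return None  # no justification under these two rules

STIR = "http://www.stir.ac.uk/marketing/"
subject = {(STIR, "DDC:378"), (STIR, "DDC:658.8")}
synonym = {("DDC:658.8", "UDC:658.8")}

for fact in explain_about(STIR, "UDC:658.8", subject, synonym):
    print(fact)
```

Keeping the supporting facts alongside each conclusion lets an application show a user why a resource matched a query expressed in a scheme its cataloguer never used.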
    
    

    Similarly, the inference engine could follow a longer chain of reasoning, for example to take greater advantage of the hierarchical nature of classification systems, or to use a common central scheme to provide a 'switching language' or interlingua for cases where there is no immediate mapping between the vocabulary used by a query and the vocabulary used to classify a resource.
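
A longer chain of this kind amounts to following 'bt' links repeatedly. As a minimal Python sketch, assuming bt(A,B) means "A has broader term B": the first link below is taken from the data file, while the second, from UDC:658.8 up to UDC:658, is an assumption added purely for illustration.

```python
def broader_closure(concept, bt):
    """Return every concept reachable from concept by following bt links upward."""
    seen = set()
    frontier = [concept]
    while frontier:
        current = frontier.pop()
        for narrower, broader in bt:
            if narrower == current and broader not in seen:
                seen.add(broader)        # remember it so cycles terminate
                frontier.append(broader)
    return seen

bt = {
    ("DDC:658.802", "UDC:658.8"),  # from aboutschemes.krl
    ("UDC:658.8", "UDC:658"),      # assumed link, for illustration only
}

print(sorted(broader_closure("DDC:658.802", bt)))
```

A resource classified under DDC:658.802 could then be offered in answer to a query about either of the broader UDC concepts, and a central 'switching' scheme would be reached in the same way, by chaining mappings through it.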

    Further Examples

    These mechanisms could form the basis for RDF Query services. For example, we might want to find out about "http://www.stir.ac.uk/marketing/" by sending a query to a trusted metadata bureau asking about its classification. The following (clumsy) query shows how the rules above would allow such a query to be serviced, hiding the details of the actual records stored in the database.

    
    
    Query:
    =====
    
    FORALL Title, Classification,  Label 
    <- about ("http://www.stir.ac.uk/marketing/",Classification),
       label (Classification, Label),
       title ("http://www.stir.ac.uk/marketing/",Title).
    
    Results:
    =======
    Title = "University of Stirling, Department of Marketing"
    Classification = "DDC:378"
    Label = "Higher Education Departments"
    
    Title = "University of Stirling, Department of Marketing"
    Classification = "DDC:658.8"
    Label = "Marketing"
    
    Title = "University of Stirling, Department of Marketing"
    Classification = "UDC:658.8"
    Label = "Marketing"
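
The query above is a three-way join over 'about', 'label' and 'title'. A Python sketch of the same join follows; the data structures (a set of pairs and two dictionaries) and the function name describe are our assumptions, and the 'about' pairs are taken as already inferred.

```python
STIR = "http://www.stir.ac.uk/marketing/"

about = {(STIR, "DDC:378"), (STIR, "DDC:658.8"), (STIR, "UDC:658.8")}
label = {
    "DDC:378": "Higher Education Departments",
    "DDC:658.8": "Marketing",
    "UDC:658.8": "Marketing",
}
title = {STIR: "University of Stirling, Department of Marketing"}

def describe(resource, about, label, title):
    """Return sorted (Title, Classification, Label) rows for one resource."""
    return sorted(
        (title[resource], concept, label[concept])
        for res, concept in about
        if res == resource and concept in label and resource in title
    )

for row in describe(STIR, about, label, title):
    print(row)
```

As in the rule-based version, a classification with no label simply produces no row, because every condition in the query must hold.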
    


    Metadata Harmonisation

    A further application of this approach is as a formal representation of the agreements and mappings reached in metadata harmonisation efforts such as those currently underway relating to the Dublin Core initiative. This in turn suggests a re-conceptualisation of the notion of a Metadata Registry. If we had a standard mechanism for expressing these logical rules for RDF, then inter-relationships between metadata vocabularies would just become more metadata, ie. more machine-understandable statements. These might be made available via one or more well-known starting points on the Web (a 'registry' in the old sense). More importantly, the registered mapping would be "just another RDF graph" and as such, serialisations of this graph into XML could be stored anywhere on the Web, and digital signatures over those stored RDF assertions could be used to verify the provenance of the claimed mappings.

    A simple (and fictional) example follows. Note again that we use short name abbreviations to stand in for full Web URIs. Each concept identified here should have a unique name on the Web so that such rules can be shared and combined unambiguously.

    Mapping legacy Dublin Core into DC 1.0
    DC-OLD:Creator.Email ( Resource, EmailAddress)
    :-
    DC1:Creator(Resource, X),
    RDF:value(X,EmailAddress),
    DCQ:CreatorType(X,DCQ:Email).
    
    Example:
    DC-OLD:Creator.Email(http://www.ilrt.bris.ac.uk/discovery/rdf-dev/,
    				"daniel.brickley@bristol.ac.uk")
    IMPLIES...
    DC1:Creator(http://www.ilrt.bris.ac.uk/discovery/rdf-dev/, resource_1),
    RDF:value(resource_1, "daniel.brickley@bristol.ac.uk"),
    DCQ:CreatorType(resource_1,DCQ:Email).
    
    (this could be drawn out as an RDF data graph)
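
Read as a rule, the mapping derives a DC-OLD:Creator.Email statement whenever the three DC 1.0 statements in its body hold for some intermediate node. A Python sketch of that reading follows, with the graph held as (subject, property, value) triples; the triple representation and the function name are our assumptions, and the vocabulary remains as fictional as in the example above.

```python
RDF_DEV = "http://www.ilrt.bris.ac.uk/discovery/rdf-dev/"

triples = {
    (RDF_DEV, "DC1:Creator", "resource_1"),
    ("resource_1", "RDF:value", "daniel.brickley@bristol.ac.uk"),
    ("resource_1", "DCQ:CreatorType", "DCQ:Email"),
}

def legacy_creator_email(triples):
    """Derive DC-OLD:Creator.Email triples from DC 1.0 style triples."""
    derived = set()
    for resource, prop, node in triples:
        if prop != "DC1:Creator":
            continue
        # The creator node must be typed as an email address.
        if (node, "DCQ:CreatorType", "DCQ:Email") not in triples:
            continue
        for s, p, value in triples:
            if s == node and p == "RDF:value":
                derived.add((resource, "DC-OLD:Creator.Email", value))
    return derived

print(legacy_creator_email(triples))
```

An application written against the legacy vocabulary could query the derived triples without knowing that the stored data uses the qualified DC 1.0 form.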
    
    Mapping DC 1.0 into an 'agents based' model (fictional!)

    Similarly, we might assert (and others might agree or otherwise, hence the potential of digital signatures) that the presence of various Dublin Core 1.0 property qualifiers is indicative of the existence of one or more "implied entities", for example, the shadowy "Secret Agent" concept which appears to be implied by the Dublin Core Creator, Publisher and Contributor elements. Similarly, we can map across namespaces just as easily as within them: RDF doesn't care so long as each unique identifiable thing has a unique identifier. If IMS or INDECS concepts were given Web identifiers (and these might be URLs or URNs, since URIs are a superset of these), we can use them in RDF assertions that describe crosswalks.

    Conclusions

    A powerful framework has been outlined here showing how logical rules expressed over RDF data can be used to express sophisticated inter-vocabulary mappings in a mechanically processable fashion. By adopting RDF's simple graph-based data model, and by using the URI Web identifier to name the components of our metadata vocabularies, it becomes possible to use metadata to describe metadata. Vocabularies (ontologies, schemata, classification systems) become first class occupants of the Web, and RDF is used to express rich inter-relationships between those objects. We have demonstrated the utility of a rules-based formalism defined over RDF. The prospect of standardising upon some such mechanism is attractive as it suggests a way in which problems facing metadata vocabulary harmonisation agreements, crosswalks and versioning issues might be articulated in a machine-understandable formalism.


    Acknowledgements

    The case study reported here is based on collaborative work undertaken with Stefan Decker and Janne Saarela. This is described in more detail in the paper 'A Query Service for RDF', submitted to the W3C Query Languages for the Web workshop in December 1998. Note: a draft of the case-study portion of this paper was previously circulated under the title 'Classification Scheme Mapping - a simple demonstration using RDFIE and SiRPAC'.

    The discussion of the application of this technology to metadata harmonisation issues has its origins in ongoing discussion with Eric Miller of the Dublin Core initiative, and more recently, in the insights of the IMS (Tom Wason, Steve Griffin) and INDECS (Godfrey Rust) projects about the fundamental similarities underlying these currently divergent vocabularies.

    This work is supported in part by the European Union Telematics for Research project DESIRE, by JISC through the JTAP technology applications programme and, previously, under the Electronic Libraries programme which funded the ROADS and Biz/ed resource discovery projects.




    References


    [1] Subject Based Information Gateways (a DESIRE report)
    URL: http://www.desire.org/results/discovery/cat/sbig_des.htm
    [2] Biz/ed, a Business and Economics Internet Catalogue
    URL: http://www.bized.ac.uk/
    [3] SOSIG, the Social Science Information Gateway
    URL: http://www.sosig.ac.uk/
    [4] Mapping Classification Schemes
    URL: http://www.desire.org/results/discovery/cat/mapclass_des.htm