
IMesh DB: A Model and Implementation of Information Discovery for the IMesh Toolkit

Author: Libby Miller

Date: 2001-02-14
Latest version:

Abstract

This paper is about modelling, storing and querying complex metadata about people and organisations. It describes the use of an ABC-like metadata model to represent contextual information about the discovery and creation of metadata. It also describes the use of the Squish query language and Java implementation to store and query this metadata in RDF.
See also the demo.

Contextualizing Metadata Generation

The IMesh toolkit is a project which aims to improve the software available to subject gateways and other quality-controlled metadata aggregators. This part of the project sets out to increase the amount of information about who the subject gateways are, and hence who the 'subject gateway community' is, with reference to the members of the IMesh mailing list, a list set up at the first IMesh workshop in 1998.

The idea of the IMesh DB was to help the project with scoping issues (who are the subject gateways?) and to enable it to gather information about the technology requirements of the subject gateways. There is no clear technical way of separating the subject gateways from similar providers of quality-assessed information about Internet resources [1], yet the project was designed to help the 'subject gateway community', so the problem needed to be scoped more directly. One source of information about the people in the subject gateway community was the IMesh mailing list [2], a mailing list explicitly set up for discussion about subject gateways.

From the information about people obtained from the mailing list (email address and name), we then needed to find information about the sorts of projects and services they worked on, information that was often publicly available, but scattered across diverse locations on the web.

In the process of researching the members of the IMesh mailing list, we began to think about how to represent the information we were discovering. For example, by searching on a person's email address in Google [3], we could find more information about them, perhaps from a signature file in a post to a mailing list. From that we could find out who they worked for and what projects they worked on. While doing this we recognised that some sources of this type of information were more trustworthy than others. To an extent, the trustworthiness of a document can be evaluated by the person looking at it; however, mistakes can be made, and we will sometimes need to remove all the information from a certain source from the database. Trustworthiness will also depend on who found the information and when. What, who and when provide the means to contextualize the discovery and reporting of information found on the web. This made us think about how to model the process of information discovery, rather than just the information we had discovered.

Our plan for the IMesh database was to open it up so that anyone from the list could add information to it, about themselves, their projects or organisations, or about people or organisations they knew about. This provided a further incentive for careful modelling of the relationships between context, discoverer and content, because the lack of access control and the decentralized nature of the database implied that data from certain sources might need to be deleted. In this environment the trustworthiness of the information is also likely to depend on who is claimed to have stated it (although in a completely non-access-controlled system, the person who is claimed to have said something cannot be assumed to be the person who actually said it; the important thing is to identify the unit of assertions so that it can be removed from the database if required).

Contextualizing metadata about web pages is an important aspect of quality-assessed, hand-described resource discovery services such as Sosig [4]. Attributing generated metadata to certain people at a certain time is required for internal quality control and auditing, examples being link checking and ensuring that the descriptions of sites are accurate. Attributing records to people also provides a further level of quality assurance, in which the qualifications and experience of the resource describers can be assessed independently of the organisation or service they work for.

What we were doing - describing people and organisations using web pages as references - is similar but not identical to the process of writing metadata about webpages. Instead of a general description of aspects of the webpage, we were making more detailed claims about what the webpage said about people and things. So the emphasis was not on describing the page, but on describing aspects of what the page said. This leads to difficult questions of attribution and of the legal and ontological status of discovered data; and it requires that the relationship between the context, the discoverer, the webpage, and the content of the assertions discovered in the webpage be precisely defined.

Modelling the IMesh Data

My research is co-funded by the IMesh toolkit project [5] and the Harmony project [6]. The Harmony project is a series of initiatives aimed at understanding how to model the metadata of complex objects in an unambiguous way, focussing on hard cases such as rights management. It uses an event-centric approach, which we considered appropriate for describing the process of creation of the metadata, rather than just the direct metadata about the web resource.

Although not restricted to using the Resource Description Framework (RDF), the ABC model is fully compatible with RDF and may be described in it, either as an entity-relationship diagram or using the RDF XML serialization syntax. Using RDF both to model and store the data gave us the opportunity to try some of the RDF tools that we have been working on. Using RDF also has the advantage that the resultant data is machine-processable using freely available tools, and the model of the data is extremely extensible (see the conclusions below).

This section describes a little of the theory behind the ABC model, and then describes in some detail my approach to modelling the IMesh data. The following section describes the use of an RDF query language and implementation (Squish [7]) to store and query the data.

The ABC Model

The ABC model is a 'Logical Model for Metadata Interoperability', meaning that it attempts to define the commonalities between different metadata vocabularies. One way the model facilitates interoperability is by defining certain sorts of things that commonly need to be described in different ontologies, such as people, documents, times and places. Additionally, ongoing work in ABC has focussed on defining precisely the relationships between people, documents and so on. An important aspect of the model is its event-centric approach, in which a resource is transformed or created within a certain context, that is, at a certain date, time and location. We felt that an approach such as this could enable the unambiguous definition of the act or event of creating metadata from a certain source.

One important part of the ABC model is the idea of inputs and outputs to an event. An example might be a review event, whose input is a book and whose output is a review document.

The review writing event has a context - a date, time and place - which means that we can disambiguate it from other similar review writing events.

In a similar way, we can describe the reporting of content on a webpage as a content report event, similar to a review or annotation event but without the creation of new content. This is a similar idea to the creation of a catalogue of metadata about resources, but with the emphasis not on describing the page itself, but on describing people or things, with the page as a reference.

So in a content report event the input is a webpage or other addressable source of information, and the output is metadata, which is simultaneously about the page and about the things described on the page. One can view this process as a formalization of information in HTML into RDF, and so it could be seen as a stopgap until all information is described in RDF or some other machine-processable format.

Here is a node-arc diagram of the ABC-like model we used for the description of the IMesh data.

The model step by step

So now we have a reasonably clear model in RDF of a report of (some of) the content of a webpage. We can represent this in RDF/XML:

         <rdf:RDF xmlns:abc="http://xmlns.com/abc/1.0/"
         xmlns:an="http://rdf.desire.org/vocab/recommend.rdf#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:foaf="http://xmlns.com/foaf/0.1/"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/TR/WD-rdf-schema#"
         xmlns:sm="http://rdf.desire.org/vocab/sitemap.rdf#">
         <an:contentReportEvent>
                 <abc:contribution>
                         <abc:Action>
                                 <abc:agent>
                                         <foaf:Person rdf:about="mailto:liy0ZwaLv9UTM">
                                                 <foaf:mbox rdf:resource="mailto:liy0ZwaLv9UTM" />
                                         </foaf:Person>
                                 </abc:agent>
                                 <abc:date>Thu Jan 18 13:13:52 GMT+00:00 2001</abc:date>
                         </abc:Action>
                 </abc:contribution>
                 <abc:input rdf:resource="http://www.ilrt.bris.ac.uk/aboutus/staff?search=cmmlp" />
                 <abc:output>
                         <rdf:Bag>
                                 <rdf:li>
                                         <rdf:Description>
                                                 <rdf:subject rdf:resource="mailto:m.1RS3IJVLxHI" />
                                                 <rdf:predicate rdf:resource="http://xmlns.com/foaf/0.1/projectHomepage" />
                                                 <rdf:object rdf:resource="http://econltsn.ilrt.bris.ac.uk/" />
                                                 <rdf:type rdf:resource="http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement" />
                                         </rdf:Description>
                                 </rdf:li>
                                 <rdf:li>
                                         <rdf:Description>
                                                 <rdf:subject rdf:resource="mailto:m.1RS3IJVLxHI" />
                                                 <rdf:predicate rdf:resource="http://xmlns.com/foaf/0.1/workplaceHomepage" />
                                                 <rdf:object rdf:resource="http://www.ilrt.bris.ac.uk/" />
                                                 <rdf:type rdf:resource="http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement" />
                                         </rdf:Description>
                                 </rdf:li>
                                 <rdf:li>
                                         <rdf:Description>
                                                 <rdf:subject rdf:resource="mailto:m.1RS3IJVLxHI" />
                                                 <rdf:predicate rdf:resource="http://xmlns.com/foaf/0.1/name" />
                                                 <rdf:object>Martin Poulter</rdf:object>
                                                 <rdf:type rdf:resource="http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement" />
                                         </rdf:Description>
                                 </rdf:li>
                         </rdf:Bag>
                 </abc:output>
         </an:contentReportEvent>
         </rdf:RDF>

Implementing an RDF database in Squish

Having decided on an RDF model for the data, we then created a database to store and query the data using Squish, the SQL-like query language for RDF [7]. We implemented Squish in Java on top of a fairly standard RDF API, with a JDBC interface, so that any database which offers a simple RDF API and a simple JDBC driver can be queried using Squish. We used a Postgres database for the backend, and JSPs to generate and search the data. We had already done most of the work for this, so setting it up was a relatively trivial task.
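
To make the setup concrete, here is a minimal sketch of issuing a Squish query through the JDBC interface just described. The driver class name, the connection url, and the assumption that each query variable comes back as a named result column are illustrative guesses, not the actual Squish distribution's names:

 import java.sql.Connection;
 import java.sql.DriverManager;
 import java.sql.ResultSet;
 import java.sql.Statement;

 public class SquishQueryExample {
     public static void main(String[] args) throws Exception {
         // Hypothetical driver class and connection url, for illustration only.
         Class.forName("org.example.squish.SquishDriver");
         Connection con =
             DriverManager.getConnection("jdbc:squish:postgresql://localhost/imeshdb");

         String query =
             "SELECT ?predicate, ?object WHERE "
             + "(rdfs::subject ?rs mailto:libby.miller@bristol.ac.uk) "
             + "(rdfs::object ?rs ?object) "
             + "(rdfs::predicate ?rs ?predicate) "
             + "USING rdfs for http://www.w3.org/1999/02/22-rdf-syntax-ns#";

         Statement st = con.createStatement();
         ResultSet rs = st.executeQuery(query);
         // Assume each query variable is returned as a named column.
         while (rs.next()) {
             System.out.println(rs.getString("predicate") + " " + rs.getString("object"));
         }
         rs.close();
         st.close();
         con.close();
     }
 }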

As a first pass, we created JSP pages to generate the RDF and insert it into the database (using the XML representation and SiRPAC, the RDF parser), storing the data as a table of the basic triples which SiRPAC outputs. This had been sufficient for a number of small databases we had created for similar purposes. We created a search interface which allowed the database to be queried by email address or by url. So, entering the (exact) mail address libby.miller@bristol.ac.uk into the web form would bring back the name of the person with that mail address, the projects they worked on, and their workplace homepage. The Squish query we used for this was:


 SELECT ?predicate, ?object 
 WHERE
(rdfs::subject ?rs mailto:libby.miller@bristol.ac.uk)
(rdfs::object ?rs ?object)
(rdfs::predicate ?rs ?predicate)   
 USING rdfs for http://www.w3.org/1999/02/22-rdf-syntax-ns#
 

Similarly, entering (exactly) http://www.imesh.org/toolkit/ would bring back any descriptions and names associated with that project. Here is the Squish query used:

 
 SELECT ?predicate, ?object 
 WHERE
(rdfs::subject ?rs http://www.imesh.org/toolkit/)
(rdfs::object ?rs ?object)
(rdfs::predicate ?rs ?predicate)   
 USING rdfs for http://www.w3.org/1999/02/22-rdf-syntax-ns#

Once people had made an initial query in one of these two ways, they could browse the links between people on the same project or at the same workplace, using the same sorts of queries.

Note that these queries exploit the fact that we have identified people with their mailbox address, and documents with their url. If we do not do this, then the queries become substantially more complicated: instead of


 SELECT ?predicate, ?object 
 WHERE
(rdfs::subject ?rs mailto:libby.miller@bristol.ac.uk)
(rdfs::object ?rs ?object)
(rdfs::predicate ?rs ?predicate)   
 USING rdfs for http://www.w3.org/1999/02/22-rdf-syntax-ns#

we have


 SELECT ?predicate, ?object 
 WHERE
(foaf::mbox ?person mailto:libby.miller@bristol.ac.uk)
(rdfs::subject ?rs ?person)
(rdfs::object ?rs ?object)
(rdfs::predicate ?rs ?predicate)   
 USING rdfs for http://www.w3.org/1999/02/22-rdf-syntax-ns#
and foaf for http://xmlns.com/foaf/0.1/

Discussion

The immediate problem with this approach was that the initial search people needed to make, to get somewhere to start browsing from, had to be exactly some mail address or url in the database. Looking at the logs, people were typing less well-known urls for the same organisation or project. This problem indicates that a more sophisticated approach is needed to model webpages, which may have more than one url. But the more significant problem was simply a usability issue: people are used to searching by keyword, and will not take the trouble to find out or use a url or email address.

So our next approach was to create a general keyword search of the database. We thought it would be straightforward: a crude way is simply to search the object field of the RDF database. However, there were several problems with this approach. One comes from the way Squish is implemented. Currently Java Squish cannot devolve query planning and other features to the underlying database, even when the database is capable of them. So text matching is done after the RDF query is made, by the Java code itself. This means that a general search of the object field requires that every object be pulled out of the database to be matched against the keyword.
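
A simplified sketch of the consequence, assuming the generic triplesMatching(subject, predicate, object) call described in the next section, with null fields acting as wildcards; the class shapes are illustrative, not the actual Squish code:

 import java.util.ArrayList;
 import java.util.List;

 class Triple {
     final String subject, predicate, object;
     Triple(String s, String p, String o) { subject = s; predicate = p; object = o; }
 }

 interface RdfDb {
     // null fields act as wildcards, so (null, null, null) returns every triple
     List<Triple> triplesMatching(String subject, String predicate, String object);
 }

 class KeywordSearch {
     // Text matching cannot be pushed down into SQL, so every candidate
     // triple is pulled into Java and filtered here, one by one.
     static List<Triple> objectContains(RdfDb db, String keyword) {
         List<Triple> hits = new ArrayList<Triple>();
         for (Triple t : db.triplesMatching(null, null, null)) {
             if (t.object.indexOf(keyword) >= 0) {
                 hits.add(t);
             }
         }
         return hits;
     }
 }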

Our alternative strategy was to try to reduce the number of resources lifted from the database by searching by person or by organisation/project, as appropriate. So for people we query the objects of reified statements with predicate name; for projects and organisations, we query the objects of reified statements with predicate title or description. But because the reified structure makes all the data look very similar, searching was still very slow: the query still pulls out all the reified statements and processes them.

So finally we decided that a statements table was required to speed up this kind of search (and incidentally the more specific searches as well, because of the overhead of making a two or three-clause query rather than a single-clause one). We altered the input classes to track reified statements and store them in a separate table, as if they were non-reified triples. For example, using a generic database table

subject | predicate | object | ID

we go from

genid:14 | http://www.w3.org/1999/02/22-rdf-syntax-ns#subject | mailto:m.1RS3IJVLxHI | var:1234524365
genid:14 | http://www.w3.org/1999/02/22-rdf-syntax-ns#predicate | http://xmlns.com/foaf/0.1/projectHomepage | var:8765546660
genid:14 | http://www.w3.org/1999/02/22-rdf-syntax-ns#object | http://econltsn.ilrt.bris.ac.uk/ | var:1609056967678
genid:14 | http://www.w3.org/1999/02/22-rdf-syntax-ns#type | http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement | var:1632456562

to

mailto:m.1RS3IJVLxHI | http://xmlns.com/foaf/0.1/projectHomepage | http://econltsn.ilrt.bris.ac.uk/ | genid:14
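
As a sketch of what the altered input classes do, the following JDBC fragment writes such a collapsed statement row into a separate statements table. The table schema, column names and connection url are assumptions for illustration, not the project's actual schema:

 import java.sql.Connection;
 import java.sql.DriverManager;
 import java.sql.PreparedStatement;

 public class StatementTableWriter {
     // Once the four reification triples for one genid have been collected,
     // store them as a single row, as if they were a non-reified triple.
     static void storeStatement(Connection con, String id, String subject,
                                String predicate, String object) throws Exception {
         PreparedStatement ps = con.prepareStatement(
             "INSERT INTO statements (subject, predicate, object, id) VALUES (?, ?, ?, ?)");
         ps.setString(1, subject);
         ps.setString(2, predicate);
         ps.setString(3, object);
         ps.setString(4, id); // e.g. "genid:14", linking back to the reified node
         ps.executeUpdate();
         ps.close();
     }

     public static void main(String[] args) throws Exception {
         Connection con = DriverManager.getConnection(
             "jdbc:postgresql://localhost/imeshdb", "user", "password");
         storeStatement(con, "genid:14",
             "mailto:m.1RS3IJVLxHI",
             "http://xmlns.com/foaf/0.1/projectHomepage",
             "http://econltsn.ilrt.bris.ac.uk/");
         con.close();
     }
 }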

In fact, this is how many applications (e.g. [9]) store RDF triples, with a flag for whether the triple is reified or not. The problem was then how to tell the database, via the RDF API, the JDBC interface and Squish, that we wanted to search the statements table of the SQL database rather than the default triples table. To solve this we changed the query parser to allow a FROM clause which sets the table to search.

This is a hack; the storage of statements requires a great deal more thought. The problem with storing all triples as statements with a 'reified or not' flag is that complex queries have to be rewritten to accommodate the fact that the triples storage is not straightforward. It is no problem for basic queries of reified triples of the form ?s ?p ?o; the problem occurs when you want to know about nodes which point to the reified node, which is no longer stored as a node in its own right. This aspect of the problem has not been resolved.

However, the searching was now fast, and the database worked quite well. Currently it only displays basic information about people and organisations/projects, using the following queries:

 
 SELECT ?rs, ?object 
 FROM statements 
 WHERE
  (http://purl.org/dc/elements/1.1/description ?rs ?object)
 AND ?object ~ term 


 SELECT ?rs, ?object 
 FROM statements 
 WHERE
  (http://purl.org/dc/elements/1.1/title ?rs ?object)
 AND ?object ~ term 


 SELECT ?rs, ?object 
 FROM statements 
 WHERE
  (http://xmlns.com/foaf/0.1/name ?rs ?object) 
 AND ?object ~ term  

for text-matching queries, and


 SELECT ?predicate, ?object 
 FROM statements
 WHERE (?predicate mbox ?object)


 SELECT ?person, ?predicate, ?object 
 FROM statements 
 WHERE 
 (?predicate ?person uri)


 SELECT ?predicate, ?object 
 FROM statements 
 WHERE 
 (?predicate uri ?object)

However, all the information is present in the database, so we can also specify where the information came from, who found it, and how up to date it is. We have implemented this in an experimental page, using the following query:


 SELECT ?person, ?input, ?pred, ?obj 
 FROM triples 
 WHERE 
 (http://www.w3.org/1999/02/22-rdf-syntax-ns#subject ?rs mbox)
 (http://www.w3.org/1999/02/22-rdf-syntax-ns#predicate ?rs ?pred) 
 (http://www.w3.org/1999/02/22-rdf-syntax-ns#object ?rs ?obj) 
 (?bag ?li ?rs) 
 (http://xmlns.com/abc/1.0/output ?crevent ?li) 
 (http://xmlns.com/abc/1.0/input ?crevent ?input) 
 (http://xmlns.com/abc/1.0/contribution ?crevent ?agent) 
 (http://xmlns.com/abc/1.0/agent ?agent ?person) 

This is extremely slow to run using the current configuration. The plan is to test various backends to improve the speed of very complex queries such as this one.

Squish queries

There are general difficulties with the approach Squish takes to making queries like this when bags are involved. In the query above, the clause

(?bag ?li ?rs)

i.e. with three free variables, picks out every triple in the database. This is because, even though at this point in the query the value of the ?rs variable will be known, Squish currently performs queries in a very simple fashion. It does not substitute known values in for variables, but makes each clause's query as it stands and then joins the result sets together. For more about Squish see [10].
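
In outline, that strategy looks something like the sketch below: each clause is fetched from the store on its own, and the result sets are then joined on shared variable names. The code is illustrative, not the actual Squish implementation; the point is that a clause with three free variables fetches the whole database before the join throws most of it away.

 import java.util.ArrayList;
 import java.util.HashMap;
 import java.util.List;
 import java.util.Map;

 class NaiveJoin {
     // Each Map is one row of bindings, e.g. {"?rs" -> "genid:14", ...}.
     // Clauses are evaluated independently, then joined pairwise here;
     // bindings already known (such as ?rs) are never substituted in first.
     static List<Map<String, String>> join(List<Map<String, String>> left,
                                           List<Map<String, String>> right) {
         List<Map<String, String>> out = new ArrayList<Map<String, String>>();
         for (Map<String, String> l : left) {
             for (Map<String, String> r : right) {
                 if (compatible(l, r)) { // rows must agree on shared variables
                     Map<String, String> merged = new HashMap<String, String>(l);
                     merged.putAll(r);
                     out.add(merged);
                 }
             }
         }
         return out;
     }

     static boolean compatible(Map<String, String> l, Map<String, String> r) {
         for (String name : l.keySet()) {
             if (r.containsKey(name) && !r.get(name).equals(l.get(name))) {
                 return false;
             }
         }
         return true;
     }
 }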

A more generic problem is with queries over reified statements. As mentioned above, such queries can be extremely slow, because of the potentially large number of reified statements in a given database, and because of the complexity of finding an exact match for a reified triple: the need for three or four subqueries. This is because Squish runs over the top of a very generic RDF API; essentially all it needs to function is that the RDF database have an insert(RDFGraph) method and a triplesMatching(subject, predicate, object) method. With this constraint, many of the possible options for optimizing reified statements become impossible. For example, this API only allows one clause at a time to be queried from the database, so it is not possible for the query engine to recognise that it is being asked about a reified statement (possibly through an improved syntax) and ask the database directly about its reified statements, should the database have a better storage method. Similarly, even if the query engine knew that the database had more efficient methods of storing the reified data, the RDF API limits the amount of additional information you can get back about a triple. For example, the triple's identifier, which could serve as a statement identifier, is not returned from the SQL database, because the in-memory RDF database cannot store this information and there is no API to return it; and even if there were, the ResultSet metaphor is not a good way to return this information, because it returns rows, not triples.
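
For concreteness, the API the text describes has roughly the following shape. The two method names come from the description above; the surrounding types are illustrative assumptions:

 import java.util.List;

 class Triple {
     final String subject, predicate, object;
     Triple(String s, String p, String o) { subject = s; predicate = p; object = o; }
 }

 // An RDF graph to be inserted; the accessor is assumed, for illustration.
 interface RDFGraph {
     List<Triple> triples();
 }

 interface RdfDatabase {
     // Add every triple in the graph to the store.
     void insert(RDFGraph graph);
     // Return all triples matching the pattern (null fields as wildcards).
     // Only one clause can be asked at a time, and nothing richer than bare
     // triples - no statement identifiers, for instance - can come back.
     List<Triple> triplesMatching(String subject, String predicate, String object);
 }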

One way forward is to have the database declare certain things that it can do - for example query planning or text matching in SQL, support for statements, or direct handling of Squish-type queries - and let the query engine know about these features. We would have to define a suitably flexible interface for this, which would allow more sophisticated RDF databases to be plugged into Java Squish.
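
One speculative shape for such a feature-discovery interface; none of these method names exist in Squish, they only illustrate the idea:

 interface RdfStoreCapabilities {
     boolean canPlanQueries();       // store can plan multi-clause queries itself
     boolean canMatchTextInSql();    // '?object ~ term' can be pushed down to SQL
     boolean hasStatementTable();    // reified statements are stored natively
     boolean canRunSquishDirectly(); // store accepts whole Squish queries
 }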

Another way forward is to redesign the schema used for the data to use a less complex format. Work in the Harmony project suggests that it might be possible to model events such that reification is not necessary; however, this work has not progressed sufficiently to be used in this context.

Conclusions

The complexity of the RDF approach in XML can be off-putting, and storage questions have yet to be answered; however, the ABC RDF model is a good way of clearly disambiguating aspects of the complex information you may want to store in a resource discovery context. RDF as XML can be very verbose, especially with respect to reification, but the reification of statements is currently the only well-defined way to state that X said 'Y' in a nodes-and-arcs context. The strategy used is workable, although not currently scalable with this software. A similar approach would also be suitable for annotations.

The main advantage that RDF gives us, apart from modelling precision, is the flexibility to change or add to our model without altering the structure of the underlying database. This is especially useful in the IMesh DB case, because one can envisage different sorts of connections between individuals and organisations than those described in the very simple structure defined here. One might also wish to add further information about the people and the organisations.

References

[1] http://www.ilrt.bris.ac.uk/discovery/2000/09/imesh/
ImeshTk: Subject Gateway Review Literature Review

[2] http://www.jiscmail.ac.uk/lists/imesh.html
IMesh mailing list

[3] http://www.google.com
Google

[4] http://www.sosig.ac.uk/ (Sosig)
http://www.sosig.ac.uk/about_us/user_support.html (training materials)
http://www.sosig.ac.uk/desire/internet-detective.html (Internet Detective)

[5] http://www.imesh.org/toolkit/
IMesh toolkit

[6] http://ilrt.org/discovery/harmony/
Harmony project

[7] http://swordfish.rdfweb.org/rdfquery/
Squish

[8] http://rdfwebring.org/2000/08/why/
RDFWeb Intro

[9] for example:
http://lists.w3.org/Archives/Public/www-rdf-interest/1999Nov/0042.html

[10] paper on Squish