Portal Proxy

This component provides an XML/RDF interface to many major search services, built on the WWW::Search Perl module. Please read the disclaimer below before using it!

Difficulty: not rocket science


We don't escape the returned data carefully enough to be sure that the output will always be well-formed XML.

Screen-scraping applications are inherently unstable, since they rely on fragile accidents of HTML layout to extract machine-processable data. The WWW::Search Perl module must be periodically reinstalled to update the plugins that 'know' each engine's current HTML layout.

There is no common query language: we send simple words and phrases to each site; how that site interprets them is not specified in this interface.
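To illustrate the first caveat above (escaping), here is a minimal sketch in Python of how scraped strings could be escaped before being written into XML. It is not the code used by this demo, which is written in Perl; the function name `result_to_xml` and the `item`, `link` and `title` names are assumptions made up for the example.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Control characters that XML 1.0 does not allow at all
# (everything below 0x20 except tab, newline and carriage return).
_ILLEGAL_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def result_to_xml(title, url):
    """Render one scraped hit as an XML fragment (hypothetical element names)."""
    clean_title = _ILLEGAL_XML_CHARS.sub("", title)
    clean_url = _ILLEGAL_XML_CHARS.sub("", url)
    # escape() handles &, < and > in element content;
    # quoteattr() escapes the value and adds the surrounding quotes.
    return "<item link=%s><title>%s</title></item>" % (
        quoteattr(clean_url),
        escape(clean_title),
    )


# A title and URL of the kind a search engine might return.
fragment = result_to_xml("Tom & Jerry <live>\x01", "http://example.org/?a=1&b=2")
parsed = ET.fromstring(fragment)  # raises ParseError if not well formed
```

Parsing the fragment back, as in the last line, is a cheap way to check that a given result really is well formed before it is sent to a client.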



See Also: Sherch

We've also been working on Sherch!, a set of Perl tools that lets us replace the WWW::Search backend component of this demo with parameterised screen scraping, i.e. we read a text file that tells us which bits of the HTML to throw away and which to keep. There are various ways of doing this; we're working on a common representation in RDF, but for now have implemented a parser/extractor based on the Apple Sherlock plugin format.
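The idea of parameterised screen scraping can be sketched as follows, in Python rather than Perl. This is only an illustration under stated assumptions: the `key=value` rule format and the rule names `list_start`, `list_end`, `item_start` and `item_end` are invented for the example, and are neither the Sherlock plugin syntax nor the Sherch implementation.

```python
def parse_rules(text):
    """Read 'key=value' lines into a dict, skipping blanks and # comments."""
    rules = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, so values may contain '=' themselves.
        key, _, value = line.partition("=")
        rules[key.strip()] = value.strip()
    return rules


def extract_items(html, rules):
    """Keep only the text between item markers inside the result-list region."""
    start = html.find(rules["list_start"])
    if start == -1:
        return []
    start += len(rules["list_start"])
    end = html.find(rules["list_end"], start)
    if end == -1:
        return []
    region = html[start:end]  # everything outside this region is thrown away

    items, pos = [], 0
    while True:
        i = region.find(rules["item_start"], pos)
        if i == -1:
            break
        i += len(rules["item_start"])
        j = region.find(rules["item_end"], i)
        if j == -1:
            break
        items.append(region[i:j].strip())
        pos = j + len(rules["item_end"])
    return items


# Hypothetical rule file for one search service.
RULES_TEXT = """
# which bits of the page to keep
list_start=<ul id="hits">
list_end=</ul>
item_start=<li>
item_end=</li>
"""

PAGE = ('<html><body><h1>Results</h1><ul id="hits">'
        "<li>First hit</li><li>Second hit</li></ul><p>footer</p></body></html>")

hits = extract_items(PAGE, parse_rules(RULES_TEXT))
```

Because the markers live in a text file rather than in code, supporting a new engine (or a changed layout) means editing the rules, not reinstalling a module.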

An early version of Sherch is now available. Current functionality: it can learn about new search services by extracting data from Sherlock plugins (you can give it a URL for a new search engine), and it presents a simple parallel-search interface for searching multiple Sherch-interfaced services at once.
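A parallel-search front end of the kind described above might look roughly like this in Python. It is a sketch only, not how Sherch itself works: `search_all` and the three stand-in engines are hypothetical, and each engine is assumed to be a function that takes a query string and returns a list of hits.

```python
from concurrent.futures import ThreadPoolExecutor


def search_all(query, services, timeout=10.0):
    """Send one query to every service concurrently.

    `services` maps a service name to a callable taking the query string.
    Returns (results, errors), both keyed by service name.
    """
    results, errors = {}, {}
    if not services:
        return results, errors
    with ThreadPoolExecutor(max_workers=len(services)) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in services.items()}
        for name, future in futures.items():
            try:
                # `timeout` limits how long we wait for this result; a worker
                # that is still running is not cancelled by it.
                results[name] = future.result(timeout=timeout)
            except Exception as exc:  # timeouts and scraper failures alike
                errors[name] = str(exc)
    return results, errors


# Stand-in engines; real ones would fetch and scrape a results page.
def fake_engine_a(query):
    return ["A: first hit for " + query, "A: second hit for " + query]


def fake_engine_b(query):
    return ["B: only hit for " + query]


def broken_engine(query):
    raise RuntimeError("layout changed, scraper out of date")


results, errors = search_all(
    "metadata",
    {"a": fake_engine_a, "b": fake_engine_b, "broken": broken_engine},
)
```

Collecting failures separately means one engine whose HTML layout has changed does not prevent the other services from returning results.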