Weblog Recommender service using Machine Learning and Semantic Web technologies

Simon Price, 2003-02-06

This project aims to produce an experimental Web Service that, when given a reference to a Weblog or news feed, uses Machine Learning and Semantic Web technologies to return a list of similar Weblogs or news feeds.

Weblogs (Blogs) are personal publishing Web sites that syndicate their content for inclusion in other sites using a family of XML-based file formats known as RSS. Weblogs themselves frequently include links to content syndicated from other Weblogs, and it is becoming common for organisations to use RSS as a way of circulating news about themselves and their business. RSS version 1.0 supports richly expressive metadata in the form of RDF, a key component of the emerging Semantic Web.
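To make the RSS 1.0 format concrete, the following is a minimal sketch of reading a channel's title and item links using only the XML parser bundled with Java. The class name RssChannelSketch, the SAMPLE document and its example.org URLs are invented for illustration. The sketch treats the feed as plain namespace-aware XML; a project following this proposal would more likely load the same data as RDF through a toolkit such as Jena or Redland.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of any toolkit named in this proposal.
class RssChannelSketch {
    // Namespace URI defined by the RSS 1.0 specification.
    static final String RSS_NS = "http://purl.org/rss/1.0/";

    // Made-up RSS 1.0 document used only to exercise the sketch.
    static final String SAMPLE =
        "<?xml version=\"1.0\"?>"
        + "<rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\""
        + " xmlns=\"http://purl.org/rss/1.0/\">"
        + "<channel rdf:about=\"http://example.org/rss\">"
        + "<title>Example Weblog</title>"
        + "<link>http://example.org/</link>"
        + "<description>A made-up channel</description>"
        + "<items><rdf:Seq>"
        + "<rdf:li rdf:resource=\"http://example.org/post1\"/>"
        + "<rdf:li rdf:resource=\"http://example.org/post2\"/>"
        + "</rdf:Seq></items>"
        + "</channel>"
        + "<item rdf:about=\"http://example.org/post1\">"
        + "<title>First post</title><link>http://example.org/post1</link></item>"
        + "<item rdf:about=\"http://example.org/post2\">"
        + "<title>Second post</title><link>http://example.org/post2</link></item>"
        + "</rdf:RDF>";

    // Parse an XML string into a namespace-aware DOM tree.
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
            return factory.newDocumentBuilder().parse(new ByteArrayInputStream(bytes));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse RSS document", e);
        }
    }

    // Return the title of the first channel element, or null if there is none.
    static String channelTitle(Document doc) {
        NodeList channels = doc.getElementsByTagNameNS(RSS_NS, "channel");
        if (channels.getLength() == 0) {
            return null;
        }
        Element channel = (Element) channels.item(0);
        NodeList titles = channel.getElementsByTagNameNS(RSS_NS, "title");
        return titles.getLength() == 0 ? null : titles.item(0).getTextContent().trim();
    }

    // Collect the link of every item element, in document order.
    static List<String> itemLinks(Document doc) {
        List<String> links = new ArrayList<>();
        NodeList items = doc.getElementsByTagNameNS(RSS_NS, "item");
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            NodeList link = item.getElementsByTagNameNS(RSS_NS, "link");
            if (link.getLength() > 0) {
                links.add(link.item(0).getTextContent().trim());
            }
        }
        return links;
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE);
        System.out.println(channelTitle(doc));
        System.out.println(itemLinks(doc));
    }
}
```

In RSS 1.0 the item elements are siblings of the channel rather than children of it, which is why the sketch searches the whole document for items but only the channel element for the channel title.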

In the first instance the student will develop a minimalist web crawler application in Java to follow HTTP links from Weblog to Weblog and compile a database of RSS channels to be used as test data. The next step will be to develop a second Java application to retrieve RSS data as RDF from each of the channels in the database and use appropriate Machine Learning methods to cluster and classify the channels. This initial processing will be run in batch mode although, if time permits, an incremental approach could also be investigated. Either way, the student will then need to develop this program into a Java Servlet that accepts queries consisting of one or more references to Weblogs and returns a list of similar references as links in a simple Web page. The final stage of the project will be to replace the Web page interface to the recommender with a Simple Object Access Protocol (SOAP) based interface, turning the system into a basic Web Service.
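As a rough illustration of what "similar" could mean in the recommendation step, here is a minimal sketch that ranks channels by cosine similarity over word counts. This is a simple stand-in, not the clustering and classification the project would carry out with a toolkit such as Weka; the class name ChannelSimilaritySketch, the method names and the sample channel text are all invented for illustration, and the channel text is assumed to have been extracted from each feed already.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: pairwise cosine similarity, not the project's final method.
class ChannelSimilaritySketch {

    // Count how often each lower-cased alphanumeric token occurs in the text.
    static Map<String, Integer> termCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Euclidean length of a term-count vector.
    static double norm(Map<String, Integer> vector) {
        double sum = 0;
        for (int count : vector.values()) {
            sum += (double) count * count;
        }
        return Math.sqrt(sum);
    }

    // Cosine of the angle between two term-count vectors (0 if either is empty).
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0;
        for (Map.Entry<String, Integer> entry : a.entrySet()) {
            Integer other = b.get(entry.getKey());
            if (other != null) {
                dot += (double) entry.getValue() * other;
            }
        }
        double na = norm(a);
        double nb = norm(b);
        return (na == 0 || nb == 0) ? 0 : dot / (na * nb);
    }

    // Return up to 'limit' channel URLs, most similar to the query channel first.
    static List<String> recommend(String queryUrl, Map<String, String> channelText, int limit) {
        String queryText = channelText.get(queryUrl);
        if (queryText == null) {
            throw new IllegalArgumentException("Unknown channel: " + queryUrl);
        }
        Map<String, Integer> query = termCounts(queryText);
        Map<String, Double> scores = new HashMap<>();
        List<String> candidates = new ArrayList<>();
        for (Map.Entry<String, String> entry : channelText.entrySet()) {
            if (!entry.getKey().equals(queryUrl)) {
                scores.put(entry.getKey(), cosine(query, termCounts(entry.getValue())));
                candidates.add(entry.getKey());
            }
        }
        candidates.sort((x, y) -> Double.compare(scores.get(y), scores.get(x)));
        return new ArrayList<>(candidates.subList(0, Math.min(limit, candidates.size())));
    }

    public static void main(String[] args) {
        // Made-up channel text keyed by made-up channel URLs.
        Map<String, String> channels = new HashMap<>();
        channels.put("http://example.org/a", "java servlet tomcat java");
        channels.put("http://example.org/b", "java servlet soap");
        channels.put("http://example.org/c", "gardening roses tulips");
        System.out.println(recommend("http://example.org/a", channels, 2));
    }
}
```

A Servlet or SOAP endpoint of the kind described above could wrap a method like recommend, with the batch stage responsible for building the map of channel text from the crawled database.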

Code written in the project will be in Java; the student will also need to learn to use a toolkit for processing RDF (for example Redland or Jena) and a Java Servlet container (for example Apache Tomcat). Some use of an existing machine learning toolkit will also be required (for example Weka or MLC++).

The student would be supervised by Simon Price (simon.price@bristol.ac.uk) at ILRT, in the Semantic Web group. Please contact him for an informal chat about the project if you are interested.

References

ILRT's Semantic Web group
http://www.ilrt.bris.ac.uk/projects/semantic_web

Semantic Web
http://www.w3.org/2001/sw/

Weblog examples (search on Google)
http://www.google.co.uk/search?hl=en&ie=ISO-8859-1&q=weblogs

RDF
http://www.w3.org/RDF/

RSS 1.0
http://web.resource.org/rss/1.0/

SOAP
http://www.w3.org/TR/SOAP/

Jena
http://www.hpl.hp.com/semweb/jena.htm

Redland
http://www.redland.opensource.ac.uk/

Weka
http://www.cs.waikato.ac.nz/ml/weka/index.html

MLC++
http://www.sgi.com/Technology/mlc

Tomcat
http://jakarta.apache.org/