MSAI Thesis

The overview below was originally written for my thesis committee and a few others in the late 90s. Thus it uses a more academic voice than I usually use on the rest of this site. I say this just so you don't think I'm too weird. ;)

You can download my thesis and the corresponding code (links below). Be warned, the code may not be the prettiest in the world. I was in a huge hurry and my committee was more concerned with the thesis itself (and thus I had to be), rather than clean code. Plus it was my first large project in Python.

The Code: protospider.tar.gz
[This code is in the public domain]

Eikenberry, John. ProtoSpider: Using Semantic Networks for Keyword Analysis and Document Classification. (the thesis)


What is it

For the person who deals with the internet as an important information resource, filtering through the vast amounts of information available can be a difficult task. With over 260 million web pages, 50,000 newsgroups, plus mailing lists, it is nearly impossible to wade through it all to find the pertinent resources.

To filter through this vast amount of information, the filter needs to know something about the topic/subject that the user is interested in. Currently, two models of doing this are common. There is the 'dumb' search engine, which does simple keyword searches, sometimes with help from boolean operators. Then there are systems that seem to hit the other extreme, with complex NLP and full-dictionary, a priori semantic networks.

This is a system that falls between these two extremes. It adapts to the user's changing needs by learning about their interests. It is based on a database of keywords, their relations, and the relevance of these to the topic. The database is modeled after semantic networks, capturing the meaning of the words in relation to the topic by capturing the meaningfulness of their syntactic context. This semantic network database serves as a profile of the user's interests in the topic at hand.

How does it work

The system analyzes the information source (a web page in the demo), grabs all the relevant keywords and their relations, then compares them to those stored in the profile. The page is rated based on how well the analysis of the document matches the profile. Ratings range from -1 to 1, reflecting the positive or negative scores of the keywords/relations in the profile. A negative score means the keyword/relation runs counter to the topic (it indicates that the document isn't relevant to the topic), while a positive score means the opposite.
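
As a rough illustration of the rating step, here is a minimal sketch in Python. It is not the ProtoSpider code. The profile contents, the idea that a relation is a pair of adjacent words, and the choice to average the matched scores are all assumptions made for the example; the names extract and rate are hypothetical.

```python
import re

# Hypothetical profile: scores in [-1, 1] for keywords and keyword pairs.
PROFILE_KEYWORDS = {"python": 0.8, "agent": 0.6, "recipe": -0.7}
PROFILE_RELATIONS = {("python", "agent"): 0.9, ("python", "recipe"): -0.5}


def extract(text):
    """Return (words, relations) found in text.

    Assumption: a relation is simply a pair of adjacent words.
    """
    words = re.findall(r"[a-z]+", text.lower())
    relations = list(zip(words, words[1:]))
    return words, relations


def rate(text, keywords=PROFILE_KEYWORDS, relations=PROFILE_RELATIONS):
    """Rate text against the profile; the result stays within [-1, 1]."""
    words, pairs = extract(text)
    # Collect the profile score of every keyword and relation that matched.
    scores = [keywords[w] for w in words if w in keywords]
    scores += [relations[p] for p in pairs if p in relations]
    if not scores:
        return 0.0  # nothing matched: treat the page as neutral
    # Assumption: the rating is the mean of the matched scores.
    return sum(scores) / len(scores)
```

Averaging keeps the rating inside the -1 to 1 range as long as every score in the profile is inside it. A page that matches nothing in the profile comes out neutral (0.0) in this sketch.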

The user then provides feedback to the system so that it may adjust the profile to better match the user's interest in the subject matter.
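
One simple way such a feedback step could work is sketched below. This is only an assumption for illustration, not the update rule from the thesis: each keyword or relation found in the page has its score nudged toward the user's feedback value. The function name apply_feedback and the rate parameter are hypothetical.

```python
def apply_feedback(profile, found_items, feedback, rate=0.2):
    """Move the score of each found item toward the user's feedback.

    profile     -- dict mapping a keyword or relation to a score in [-1, 1]
    found_items -- keywords/relations that appeared in the rated page
    feedback    -- the user's judgment of the page, in [-1, 1]
    rate        -- hypothetical learning rate in [0, 1]
    """
    if not -1.0 <= feedback <= 1.0:
        raise ValueError("feedback must be in [-1, 1]")
    for item in found_items:
        old = profile.get(item, 0.0)  # unseen items start out neutral
        # Step part of the way from the old score to the feedback value,
        # so scores never leave the [-1, 1] range.
        profile[item] = old + rate * (feedback - old)
    return profile
```

For example, with a rate of 0.5, negative feedback of -1.0 on a page containing "python" (score 0.5) and the previously unseen "snake" moves those scores to -0.25 and -0.5.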

Another important aspect/contribution of this system is how it is designed. It is made up entirely of agents that work together toward the goal. Each agent handles a specific task. Some examples from the demo: one agent only parses out the keywords and their relations, another maintains the semantic network database (updating the profiles), and yet another is responsible for displaying the results of the rating process to the user. (See the paper for a breakdown of the agents.)

Why it is important

New ways of exploiting the vast amount of information available online are necessary. The currently most popular systems fail on some major points. The most common system today, the search engine, is much too simple and lacks any understanding of the topic. At the other extreme are some fairly new systems coming out, but these are grossly expensive and suffer from their size and complexity. They are not that good at adapting, since they try to know everything a priori, and they aren't very extensible or flexible.

This system is an attempt to hit a spot between these two extremes. By building on the simple idea behind the search engines, keywords and their relations, it avoids the complexity and expense of the larger systems while still understanding the topic well enough to filter out the good, relevant information from the junk.

The agent-based design is also important because it encourages reuse of the agents in ways unlike other programming metaphors. Each agent is treated as its own self-contained service provider with no specific API besides the communications protocol that all the agents share. This serves to create a system that is very flexible (existing agents are self-contained and easily modified) and extensible (adding new agents is easy).
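
To make the "no API besides the shared protocol" idea concrete, here is a small sketch in Python. It is not taken from the thesis code. The message format (a dict with "to" and "body" keys), the class names, and the run loop are all assumptions invented for the example.

```python
from collections import deque


class Agent:
    """Base class: the only thing an agent exposes is handle(message)."""

    name = "agent"

    def handle(self, message):
        # Return a list of new messages to send; the default sends nothing.
        return []


class ParserAgent(Agent):
    """Hypothetical agent that splits a page body into lowercase words."""

    name = "parser"

    def handle(self, message):
        words = message["body"].lower().split()
        return [{"to": "display", "body": words}]


class DisplayAgent(Agent):
    """Hypothetical agent that records what it would show the user."""

    name = "display"

    def __init__(self):
        self.shown = []

    def handle(self, message):
        self.shown.append(message["body"])
        return []


def run(agents, first_message):
    """Deliver messages by agent name until no messages remain.

    A message addressed to an unregistered name raises KeyError.
    """
    registry = {agent.name: agent for agent in agents}
    queue = deque([first_message])
    while queue:
        message = queue.popleft()
        queue.extend(registry[message["to"]].handle(message))
```

Since agents only ever see messages, swapping in a different parser or adding a new agent means registering another object under a name. None of the existing agents need to change.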