RightNow Logo

RightNow provides a strategic solution to drive superior customer experiences
...while dramatically reducing costs.

Customer Experience Report

Colloquium Papers

Neal Richter - May 1, 2008 - Web Search Ranking

Mining the Search Trails of Surfing Crowds: Identifying Relevant Websites From User Activity
Mikhail Bilenko and Ryen W. White
Microsoft Research

The paper proposes identifying relevant information sources from the history of combined searching and browsing behavior of many Web users. While it has been previously shown that user interactions with search engines can be employed to improve document ranking, browsing behavior that occurs beyond search result pages has been largely overlooked in prior work. The paper demonstrates that users’ post-search browsing activity strongly reflects implicit endorsement of visited pages, which allows estimating topical relevance of Web resources by mining large-scale datasets of search trails. We present heuristic and probabilistic algorithms that rely on such datasets for suggesting authoritative websites for search queries. Experimental evaluation shows that exploiting complete post-search browsing trails outperforms alternatives in isolation (e.g., clickthrough logs), and yields accuracy improvements when employed as a feature in learning to rank for Web search.

Mike Harrelson - April 24, 2008 - Neural networks

Confabulation theory
Dr. Robert Hecht-Nielsen
University of California, San Diego

Confabulation theory offers a comprehensive detailed explanation of the mechanism of thought (i.e., “cognition”: vision, hearing, reasoning, language, planning, origination of movement and thought processes, etc.) in humans and other vertebrates (and possibly in invertebrates, such as octopi and bees). For expositional simplicity, only the human case is considered here.

Steve Durbin - April 17, 2008 - Prediction Markets

Using Prediction Markets to Track Information Flows: Evidence from Google
Bo Cowgill*, Justin Wolfers**, and Eric Zitzewitz***
*Google; **Wharton School, Univ. of Pennsylvania; ***Dartmouth College

In the last 2.5 years, Google has conducted the largest corporate experiment with prediction markets we are aware of. In this paper, we illustrate how markets can be used to study how an organization processes information. We document a number of biases in Google’s markets, most notably an optimistic bias. Newly hired employees are on the optimistic side of these markets, and optimistic biases are significantly more pronounced on days when Google stock is appreciating. We find strong correlations in trading for those who sit within a few feet of one another; social networks and work relationships also play a secondary explanatory role. The results are interesting in light of recent research on the role of optimism in entrepreneurial firms, as well as recent work on the importance of geographical and social proximity in explaining information flows in firms and markets.

Neal Richter - April 10, 2008 - Artificial Intelligence

AI Chasers
Patrick Tucker
World Future Society

The “holy grail” of computer science may be within reach. A futurist looks toward tomorrow’s Artificial Intelligence Revolution.

The article was based on these interviews:
MIT roboticist Rodney Brooks, Adaptive A.I. Inc. founder Peter Voss, Self-Aware Systems founder Steve Omohundro, Powerset CEO Barney Pell, and Google research director Peter Norvig discuss how they see AI developing in the years ahead, when a human-level AI might emerge, and how worried we should be about that whole killer-robot-goes-on-rampage scenario.

Neal Richter - April 3, 2008 - Pattern Mining

Clustering Co-occurrences of Maximal Frequent Patterns in Streams
Edgar H. de Graaf, Joost N. Kok, Walter A. Kosters
Leiden Institute of Advanced Computer Science, Leiden University

One way of getting a better view of data is by using frequent patterns. In this paper frequent patterns are (sub)sets that occur a minimal number of times in a stream of itemsets. However, the discovery of frequent patterns in streams has always been problematic. Because streams are potentially endless it is in principle impossible to say if a pattern is often occurring or not. Furthermore, the number of patterns can be huge and a good overview of the structure of the stream is lost quickly. The proposed approach will use clustering to facilitate the “online” analysis of the structure of the stream.

A clustering on the co-occurrence of patterns will give the user an improved view on the structure of the stream. Some patterns might occur so often together that they should form a combined pattern. In this way the patterns in the clustering will approximate the largest frequent patterns: maximal frequent patterns. The number of (approximated) maximal frequent patterns is much smaller and combined with clustering methods these patterns provide a good view on the structure of the stream.

Our approach to decide if patterns occur often together is based on a method of clustering where only the distance between pairs of patterns is known. This distance is the Euclidean distance between points in a 2-dimensional space, where the points represent the frequent patterns, or rather the most important ones. The coordinates are adapted when the records from the stream pass by, and reflect the real support of the corresponding pattern. In this setup the support is viewed as the number of occurrences in a time window. The main algorithm tries to maintain a dynamic model of the data stream by merging and splitting these patterns. Experiments show the versatility of the method.

Richard McAllister - March 27, 2008 - Pattern Mining

Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach
Jiawei Han*, Jian Pei**, Yiwei Yin***, Runying Mao****
*Univ. of Illinois, **SUNY, ***Simon Fraser Univ., ****Microsoft

Mining frequent patterns in transaction databases, time-series databases, and many other kinds of databases has been studied popularly in data mining research. Most of the previous studies adopt an Apriori-like candidate set generation-and-test approach. However, candidate set generation is still costly, especially when there exist a large number of patterns and/or long patterns.

In this study, we propose a novel frequent-pattern tree (FP-tree) structure, which is an extended prefix-tree structure for storing compressed, crucial information about frequent patterns, and develop an efficient FP-treebased mining method, FP-growth, for mining the complete set of frequent patterns by pattern fragment growth. Efficiency of mining is achieved with three techniques: (1) a large database is compressed into a condensed, smaller data structure, FP-tree which avoids costly, repeated database scans, (2) our FP-tree-based mining adopts a pattern-fragment growth method to avoid the costly generation of a large number of candidate sets, and (3) a partitioning-based, divide-and-conquer method is used to decompose the mining task into a set of smaller tasks for mining confined patterns in conditional databases, which dramatically reduces the search space. Our performance study shows that the FP-growth method is efficient and scalable for mining both long and short frequent patterns, and is about an order of magnitude faster than the Apriori algorithm and also faster than some recently reported new frequent-pattern mining methods.

Steve Durbin - March 6, 2008 - Data commons

Urban Sensing: Out of the Woods
Dana Cuff, Mark Hansen, and Jerry Kang

Embedded networked sensing, having successfully shifted from the lab to the environment, is primed for a more contentious move to the city to where citizens will likely be the target of data collection. This transition will warrant careful study and touch on issues that go far beyond the scientific realm.

Neal Richter - February 28, 2008 - Contextual Advertising

Finding Advertising Keywords on Web Pages
Wentau Yih,* Joshua Goodman,* Vitor R. Carvalho**
*Microsoft Research, **Carnegie Mellon University

A large and growing number of web pages display contextual advertising based on keywords automatically extracted from the text of the page, and this is a substantial source of revenue supporting the web today. Despite the importance of this area, little formal, published research exists. We describe a system that learns how to extract keywords from web pages for advertisement targeting. The system uses a number of features, such as term frequency of each potential keyword, inverse document frequency, presence in meta-data, and how often the term occurs in search query logs. The system is trained with a set of example pages that have been hand-labeled with “relevant” keywords. Based on this training, it can then extract new keywords from previously unseen pages. Accuracy is substantially better than several baseline systems.

Richard McAllister - February 21, 2008 - Clustering

Hierarchical Document Clustering Using Frequent Itemsets
Benjamin C.M. Fung, Ke Wangy, and Martin Esterz
Simon Fraser University

A major challenge in document clustering is the extremely high dimensionality. For example, the vocabulary for a document set can easily be thousands of words. On the other hand, each document often contains a small fraction of words in the vocabulary. These features require special handlings. Another requirement is hierarchical clustering where clustered documents can be browsed according to the increasing specificity of topics. In this paper, we propose to use the notion of frequent itemsets, which comes from association rule mining, for document clustering. The intuition of our clustering criterion is that each cluster is identified by some common words, called frequent itemsets, for the documents in the cluster. Frequent itemsets are also used to produce a hierarchical topic tree for clusters. By focusing on frequent items, the dimensionality of the document set is drastically reduced. We show that this method outperforms best existing methods in terms of both clustering accuracy and scalability.

Neal Richter - February 14, 2008 - Search Advertising

Optimal Delivery of Sponsored Search Advertisements Subject to Budget Constraints
Zoe Abrams, Ofer Mendelevitch, John A. Tomlin

We discuss an auction framework in which sponsored search advertisements are delivered in response to queries. In practice, the presence of bidder budgets can have a significant impact on the ad delivery process. We propose an approach based on linear programming which takes bidder budgets into account, and uses them in conjunction with forecasting of query frequencies, and pricing and ranking schemes, to optimize ad delivery. Simulations show significant improvements in revenue and efficiency.

Steve Durbin - February 7, 2008 - Web economics

Three topics from The Technium, a book in progress
Kevin Kelly

How much does one search cost?

The Value of Search

Better Than Free

Steve Durbin - January 31, 2008 - Search

Trusted Search Communities
Peter Briggs and Barry Smyth
Adaptive Information Cluster, School of Computer Science and Informatics, University College Dublin

We describe a social search technique that harnesses the search experiences of a community of searchers to generate result recommendations, in a collaborative fashion, to complement results returned from some underlying search engine. We describe a dynamic model of trust as a way to coordinate collaboration, and provide experimental results to show that search performance improves as the network evolves.

Steve Durbin - December 6, 2007 - Design

Creativity Support Tools: Accelerating Discovery and Innovation
Ben Shneiderman
Dept. of Computer Science, Univ. of Maryland, College Park

How can designers of programming interfaces, interactive tools, and rich social environments enable more people to be more creative more often?

Anthony Arnone - November 29, 2007 - Search

SOLR and Lucene

This week we’ll be covering the search engine Lucene and one of its main sub-projects, Solr. I’ll be giving an overview of the concepts behind these two packages, and how you can use them to easily add expansive search capability to just about any Java environment you want.

Lucene is a search engine written in Java that provides high scalability, fast searching, robust handling of generic document structures, and a highly pluggable architecture that makes adding features easy. Lucene has a very active development community and is consistently on the ‘cutting edge’ of search engine technologies.

Solr is a servlet-based front end for the Lucene search engine. It provides XML-based configuration and index schema management, APIs in XML/HTTP, master-slave style replication, caching, and a web based admin interface. It also includes a bunch of trendy features for modern search engines such as faceting, more-like-this, and JSON responses, along with a host of other cool features under development.

Anthony's Powerpoint presentation

Steve Durbin - October 25, 2007 - Ontologies

Deriving a Large Scale Taxonomy from Wikipedia
Simone Paolo Ponzetto and Michael Strube
EML Research, Germany

We take the category system in Wikipedia as a conceptual network. We label the semantic relations between categories using methods based on connectivity in the network and lexico-syntactic matching. As a result we are able to derive a large scale taxonomy containing a large amount of subsumption, i.e. isa, relations. We evaluate the quality of the created resource by comparing it with ResearchCyc, one of the largest manually annotated ontologies, as well as computing semantic similarity between words in benchmarking datasets.

Steve Durbin - October 18, 2007 - Text similarity

Improving Similarity Measures for Short Segments of Text
Wen-tau Yih and Christopher Meek
Microsoft Research

In this paper we improve previous work on measuring the similarity of short segments of text in two ways. First, we introduce a Web-relevance similarity measure and demonstrate its effectiveness. This measure extends the Web-kernel similarity function introduced by Sahami and Heilman (2006) by using relevance weighted inner-product of term occurrences rather than TF×IDF. Second, we show that one can further improve the accuracy of similarity measures by using a machine learning approach. Our methods outperform other state-of-the-art methods in a general query suggestion task for multiple evaluation metrics.

Neal Richter and Steve Durbin - October 4, 2007 - Information Retrieval

Search Engines that Learn from Implicit Feedback
Thorsten Joachims and Filip Radlinski
Cornell University

Search-engine logs provide a wealth of information that machine-learning techniques can harness to improve search quality. With proper interpretations that avoid inherent biases, a search engine can use training data extracted from the logs to automatically tailor ranking functions to a particular user group or collection.

Neal Richter - September 27, 2007 - Information Retrieval

Learning to Rank for Information Retrieval Using Genetic Programming
Jen-Yuan Yeh*, Jung-Yi Lin*, Hao-Ren Ke**, Wei-Pang Yang***
*Dept. of Computer Science, National Chiao Tung University
**Inst. of Information Management, National Chiao Tung University
***Dept. of Information Management, National Dong Hwa University

One central problem of information retrieval (IR) is to determine which documents are relevant and which are not to the user information need. This problem is practically handled by a ranking function which defines an ordering among documents according to their degree of relevance to the user query. This paper discusses work on using machine learning to automatically generate an effective ranking function for IR. This task is referred to as "learning to rank for IR" in the field. In this paper, a learning method, RankGP, is presented to address this task. RankGP employs genetic programming to learn a ranking function by combining various types of evidences in IR, including content features, structure features, and query-independent features. The proposed method is evaluated using the LETOR benchmark datasets and found to be competitive with Ranking SVM and RankBoost.

Steve Durbin - September 20, 2007 - Knowledge base organization

Collaborative Structuring: Organizing Document Repositories Effectively and Efficiently
Harris Wu* and Michael Gordon**
*Old Dominion University; **University of Michigan

Improving how documents are managed and data is stored and exchanged in systems.

Steve Durbin - September 13, 2007 - Chatbots

A Terrorism Question/Answer System
R. P. Schumaker,* Y. Liu,** M. Ginsburg,*** and H. Chen***
*IS Dept, Iona College; **IS Dept., Cal State Long Beach; ***Dept. of Managmement Information Systems, U. of Arizona

The TARA Project examined how a trio of modified chatterbots could be used to disseminate terrorism-related information to the general public.

Ben Werner - August 23, 2007 - HCI

Tangible Bits: Towards Seamless Interfaces between People, Bits and Atoms
Hiroshi Ishii and Brygg Ullmer, MIT Media Lab

This paper presents our vision of Human Computer Interaction (HCI): "Tangible Bits." Tangible Bits allows users to "grasp & manipulate" bits in the center of users’ attention by coupling the bits with everyday physical objects and architectural surfaces. Tangible Bits also enables users to be aware of background bits at the periphery of human perception using ambient display media such as light, sound, airflow, and water movement in an augmented space. The goal of Tangible Bits is to bridge the gaps between both cyberspace and the physical environment, as well as the foreground and background of human activities.
This paper describes three key concepts of Tangible Bits: interactive surfaces; the coupling of bits with graspable physical objects; and ambient media for background awareness. We illustrate these concepts with three prototype systems – the metaDESK, transBOARD and ambientROOM – to identify underlying research issues.

Steve Durbin - August 2, 2007 - Search

How Well does Result Relevance Predict Session Satisfaction?
Scott B. Huffman, Michael Hochster, Google, Inc.

Per-query relevance measures provide standardized, repeatable measurements of search result quality, but they ignore much of what users actually experience in a full search session. This paper examines how well we can approximate a user’s ultimate session-level satisfaction using a simple relevance metric. We find that this relationship is surprisingly strong. By incorporating additional properties of the query itself, we construct a model which predicts user satisfaction even more accurately than relevance alone.

Ben Werner - July 19, 2007 - Usability

Examining the Robustness of Sensor-Based Statistical Models of Human Interruptibility
James Fogarty, Scott E. Hudson, HCI Institute, Carnegie-Mellon
Jennifer Lai, IBM T.J. Watson Research Center

Current systems often create socially awkward interruptions or unduly demand attention because they have no way of knowing if a person is busy and should not be interrupted. Previous work has examined the feasibility of using sensors and statistical models to estimate human interruptibility in an office environment, but left open some questions about the robustness of such an approach. This paper examines several dimensions of robustness in sensor-based statistical models of human interruptibility. We show that real sensors can be constructed with sufficient accuracy to drive the predictive models. We also create statistical models for a much broader group of people than was studied in prior work. Finally, we examine the effects of training data quantity on the accuracy of these models and consider tradeoffs associated with different combinations of sensors. As a whole, our analyses demonstrate that sensor-based statistical models of human interruptibility can provide robust estimates for a variety of office workers in a range of circumstances, and can do so with accuracy as good as or better than people. Integrating these models into systems could support a variety of advances in human computer interaction and computer-mediated communication.

Neal Richter - June 21, 2007 - Search Ranking

Google Keeps Tweaking Its Search Engine
Saul Hansell
New York Times

Article on the Search Quality group at Google

See also the keynote presentation at SIGIR 2005: Challenges in Running a Commercial Search Engine by Amit Singhal of Google

Neal Richter - June 7, 2007 - Search Engines

Why writing your own search engine is hard
Anna Patterson
Stanford University

There must be 4,000 programmers typing away in their basements trying to build the next “world’s most scalable” search engine. It has been done only a few times. It has never been done by a big group; always one to four people did the core work, and the big team came on to build the elaborations and the production infrastructure. Why is it so hard? We are going to delve a bit into the various issues to consider when writing a search engine. This article is aimed at those individuals or small groups that are considering this endeavor for their Web site or intranet. It is fun, but a word of caution: not only is it difficult, but you need two commodities in short supply—time and patience.

Zuzana Gedeon - May 17, 2007 - Search

Multi-factor Clustering for a Marketplace Search Interface
Neel Sundaresan, Kavita Ganesan and Roopnath Grandhi
eBay Research Labs

Search engines provide a small window to the vast repository of data they index and against which they search. They try their best to return the documents that are of relevance to the user but often a large number of results may be returned. Users struggle to manage this vast result set looking for the items of interest. Clustering search results is one way of alleviating this navigational pain. In this paper we describe a clustering system that enables clustering search results in an online marketplace search system.

Adaptive Faceted Browser for Navigation in Open Information Spaces
Michal Tvarožek and Mária Bieliková
Faculty of Informatics and Information Technologies, Slovak University of Technology

Open information spaces have several unique characteristics such as their changeability, large size, complexity and diverse user base. These result in novel challenges during user navigation, information retrieval and data visualization in open information spaces. We propose a method of navigation in open information spaces based on an enhanced faceted browser with support for dynamic facet generation and adaptation based on user characteristics.

Zuzana Gedeon, Hasari Tosun, Steve Durbin - April 12, 2007 - WWW 2007

Navigating the Intranet with High Precision
Huaiyu Zhu, Alexander Loser, Sriram Raghavan, and Shivakumar Vaithyanathan
IBM Almaden Research Center and SAP Research CEC

Despite the success of web search engines, search over large enterprise intranets still suffers from poor result quality. Earlier work that compared intranets and the Internet from the view point of keyword search has pointed to several reasons why the search problem is quite different in these two domains. In this paper, we address the problem of providing high quality answers to navigational queries in the intranet (e.g., queries intended to find product or personal home pages, service pages, etc.). Our approach is based on offline identification of navigational pages, intelligent generation of termvariants to associate with each page, and the construction of separate indices exclusively devoted to answering navigational queries. Using a testbed of 5.5M pages from the IBM intranet, we present evaluation results that demonstrate that for navigational queries, our approach of using custom indices produces results of significantly higher precision than those produced by a general purpose search algorithm.

The Complex Dynamics of Collaborative Tagging
Harry Halpin, Valentin Robu, and Hana Shepherd
University of Edinburgh; Center for Math and CS (Amsterdam); Princeton University

The debate within the Web community over the optimal means by which to organize information often pits formalized classications against distributed collaborative tagging systems. A number of questions remain unanswered, however, regarding the nature of collaborative tagging systems including whether coherent categorization schemes can emerge from unsupervised tagging by users. This paper uses data from the social bookmarking site del.icio.us to examine the dynamics of collaborative tagging systems. In particular, we examine whether the distribution of the frequency of use of tags for "popular" sites with a long history (many tags and many users) can be described by a power law distribution, often characteristic of what are considered complex systems. We produce a generative model of collaborative tagging in order to understand the basic dynamics behind tagging, including how a power law distribution of tags could arise. We empirically examine the tagging history of sites in order to determine how this distribution arises over time and to determine the patterns prior to a stable distribution. Lastly, by focusing on the high-frequency tags of a site where the distribution of tags is a stabilized power law, we show how tag co-occurrence networks for a sample domain of tags can be used to analyze the meaning of particular tags given their relationship to other tags.

Identifying Ambiguous Queries in Web Search
Ruihua Song, Zhenxiao Luo, Ji-Rong Wen, Yong Yu, and Hsiao-Wuen Hon
Shanghai Jiao Tong University, Microsoft Research Asia, Fudan University

It is widely believed that some queries submitted to search engines are by nature ambiguous (e.g., java, apple). However, few studies have investigated the questions of "how many queries are ambiguous?" and "how can we automatically identify an ambiguous query?" This paper deals with these issues. First, we construct the taxonomy of query ambiguity, and ask human annotators to manually classify queries based upon it. From manually labeled results, we find that query ambiguity is to some extent predictable. We then use a supervised learning approach to automatically classify queries as being ambiguous or not. Experimental results show that we can correctly identify 87% of labeled queries. Finally, we estimate that about 16% of queries in a real search log are ambiguous.

Steve Durbin - April 5, 2007 - Personalization

Open User Profiles for Adaptive News Systems: Help or Harm?
Jae-wook Ahn, Peter Brusilovsky, Jonathan Grady, Daqing He and Sue Yeon Syn
School of Information Sciences, University of Pittsburgh

Over the last five years, a range of projects have focused on progressively more elaborated techniques for adaptive news delivery. However, the adaptation process in these systems has become more complicated and thus less transparent to the users. In this paper, we concentrate on the application of open user models in adding transparency and controllability to adaptive news systems. We present a personalized news system, YourNews, which allows users to view and edit their interest profiles, and report a user study on the system. Our results confirm that users prefer transparency and control in their systems, and generate more trust to such systems. However, similar to previous studies, our study demonstrate that this ability to edit user profiles may also harm the system’s performance and has to be used with caution.

Steve Durbin - March 29, 2007 - Reputation Systems

A Content-Driven Reputation System for the Wikipedia
B. Thomas Adler and Luca de Alfaro
University of California, Santa Cruz

We present a content-driven reputation system for Wikipedia authors. In our system, authors gain reputation when the edits they perform to Wikipedia articles are preserved by subsequent authors, and they lose reputation when their edits are rolled back or undone in short order. Thus, author reputation is computed solely on the basis of content evolution; user-to-user comments or ratings are not used. The author reputation we compute could be used to flag new contributions from low-reputation authors, or it could be used to allow only authors with high reputation to contribute to controversial or critical pages. A reputation system for the Wikipedia could also provide an incentive for high-quality contributions.
We have implemented the proposed system, and we have used it to analyze the entire Italian and French Wikipedias, consisting of a total of 691,551 pages and 5,587,523 revisions. Our results show that our notion of reputation has good predictive value: changes performed by low-reputation authors have a significantly larger than average probability of having poor quality, as judged by human observers, and of being later undone, as measured by our algorithms

Steve Durbin - March 22, 2007 - Collective Intelligence

Web pundits trumpet social software as a means of generating knowledge by tapping into the wisdom of crowds. The following two articles offer contrarian perspectives.

The "Dumbness of Crowds"
Kathy Sierra

DIGITAL MAOISM: The Hazards of the New Online Collectivism
Jaron Lanier

Neal Richter - March 15, 2007 - Semantic Web

AI Meets Web 2.0: Building the Web of Tomorrow, Today
Jay M. Tenenbaum

Imagine an Internet-scale knowledge system where people and intelligent agents can collaborate on solving complex problems in business, engineering, science, medicine, and other endeavors. Its resources include semantically tagged websites, wikis, and blogs, as well as social networks, vertical search engines, and a vast array of web services from business processes to AI planners and domain models. Research prototypes of decentralized knowledge systems have been demonstrated for years, but now, thanks to the web and Moore’s law, they appear ready for prime time. This article introduces the architectural concepts for incrementally growing an Internet-scale knowledge system and illustrates them with scenarios drawn from e-commerce, e-science, and e-life.

Steve Durbin - March 8, 2007 - Tag Mining

Ranking Bookmarks and Bistros: Intelligent Community and Folksonomy Development
Benjamin Szekely and Elias Torres
Harvard University

Online communities are an important forum for people to share information on the internet. Historically, online communities have been centered around free-text sharing and simple user rating systems. With the shadow of the Semantic Web looming behind many new innovations on the internet today, online communities are beginning to adopt semantic techniques such as tagging and folksonomies as a vehicle for the sharing and classification of information. Gourmetvillage.org is a prototype online community that uses tagging to provide an innovative restaurant rating mechanism. Users may provide free-text reviews and tags for any aspect of a restaurant. UserRank is an algorithm based on Google’s PageRank that provides a ranking of users based on whose taggings are most often followed. TagRank provides a ranking of tags based on the ranking of users. UserRank may in fact be used to rank any entity in a community based on user association. del.icio.us contains a significant dataset of users and tagged entities with which to test these algorithms. UserRank and TagRank yield different rankings than their count-based counterparts. When drilling down into the results, the top user according to UserRank only made a few taggings, yet hundreds of users in the system agreed with his taggings, showing that these algorithms capture community consensus fairly well.

Steve Durbin - March 1, 2007 - Tag Mining

Collaborative Creation of Communal Hierarchical Taxonomies in Social Tagging Systems
Paul Heyman and Hector Garcia-Molina
Computer Science Department, Stanford University

Collaborative tagging systems -- systems where many casual users annotate objects with free-form strings (tags) of their choosing -- have recently emerged as a powerful way to label and organize large collections of data. During our recent investigation into these types of systems, we discovered a simple but remarkably effective algorithm for converting a large corpus of tags annotating objects in a tagging system into a navigable hierarchical taxonomy of tags. We first discuss the algorithm and then present a preliminary model to explain why it is so effective in these types of systems.

Steve Durbin - February 15, 2007 - User interface

Clustering versus faceted categories for information exploration
Marti Hearst
School of Information Management and Systems
University of California, Berkeley

Information seekers often express a desire for a user interface that organizes search results into meaningful groups, in order to help make sense of the results, and to help decide what to do next. A longitudinal study in which participants were provided with the ability to group search results found they changed their search habits in response to having the grouping mechanism available [2].

There are many open research questions about how to generate useful groupings and how to design interfaces to support exploration using grouping. Currently two methods are quite popular: clustering and faceted categorization. Here, I describe both approaches and summarize their advantages and disadvantages based on the results of usability studies.

Steve Durbin - February 8, 2007 - User interface

Simplicity is Highly Overrated is a short essay on the notion of simplicity in user interfaces.

Cautious Cars and Cantankerous Kitchens: How Machines Take Control
Don Norman
Professor of Computer Science, Psychology, and Cognitive science
Northwestern University

The first chapter of The Design of Future Things, to be published in 2008. It‘s about the ever-increasing role of automation in our homes and automobiles, why it is being done so badly, with suggestions for doing it right through what I call “natural interaction.”

Steve Durbin - February 1, 2007 - Ambient AI

Machine Learning, Reasoning, and Intelligence in Daily Life: Directions and Challenges
Eric Horvitz
Microsoft Research

Technical developments and trends are providing a fertile substrate for creating and integrating machine learning and reasoning into multiple applications and services. I will review several illustrative research efforts on our team, and focus on challenges, opportunities, and directions with the streaming of machine intelligence into daily life.

Steve Durbin - November 30, 2006 - Natural Language Understanding

Machine Reading
Oren Etzioni, Michele Banko, Michael J. Cafarella
Computer Science & Engineering, University of Washington

The time is ripe for the AI community to set its sights on Machine Reading---the automatic, unsupervised understanding of text. In this paper, we place the notion of “Machine Reading” in context, describe progress towards this goal by the KnowItAll research group at the University of Washington, and highlight several central research questions.

Steve Durbin - November 2, 2006 - Social Networks

The Dynamics of Viral Marketing
Jure Leskovec, Carnegie Mellon
Lada A. Adamic, University of Michigan
Bernardo A. Huberman, HP Labs

We present an analysis of a person-to-person recommendation network, consisting of 4 million people who made 16 million recommendations on half a million products. We observe the propagation of recommendations and the cascade sizes, which we explain by a simple stochastic model. We analyze how user behavior varies within user communities defined by a recommendation network. Product purchases follow a 'long tail' where a significant share of purchases belongs to rarely sold items. We establish how the recommendation network grows over time and how effective it is from the viewpoint of the sender and receiver of the recommendations. While on average recommendations are not very effective at inducing purchases and do not spread very far, we present a model that successfully identfies communities, product and pricing categories for which viral marketing seems to be very effective.

Steve Durbin - October 26, 2006 - User Behavior

The origin of bursts and heavy tails in human dynamics
Albert-Laszlo Barabasi
Center for Complex Networks Research and Dept. of Physics, University of Notre Dame

The dynamics of many social, technological and economic phenomena are driven by individual human actions, turning the quantitative understanding of human behaviour into a central question of modern science. Current models of human dynamics, used from risk assessment to communications, assume that human actions are randomly distributed in time and thus letters to nature well approximated by Poisson processes. In contrast, there is increasing evidence that the timing of many human activities, ranging from communication to entertainment and work patterns, follow non-Poisson statistics, characterized by bursts of rapidly occurring events separated by long periods of inactivity. Here I show that the bursty nature of human behaviour is a consequence of a decision-based queuing process: when individuals execute tasks based on some perceived priority, the timing of the tasks will be heavy tailed, with most tasks being rapidly executed, whereas a few experience very long waiting times. In contrast, random or priority blind execution is well approximated by uniform inter-event statistics. These finding have important implications, ranging from resource management to service allocation, in both communications and retail.

Hasari Tosun - October 19, 2006 - Recommender systems

Unifying User-based and Item-based Collaborative Filtering Approaches by Similarity Fusion
Jun Wang, Arjen P. de Vries, and Marcel J. T. Reinders
Information and Communication Theory Group, Delft University of Technology

Memory-based methods for collaborative filtering predict new ratings by averaging (weighted) ratings between, respectively, pairs of similar users or items. In practice, a large number of ratings from similar users or similar items are not available, due to the sparsity inherent to rating data. Consequently, prediction quality can be poor. This paper reformulates the memory-based collaborative filtering problem in a generative probabilistic framework, treating individual user-item ratings as predictors of missing ratings. The final rating is estimated by fusing predictions from three sources: predictions based on ratings of the same item by other users, predictions based on different item ratings made by the same user, and, third, ratings predicted based on data from other but similar users rating other but similar items. Existing user-based and item-based approaches correspond to the two simple cases of our framework. The complete model is however more robust to data sparsity, because the different types of ratings are used in concert, while additional ratings from similar users towards similar items are employed as a background model to smooth the predictions. Experiments demonstrate that the proposed methods are indeed more robust against data sparsity and give better recommendations.

David Tolliver - October 12, 2006 - Clustering

Note special time (Thursday 3:10) and place (1-139 Wilson)

Spectral Approximation Algorithms: Clustering and Graph Partitioning
David Tolliver
Google Fellow and Department of Computer Science, Carnegie Mellon University

I'll provide an overview of recent advances in spectral clustering and partitioning technology. I'll briefly cover a fundamental result in clustering, and then delve into the basic spectral clustering algorithm. I'll walk through a few variations on the basic algorithm, highlighting the specific applications domains and drawbacks. Finally I'll cover our proof of an approximation bound for the Normalized Cut (an NP-hard objection function) for a simple spectral algorithm.

Steve Durbin - October 5, 2006 - Search

A Live-User Evaluation of Collaborative Web Search
Barry Smyth, Evelyn Balfe, Oisin Boydell, Keith Bradley, Peter Briggs, Maurice Coyle, Jill Freyne
Smart Media Institute, Department of Computer Science, University College Dublin

Collaborative Web search exploits repetition and regularity within the query-space of a community of like-minded individuals in order to improve the quality of search results. In short, search results that have been judged to be relevant for past queries are promoted in response to similar queries that occur in the future. In this paper we present the results of a large-scale evaluation of this approach, in a corporate Web search scenario, which shows that significant benefits are available to its users.

Steve Durbin - September 28, 2006 - Tagging

Visualizing Tags over Time
Micah Dubinko, Ravi Kumar, Joseph Magnani, Jasmine Novak, Prabhakar Raghavan, Andrew Tomkins
Yahoo! Research

We consider the problem of visualizing the evolution of tags within the Flickr (flickr.com) online image sharing community. Any user of the Flickr service may append a tag to any photo in the system. Over the past year, users have on average added over a million tags each week. Understanding the evolution of these tags over time is therefore a challenging task. We present a new approach based on a characterization of the most interesting tags associated with a sliding interval of time. An animation provided via Flash in a web browser allows the user to observe and interact with the interesting tags as they evolve over time.
New algorithms and data structures are required to support the efficient generation of this visualization. We combine a novel solution to an interval covering problem with extensions to previous work on score aggregation in order to create an efficient backend system capable of producing visualizations at arbitrary scales on this large dataset in real time.

Neal Richter - September 21, 2006 - Index Mining

Random Sampling from a Search Engine's Index Best Paper Winner at WWW-2006
Ziv Bar-Yossef, Technion - Israel Institute of Technology, Israel Maxim Gurevich, Department of Electrical Engineering, Technion, Israel

We revisit a problem introduced by Bharat and Broder al- most a decade ago: how to sample random pages from a search engine's index using only the search engine's public interface? Such a primitive is particularly useful in creating objective benchmarks for search engines. The technique of Bharat and Broder suffers from two well recorded biases: it favors long documents and highly ranked documents. In this paper we introduce two novel sam- pling techniques: a lexicon-based technique and a random walk technique. Our methods produce biased sample docu- ments, but each sample is accompanied by a corresponding ¿weight¿, which represents the probability of this document to be selected in the sample. The samples, in conjunction with the weights, are then used to simulate near-uniform samples. To this end, we resort to three well known Monte Carlo simulation methods: rejection sampling, importance sampling and the Metropolis-Hastings algorithm. We analyze our methods rigorously and prove that under plausible assumptions, our techniques are guaranteed to pro- duce near-uniform samples from the search engine's index. Experiments on a corpus of 2.4 million documents substanti- ate our analytical findings and show that our algorithms do not have significant bias towards long or highly ranked doc- uments. We use our algorithms to collect fresh data about the relative sizes of Google, MSN Search, and Yahoo!.

Neal Richter - September 14, 2006 - Click-stream Data and Queries

Time-Dependent Semantic Similarity Measure of Queries Using Historical Click-Through Data
Qiankun Zhao, Steven C. H. Hoi, Tie-Yan Liu, Sourav S Bhowmick, Michael R. Lyu, Wei-Ying Ma
Nanyang Technological University, The Chinese University of HK, Microsoft Research Asia

It has become a promising direction to measure similarity of Web search queries by mining the increasing amount of clickthrough data logged by Web search engines, which record the interactions between users and the search engines. Most existing approaches employ the click-through data for similarity measure of queries with little consideration of the temporal factor, while the click-through data is often dynamic and contains rich temporal information. In this paper we present a new framework of time-dependent query semantic similarity model on exploiting the temporal characteristics of historical click-through data. The intuition is that more accurate semantic similarity values between queries can be obtained by taking into account the timestamps of the log data. With a set of user-defined calendar schema and calendar patterns, our time-dependent query similarity model is constructed using the marginalized kernel technique, which can exploit both explicit similarity and implicit semantics from the click-through data effectively. Experimental results on a large set of click-through data acquired from a commercial search engine show that our time-dependent query similarity model is more accurate than the existing approaches. Moreover, we observe that our time-dependent query similarity model can, to some extent, reflect real-world semantics such as real-world events that are happening over time.

Steve Durbin - September 7, 2006 - Ranking the web

Beyond PageRank: machine learning for static ranking
Matthew Richardson, Amit Prakash, and Eric Brill
Microsoft Research, Inc.

Since the publication of Brin and Page's paper on PageRank, many in the Web community have depended on PageRank for the static (query-independent) ordering of Web pages. We show that we can significantly outperform PageRank using features that are independent of the link structure of the Web. We gain a further boost in accuracy by using data on the frequency at which users visit Web pages. We use RankNet, a ranking machine learning algorithm, to combine these and other static features based on anchor text and domain characteristics. The resulting model achieves a static ranking pairwise accuracy of 67.3% (vs. 56.7% for PageRank or 50% for random).

Steve Durbin - August 31, 2006 - Better answers

Retroactive Answering of Search Queries
Beverly Yang and Glen Jeh
Google, Inc.

Major search engines currently use the history of a user's actions (e.g., queries, clicks) to personalize search results. In this paper, we present a new personalized service, query-specific web recommendations (QSRs), that retroactively answers queries from a user's history as new results arise. The QSR system addresses two important subproblems with applications beyond the system itself: (1) Automatic identification of queries in a user's history that represent standing interests and unfulfilled needs. (2) Effective detection of interesting new results to these queries. We develop a variety of heuristics and algorithms to address these problems, and evaluate them through a study of Google history users. Our results strongly motivate the need for automatic detection of standing interests from a user's history, and identifies the algorithms that are most useful in doing so. Our results also identify the algorithms, some which are counter-intuitive, that are most useful in identifying interesting new results for past queries, allowing us to achieve very high precision over our data set.

Sam Gardner - April 27, 2006 - Neural networks

Self-Organization of Hierarchical Visual Maps with Feedback Connections
Yiu Fai Sit and Risto Miikkulainen
Department of Computer Sciences, University of Texas at Austin

Visual areas in primates are known to have reciprocal connections. While the feedforward bottom-up processing of visual information has been studied extensively for decades, little is known about the role of the feedback connections. Existing feedback models usually employ hand-coded connections, and do not address how these connections develop. The model described in this paper shows how feedforward and feedback connections between cortical areas V1 and V2 can be learned through self-organization simultaneously. Computational experiments show that both areas can form hierarchical representations of the input with reciprocal connections that link relevant cells in the two areas.

Mike Emery - April 20, 2006 - Self-organization

Decentralised Autonomic Computing: Analysing Self-Organising Emergent Behaviour using Advanced Numerical Methods
Tom De Wolf, Giovanni Samaey, Tom Holvoet and Dirk Roose
Department of Computer Science, KU Leuven, Belgium

When designing decentralised autonomic computing systems, a fundamental engineering issue is to assess systemwide behaviour. Such decentralised systems are characterised by the lack of global control, typically consist of autonomous cooperating entities, and often rely on self-organised emergent behaviour to achieve the requirements.
A well-founded and practically feasible approach to study overall system behaviour is a prerequisite for successful deployment. On one hand, formal proofs of correct behaviour and even predictions of the exact systemwide behaviour are practically infeasible due to the complex, dynamic, and often non-deterministic nature of self-organising emergent systems. On the other hand, simple simulations give no convincing arguments for guaranteeing system-wide properties. We describe an alternative approach that allows to analyse and assess trends in systemwide behaviour, based on so-called equation-free macroscopic analysis. This technique yields more reliable results about the system-wide behaviour, compared to mere observation of simulation results, at an affordable computational cost. Numerical algorithms act at the system-wide level and steer the simulations. This allows to limit the amount of simulations considerably.
We illustrate the approach by studying a particular system-wide property of a decentralised control system for Automated Guided Vehicles and we outline a road map towards a general methodology for studying decentralised autonomic computing systems.

Leif Wickland - April 13, 2006 - Web text mining

Deriving Marketing Intelligence from Online Discussion
Glance et al.
Intelliseek Applied Research Center

Weblogs and message boards provide online forums for discussion that record the voice of the public. Woven into this mass of discussion is a wide range of opinion and commentary about consumer products. This presents an opportunity for companies to understand and respond to the consumer by analyzing this unsolicited feedback. Given the volume, format and content of the data, the appropriate approach to understand this data is to use large-scale web and text data mining technologies. This paper argues that applications for mining large volumes of textual data for marketing intelligence should provide two key elements: a suite of powerful mining and visualization technologies and an interactive analysis environment which allows for rapid generation and testing of hypotheses. This paper presents such a system that gathers and annotates online discussion relating to consumer products using a wide variety of state-of-the-art techniques, including crawling, wrapping, search, text classification and computational linguistics. Marketing intelligence is derived through an interactive analysis framework uniquely configured to leverage the connectivity and content of annotated online discussion.

Neal Richter - April 6, 2006 - Social Networking

System of Mobile Agents to Model Social Networks
Gonzalez, Marta C., Lind, Pedro G. and Herrmann, Hans J.
Univ Stuttgart, Departamento de Fisica, Universidade Federal do Ceara

We propose a model of mobile agents to construct social networks, based on a system of moving particles by keeping track of the collisions during their permanence in the system. We reproduce not only the degree distribution, clustering coefficient and shortest path length of a large data base of empirical friendship networks recently collected, but also some features related with their community structure. The model is completely characterized by the collision rate and above a critical collision rate we find the emergence of a giant cluster in the universality class of two-dimensional percolation. Moreover, we propose possible schemes to reproduce other networks of particular social contacts, namely sexual contacts.

Bob Wall - March 30, 2006 - Models of Neural Processing

Computation in the Higher Visual Cortices: Map-Seeking Circuit Theory and Application to Machine Vision
David Arathorn
Center for Computational Biology, MSU, and General Intelligence Corporation

Map-Seeking Circuit theory is a biologically based computational theory of vision applicable to difficult machine vision problems such as recognition of 3D objects in arbitrary poses amid distractors and clutter, as well as to non-recognition problems such as terrain interpretation. It provides a general computational mechanism for tractable discovery of correspondences in massive transformation spaces by exploiting an ordering property of superpositions. The latter allows a set of transformations of an input image to be formed into a sequence of superpositions which are then culled to a composition of single mappings by a competitive process which matches each superposition against a superposition of inverse transformations of memory patterns. The architecture that performs this is based on a number of neuroanatomical features of the visual cortices, including reciprocal dataflows and inverse mappings.

Steve Durbin - March 9, 2006 - Spelling correction

Learning a Spelling Error Model from Search Query Logs
Farooq Ahmad and Grzegorz Kondrak
University of Alberta

Applying the noisy channel model to search query spelling correction requires an error model and a language model. Typically, the error model relies on a weighted string edit distance measure. The weights can be learned from pairs of misspelled words and their corrections. This paper investigates using the Expectation Maximization algorithm to learn edit distance weights directly from search query logs, without relying on a corpus of paired words.

Dave Kortenkamp - March 3, 2006 - Robotics

Note special time (Friday 3:10) and place, 101 Roberts

Worlds to Explore: Autonomy Challenges for Human Space Flight

This talk will look at two distinct artificial intelligence challenges in human space flight. The first challenge can be compared to building HAL 9000 from the movie 2001: A Space Odyssey, that is, autonomous monitoring and control of long duration mission functions. The second challenge can be compared to building Lt. Cmdr. Data from Star Trek: The Next Generation, that is, autonomous control of highly dexterous robots. Both challenges will be highlighted by showing research being done at NASA Johnson Space Center. This research includes autonomous control of advanced life support system tests and control of a humanoid robot called Robonaut. The talk will include many videos of NASA's robotics research.

Dave Kortenkamp's Home Page

Steve Durbin - February 23, 2005 - Language evolution

The "Lexical Contract": Modeling the emergence of word pronunciations

Even in the absence of writing, human languages appear to develop and maintain vocabularies of roughly 10,000 morphemes, and at least 100,000 words or phrases whose meaning is not predictable from their constituent parts. This presents a formidable task for the individual language learner, who must learn words from a small number of uses in context, and must manage to remember such a large inventory of arbitrary pronunciations. On a different scale of space and time, this also presents an extraordinary challenge for the speech community, which must must somehow form and maintain a consensus about these tens of thousands of complex and arbitrary conventions. This is accomplished despite the fact that the conventional knowledge at stake is tacit: the individuals involved typically don't know what they know about word pronunciation, and don't try to discuss the details of this knowledge. ... We will try to find simple programs for interacting agents that lead to successful emergence of patterns of word pronunciation. As in the case of "ant colony optimization", we'll look for algorithms that can accomplish this without requiring any designation of authorities or any explicit negotiation to reach consensus.

Zuzana Gedeon - February 16, 2005 - Text clustering

Topic-Driven Clustering for Document Datasets
Ying Zhao and George Karypis
Dept. of Computer Science and Engineering, Univ. of Minnesota

In this paper, we define the problem of topic-drive clustering, which organizes a document collection according to a set of topics. We propose three topic-driven schemes that consider the similarity between documents and topics and the relationship among documents themselves simultaneously. We present a comprehensive experimental evaluation of the proposed topic-driven schemes on five datasets. Our experimental results show that the proposed topic-driven schemes are efficient and effective with topic prototypes of different levels of specificity.

Steve Durbin - February 9, 2005 - Language Evolution

Language evolution: consensus and controversies
Morten H. Christiansen* and Simon Kirby**
*Cornell University and **Edinburgh University

Why is language the way it is? How did language come to be this way? And why is our species alone in having complex language? These are old unsolved questions that have seen a renaissance in the dramatic recent growth in research being published on the origins and evolution of human language. This review provides a broad overview of some of the important current work in this area. We highlight new methodologies (such as computational modeling), emerging points of consensus (such as the importance of pre-adaptation), and the major remaining controversies (such as gestural origins of language). We also discuss why language evolution is such a difficult problem, and suggest probable directions research may take in the near future.

Neal Richter - February 2, 2005 - Text Mining with Genetic Algorithms

A Semantically Guided and Domain-Independent Evolutionary Model for Knowledge Discovery From Texts
John Atkinson-Abutridy, Chris Mellish, and Stuart Aitken

We present a novel evolutionary model for knowledge discovery from texts (KDTs), which deals with issues concerning shallow text representation and processing for mining purposes in an integrated way. Its aims is to look for novel and interesting explanatory knowledge across text documents. The approach uses natural language technology and genetic algorithms to produce explanatory novel hypotheses. The proposed approach is interdisciplinary, involving concepts not only from evolutionary algorithms but also from many kinds of text mining methods. Accordingly, new kinds of genetic operations suitable for text mining are proposed. The principles behind the representation and a new proposal for using multiobjective evaluation at the semantic level are described. Some promising results and their assessment by human experts are also discussed which indicate the plausibility of the model for effective KDT.

Steve Durbin - January 26, 2005 - Bayesian Modelling

Bayesian models of human action understanding
Chris L. Baker, Joshua B. Tenenbaum and Rebecca R. Saxe
Dept. of Brain and Cognitive Sciences, MIT

We present a Bayesian framework for explaining how people reason about and predict the actions of an intentional agent, based on observing its behavior. Action-understanding is cast as a problem of inverting a probabilistic generative model, which assumes that agents tend to act rationally in order to achieve their goals given the constraints of their environment. Working in a simple sprite-world domain, we show how this model can be used to infer the goal of an agent and predict how the agent will act in novel situations or when environmental constraints change. The model provides a qualitative account of several kinds of inferences that preverbal infants have been shown to perform, and also fits quantitative predictions that adult observers make in a new experiment.

Reggie Mead - January 19, 2005 - Modelling

A Review of AI Techniques for Ecological Modelling
Reginald Mead
Dept. of Computer Science, MSU and RightNow Technologies

Environmental models are used to both further scientific understanding of environmental processes and to aid in environmental decision making. Artificial intelligence techniques can enhance the approaches that have traditionally been followed for modelling. Generally, ecological modelling approaches can be classified as either process-based, empirical-based, or a combination of the two. Process-based models use an understanding of underlying sub-processes to model a system. Expert systems can be constructed from expert knowledge and can be used for prediction, planning, interpretation, and more. Process-based models have high explanatory depth, but also tend to have low predictive power. Empirical-based models explicitly exclude any a priori knowledge of the system, and instead model the system based on observation data alone. ...

Rick Sojda - December 8, 2005 - Modelling

An Artificial Intelligence Modelling Approach to Simulating Animal/Habitat Interactions
H. Saarenmaa l, N.D. Stone 2,4, L.J. Folse 3, J.M. Packard 3, W.E. Grant 3, M.E. Makela 2 and R.N. Coulson 2
1 The Finnish Forest Research Institute
Departments of 2 Entomology, and 3 Wildlife and Fisheries Science, Texas A&M University

Ecological modellers have begun to recognize the potential of object-oriented programming techniques in structuring models. However, little has been done to take advantage of artificial intelligence's (AI) symbolic representations to model the decision-making processes of animals. Here, a generic model of animal-habitat interaction and a specific model of moose-, Alces alces L., forest interactions in Finland are described that are event-driven and behavior-based. Individual level simulation is accomplished through an object-oriented knowledge representation scheme and AI techniques to implement a hierarchical decision-making model of behavior. The habitat is likewise represented in an object-oriented scheme, allowing the simulation of a heterogeneous environment. Other AI techniques for modelling behavior, memory, and actions are discussed including LISP methods, rule-based reasoning, and several search algorithms. Simulations of the moose-forest system show the power of this approach but are not intended to advance the theory of large-herbivore behavior and foraging. AI techniques are found to be most beneficial in (a) studying population processes based on individual level models of behavior, (b) modelling spatial heterogeneity, (c) building event-driven models, (d) providing a conceptual clarity to model construction, and (e) providing a structure equally well suited to simulating resource management.

Neal Richter - December 1, 2005 - Implicit Feedback

A Study of Factors Affecting the Utility of Implicit Relevance Feedback
Ryen W. White,* Ian Ruthven,** and Joemon M. Jose^
*Human-Computer Interaction Laboratory, University of Maryland
**Dept. of Computer and Information Sciences, University of Strathclyde
^Dept of Computing Science, University of Glasgow

Implicit relevance feedback (IRF) is the process by which a search system unobtrusively gathers evidence on searcher interests from their interaction with the system. IRF is a new method of gathering information on user interest and, if IRF is to be used in operational IR systems, it is important to establish when it performs well and when it performs poorly. In this paper we investigate how the use and effectiveness of IRF is affected by three factors: search task complexity, the search experience of the user and the stage in the search. Our findings suggest that all three of these factors contribute to the utility of IRF.

Accurately Interpreting Clickthrough Data as Implicit Feedback
Thorsten Joachims,* Laura Granka,** Bing Pan,^ Helene Hembrooke,^ and Geri Gay^
*Dept. of Computer Science, Cornell University
**Dept. of Communication, Stanford University
^Information Science, Cornell University

This paper examines the reliability of implicit feedback generated from clickthrough data in WWW search. Analyzing the users' decision process using eyetracking and comparing implicit feedback against manual relevance judgments, we conclude that clicks are informative but biased. While this makes the interpretation of clicks as absolute relevance judgments difficult, we show that relative preferences derived from clicks are reasonably accurate on average.

Bob Wall - November 17, 2005 - Clustering for Information Retrieval

Creating Concept Hierarchies in an Information Retrieval System
Bob Wall,*^ Neal Richter,*^ and Rafal A. Angryk^
*RightNow Technologies
^Dept. of Computer Science, Montana State University

Most information retrieval systems are comprised of a focused set of domain-specific documents located within a single logical repository. A mechanism is developed by which user queries against such a system are used to generate a concept hierarchy pertinent to the domain. First, an algorithm is described which extracts terms from documents matching user queries, and then reduces this set of terms to a manageable length. The resulting terms are used to generate a feature vector for each query, and the queries are clustered using a Hierarchical Agglomerative Clustering (HAC) algorithm. The HAC algorithm generates a binary tree of clusters, which is not particularly amenable to use by humans and which is slow to search due to its depth, so a subsequent processing step applies min-max partitioning to form a shallower, bushier tree that is a more natural representation of the concept hierarchy inherent in the system.

Steve Durbin - November 10, 2005 - Web computing

Learning from THE WEB (Word doc)
Learning from THE WEB (ACM Queue online)
Adam Bosworth

The Web has much to teach us about managing and modeling distributed data. It's time we began listening.

Steve Durbin - November 3, 2005 - Web search

Automatic Identification of User Goals in Web Search
Uichin Lee, Zhenyu Liu, and Junghoo Cho
Dept. of Computer Science, University of California, Los Angeles

There have been recent interests in studying the "goal" behind a user's Web query, so that this goal can be used to improve the quality of a search engine's results. Previous studies have mainly focused on using manual query-log investigation to identify Web query goals. In this paper we study whether and how we can automate this goal-identification process. We first present our results from a human subject study that strongly indicate the feasibility of automatic query-goal identification. We then propose two types of features for the goal-identification task: user-click behavior and anchor-link distribution. Our experimental evaluation shows that by combining these features we can correctly identify the goals for 90% of the queries studied.

Steve Durbin - October 27, 2005 - Visualization

Arc Diagrams: Visualizing Structure in Strings
Martin Wattenberg
IBM Research

This paper introduces a new visualization method, the arc diagram , which is capable of representing complex patterns of repetition in string data. Arc diagrams improve over previous methods such as dotplots because they scale efficiently for strings that contain many instances of the same subsequence. This paper describes design and implementation issues related to arc diagrams and shows how they may be applied to visualize such diverse data as music, text, and compiled code.

Leif Wickland - October 20, 2005 - Google File System

Google File System
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung

We have designed and implemented the Google File System, a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients.
While sharing many of the same goals as previous distributed file systems, our design has been driven by observations of our application workloads and technological environment, both current and anticipated, that reflect a marked departure from some earlier file system assumptions. This has led us to reexamine traditional choices and explore radically different design points.
The file system has successfully met our storage needs. It is widely deployed within Google as the storage platform for the generation and processing of data used by our service as well as research and development efforts that require large data sets. The largest cluster to date provides hundreds of terabytes of storage across thousands of disks on over a thousand machines, and it is concurrently accessed by hundreds of clients.
In this paper, we present file system interface extensions designed to support distributed applications, discuss many aspects of our design, and report measurements from both micro-benchmarks and real world use.

Zuzana Gedeon - October 13, 2005 - Machine learning and speech recognition for evil purposes

Keyboard Acoustic Emanations Revisited
Li Zhuang, Feng Zhou, and J. D. Tygar
University of California, Berkeley

We examine the problem of keyboard acoustic emanations. We present a novel attack taking as input a 10-minute sound recording of a user typing English text using a keyboard, and then recovering up to 96% of typed characters. There is no need for a labeled training recording. Moreover the recognizer bootstrapped this way can even recognize random text such as passwords: In our experiments, 90% of 5-character random passwords using only letters can be generated in fewer than 20 attempts by an adversary; 80% of 10-character passwords can be generated in fewer than 75 attempts. Our attack uses the statistical constraints of the underlying content, English language, to reconstruct text from sound recordings without any labeled training data. The attack uses a combination of standard machine learning and speech recognition techniques, including cepstrum features, Hidden Markov Models, linear classification, and feedback-based incremental learning.

Anthony Arnone - September 29, 2005 - Web Crawling

User-Centric Web Crawling
Sandeep Pandey and Christopher Olston
Carnegie Mellon University

Search engines are the primary gateways of information access on the Web today. Behind the scenes, search engines crawl the Web to populate a local indexed repository of Web pages, used to answer user search queries. In an aggregate sense, the Web is very dynamic, causing any repository of Web pages to become out of date over time, which in turn causes query answer quality to degrade. Given the considerable size, dynamicity, and degree of autonomy of the Web as a whole, it is not feasible for a search engine to maintain its repository exactly synchronized with the Web.
In this paper we study how to schedule Web pages for selective (re)downloading into a search engine repository. The scheduling objective is to maximize the quality of the user experience for those who query the search engine. We begin with a quantitative characterization of the way in which the discrepancy between the content of the repository and the current content of the live Web impacts the quality of the user experience. This characterization leads to a usercentric metric of the quality of a search engine's local repository. We use this metric to derive a policy for scheduling Web page (re)downloading that is driven by search engine usage and free of exterior tuning parameters. We then focus on the important subproblem of scheduling refreshing of Web pages already present in the repository, and show how to compute the priorities efficiently. We provide extensive empirical comparisons of our user-centric method against prior Web page refresh strategies, using real Web data. Our results demonstrate that our method requires far fewer resources to maintain same search engine quality level for users, leaving substantially more resources available for incorporating new Web pages into the search repository.

Neal Richter - September 22, 2005 - Language learning

Unsupervised Learning of Natural Languages
Zach Solan, David Horn, Eytan Ruppin and Shimon Edelman,
Tel Aviv University, and Cornell University

We address the problem, fundamental to linguistics, bioinformatics, and certain other disciplines, of using corpora of raw symbolic sequential data to infer underlying rules that govern their production. Given a corpus of strings (such as text, transcribed speech, chromosome or protein sequence data, sheet music, etc.), our unsupervised algorithm recursively distills from it hierarchically structured patterns. The ADIOS (automatic distillation of structure) algorithm relies on a statistical method for pattern extraction and on structured generalization, two processes that have been implicated in language acquisition. It has been evaluated on artificial context-free grammars with thousands of rules, on natural languages as diverse as English and Chinese, and on protein data correlating sequence with function. This unsupervised algorithm is capable of learning complex syntax, generating grammatical novel sentences, and proving useful in other fields that call for structure discovery from raw data, such as bioinformatics.

Steve Durbin - September 15, 2005 - Web text mining

The Predictive Power of Online Chatter
Daniel Gruhl, R. Guha*, Ravi Kumar, Jasmine Novak and Andrew Tomkins
IBM Almaden Research, *Google

An increasing fraction of the global discourse is migrating online in the form of blogs, bulletin boards, web pages, wikis, editorials, and a dizzying array of new collaborative technologies. The migration has now proceeded to the point that topics reflecting certain individual products are sufficiently popular to allow targeted online tracking of the ebb and flow of chatter around these topics. Based on an analysis of around half a million sales rank values for 2,340 books over a period of four months, and correlating postings in blogs, media, and web pages, we are able to draw several interesting conclusions. First, carefully hand-crafted queries produce matching postings whose volume predicts sales ranks. Second, these queries can be automatically generated in many cases. And third, even though sales rank motion might be difficult to predict in general, algorithmic predictors can use online postings to successfully predict spikes in sales rank.

Steve Durbin - September 8, 2005 - Robot swarms

Dynamic Task Assignment in Robot Swarms
James McLurkin and Daniel Yamins*
Computer Science and Artificial Intelligence Lab, MIT, *Dept. of Mathematics, Harvard University

A large group of robots will often be partitioned into subgroups, each subgroup performing a different task. This paper presents four distributed algorithms for assigning swarms of homogenous robots to subgroups to meet a specified global task distribution. Algorithm Random-Choice selects tasks randomly, but runs in constant time. Algorithm Extreme-Comm compiles a complete inventory of all the robots on every robot, runs quickly, but uses a great deal of communication. The CardDealer's algorithm assigns tasks to individual robots sequentially, using minimal communications but a great deal of time. The Tree-Recolor algorithm is a compromise between Extreme-Comm and Card-Dealer's, balancing communications use and running time. The three deterministic algorithms drive the system towards the desired assignment of subtasks with high accuracy. We implement the algorithms on a group of 25 iRobot SwarmBots, and collect and analyze performance data.

Photos and video of robot swarm behaviors

Neal Richter - August 18, 2005 - Michael Collins's Talk on NLP at the University of Washington.

Natural Language Processing Natural language processing offers a rich problem domain for machine learning approaches. Many NLP problems require the induction of a mapping that involves complex, discrete structures such as strings, labeled sequences, or trees. In this distinguished lecture, Michael Collins describes how 'large margin' methods in machine learning can be generalized to 'structured' problems found in NLP.

Speaker(s): Michael Collins, Massachusetts Institute of Technology
Subject: Natural Language processing
Series: CSE Colloquia - 2005
Produced by: University of Washington, December 7, 2005
Runtime: 00:58:16

Streaming Video at UWTV.org

Steve Durbin - August 4, 2005 - Personal information management

The Haystack Project is investigating approaches designed to let people manage their information in ways that make the most sense to them. Learn more at Haystack Project

The Re:Search Engine: Helping People Return to Information on the Web
Jaime Teevan

Re-finding information is commonly cited as a problem on the Web. One reason re-finding on the Web is difficult is that while people rely on a considerable amount of context to return to information (e.g., the original path taken to it), the Web makes no guarantee that the context will remain static. The Re:Search Engine is designed to help people return to information in the dynamic environment of the Web by maintaining consistency in the search results it returns across time. For example, if Connie, while looking to purchase a Global Positioning System, found several systems she liked via a search for GPS , she would expect to be able to use the same query to locate the exact same systems again. However, simply returning the original result list when she re-issues the query might omit newly available GPS systems that she would like to see. The ideal result list would contain both the systems Connie remembers having seen and high quality new systems. Because people tend to remember little of what is presented in a result list, when a person repeats a query, the Re:Search Engine can preserve what is remembered about the original result set while still presenting new information.

Neal Richter - May 19th and 26th, 2005 - Steven Wolfram's Talk at University of Michigan.

On October 8, 2003 Stephen Wolfram gave a talk sponsored by University of Michigan Center for Study of Complex Systems titled and about his (then new) book A New Kind of Science.
Wolfram.wmv - Windows Media Player Movie File

The foundational observation of "A New Kind of Science" is that very simple systems are very frequently capable of great complexity. While isolated examples of this have been known for millenia, no fundamental significance was attached to them.

The surprising fact is that if one just starts enumerating simple computational systems, one very quickly encounters great complexity. This is robust fact that does not depend on the details of the system in question, but rather seems to be a generic law of our universe.

The fact that complexity is in fact rather easy to generate suggests an entirely new approach to science. The traditional path of science is to seek out examples of behavior in the natural world and try to break them down into understandable pieces. But an alternative path is to just systematically enumerate simple abstract computational systems, starting with the very simplest, and try to understand what they do.

Links about the Book (and controversy surrounding it):

Steve Durbin - April 7, 2005 - Recommender systems

TiVo: Making Show Recommendations Using a Distributed Collaborative Filtering Architecture
Kamal Ali,* Wijnand van Stam**
*TiVo, Yahoo!; **TiVo

We describe the TiVo television show collaborative recommendation system which has been fielded in over one million TiVo clients for four years. Over this install base, TiVo currently has approximately 100 million ratings by users over approximately 30,000 distinct TV shows and movies. TiVo uses an item-item (show to show) form of collaborative filtering which obviates the need to keep any persistent memory of each user viewing preferences at the TiVo server. Taking advantage of TiVo client-server architecture has produced a novel collaborative filtering system in which the server does a minimum of work and most work is delegated to the numerous clients. Nevertheless, the server-side processing is also highly scalable and parallelizable. Although we have not performed formal empirical evaluations of its accuracy, internal studies have shown its recommendations to be useful even for multiple user households. TiVo architecture also allows for throttling of the server so if more server-side resources become available, more correlations can be computed on the server allowing TiVo to make recommendations for niche audiences.

Steve Durbin - March 31, 2005 - Subjectivity analysis

Learning Subjective Nouns using Extraction Pattern Bootstrapping
Ellen Riloff,* Janyce Wiebe,** Theresa Wilson**
*University of Utah, **University of Pittsburgh

We explore the idea of creating a subjectivity classifier that uses lists of subjective nouns learned by bootstrapping algorithms. The goal of our research is to develop a system that can distinguish subjective sentences from objective sentences. First, we use two bootstrapping algorithms that exploit extraction patterns to learn sets of subjective nouns. Then we train a Naive Bayes classifier using the subjective nouns, discourse features, and subjectivity clues identified in prior research. The boot-strapping algorithms learned over 1000 subjective nouns, and the subjectivity classifier performed well, achieving 77% recall with 81% precision.

Steve Durbin - March 24, 2005 - Social Information Foraging

Footprints: History-Rich Tools for Information Foraging
Alan Wexelblat and Pattie Maes
MIT Media Lab

Inspired by Hill and Hollan's original work, we have been developing a theory of interaction history and building tools to apply this theory to navigation in a complex information space. We have built a series of tools--map, paths, annotations and signposts--based on a physical-world navigation metaphor. These tools have been in use for over a year. Our user study involved a controlled browse task and showed that users were able to get the same amount of work done with significantly less effort.

Ken Dyke - March 10, 2005 - Computer Security & AI

Anomaly Detection of Web-based Attacks
By Christopher Kruegel & Giovanni Vigna
Reliable Software Group University of California, Santa Barbara

Web-based vulnerabilities represent a substantial portion of the security exposures of computer networks. In order to detect known web-based attacks, misuse detection systems are equipped with a large number of signatures. Unfortunately, it is difficult to keep up with the daily disclosure of web-related vulnerabilities, and, in addition, vulnerabilities may be introduced by installation-specific web-based applications. Therefore, misuse detection systems should be complemented with anomaly detection systems. This paper presents an intrusion detection system that uses a number of different anomaly detection techniques to detect attacks against web servers and web-based applications. The system correlates the serverside programs referenced by client queries with the parameters contained in these queries. The application-specific characteristics of the parameters allow the system to perform focused analysis and produce a reduced number of false positives. The system derives automatically the parameter profiles associated with web applications (e.g., length and structure of parameters) from the analyzed data. Therefore, it can be deployed in very different application environments without having to perform time-consuming tuning and configuration.

Neal Richter - March 3, 2005 - Search Engine Security

The Insecure Indexing Vulnerability Attacks Against Local Search Engines
By Amit Klein
Security Researcher

This paper describes several techniques (many of them new) for exposing file contents using the site search functionality. It is assumed that a site contains documents which are not visible/accessible to external users. Such documents are typically future PR items, or future security advisories, uploaded to the website beforehand. However, the site is also searchable via an internal search facility, which does have access to those documents, and as such, they are indexed by it not via web crawling, but rather, via direct access to the files (and therein lies the security breach).

Several attack techniques are described, some very simple and quick, while other require an enormous amount of traffic; not all attacks are relevant to a particular site, as they depend on the richness of syntax supported by the site's search engine.

The paper concludes with methods to detect insecure indexing vulnerability and suggested solutions.

Note that this attack is fundamentally different than exploitation of external (remote) search engines.

Desktop search engines threaten SSL VPN security
By Author Tim Greene

Discusses how new Desktop Search tools from Google and others are a security threat.

Matrix of Desktop Search Engine Indexing Capabilities

Steve Durbin - February 24, 2005 - Knowledge-based search

Exploiting a Thesaurus-Based Semantic Net for Knowledge-Based Search
Peter Clark, John Thompson, Heather Holmback, Lisbeth Duncan
Mathematics and Computing Technology, The Boeing Company

With the growth of on-line information, the need for better resource location services is growing rapidly. A popular goal is to conduct search in terms of concepts, rather than words; however, this approach is frequently thwarted by the high up-front cost of building an adequate ontology (conceptual vocabulary) in the first place. In this paper we describe a knowledge-based Expert Locator application (for identifying human experts relevant to a particular problem or interest), which addresses this issue by using a large, pre-built, technical thesaurus as an initial ontology, combined with simple AI techniques of search, subsumption computation, and language processing. The application has been deployed and in use in our local organization since June, 1999, and a second, larger application was deployed in March 2000. We present the Expert Locator and the AI techniques it uses, and then we evaluate and discuss the application. The significance of this work is that it demonstrates how years of work by library science in thesaurus-building can be leveraged using AI methods, to construct a practical resource location service in a short period of time.

Neal Richter and Steve Durbin - February 17, 2005 - Knowledge discovery from the Web

Automatic Meaning Discovery Using Google
Rudi Cilibrasi and Paul Vitanyi
National Research Institute for Mathematics and Computer Science, Netherlands

We have found a method to automatically extract the meaning of words and phrases from the world-wide-web using Google page counts. The approach is novel in its unrestricted problem domain, simplicity of implementation, and manifestly ontological underpinnings. The world-wide-web is the largest database on earth, and the latent semantic context information entered by millions of independent users averages out to provide automatic meaning of useful quality. We demonstrate positive correlations, evidencing an underlying semantic structure, in both numerical symbol notations and number-name words in a variety of natural languages and contexts. Next, we demonstrate the ability to distinguish between colors and numbers, and to distinguish between 17th century Dutch painters; the ability to understand electrical terms, religious terms, emergency incidents, and we conduct a massive experiment in understanding WordNet categories; the ability to do a simple automatic English-Spanish translation.

Doug Warner and Steve Durbin - February 10, 2005 - Search systems with swarm optimization

User Behavior in an Adaptive Search System
Doug Warner, Stephen D. Durbin, J. Neal Richter, Z. Gedeon
RightNow Technologies

We analyze usage of a search system employing both standard search techniques and Ant Colony Optimization (ACO) approaches. Searching behavior on the system is similar to that on Internet search engines, though low searching rates suggest facilitation due to the ACO methods. Simulations based on site and user models give insight into the adaptive behavior and roughly match observations.

Zuzana Gedeon - February 3, 2005 - Learning ordering

Learning to Order Things
William W. Cohen, Robert E. Schapire, and Yoram Singer
AT&T Labs

There are many applications in which it is desirable to order rather than classify instances. Here we consider the problem of learning how to order, given feedback in the form of preference judgments, i.e., statements to the effect that one instance should be ranked ahead of another. We outline a two-stage approach in which one first learns by conventional means a preference function, of the form PREF(u,v), which indicates whether it is advisable to rank u before v. New instances are then ordered so as to maximize agreements with the learned preference function. We show that the problem of finding the ordering that agrees best with a preference function is NP-complete, even under very restrictive assumptions. Nevertheless, we describe a simple greedy algorithm that is guaranteed to find a good approximation. We then discuss an on-line learning algorithm, based on the "Hedge" algorithm, for finding a good linear combination of ranking "experts." We use the ordering algorithm combined with the on-line learning algorithm to find a combination of "search experts," each of which is a domain-specific query expansion strategy for a WWW search engine, and present experimental results that demonstrate the merits of our approach.

Jeff Elser and John Paxton - January 27, 2005 - Optimizing a search engine

ht://dig performs three major tasks that should be performed in the following order:

  1. Digging
    Before you can search, a database of all the documents that need to be searched has to be created.
  2. Merging
    Merging consists of two processes:
    A. Converting the databases of all documents to specialized databases for simple, fast searching.
    B. Merging changed information into previously existing databases.
    Even though this task could be performed at the same time as the Digging, it is a separate process for efficiency reasons. This also allows for more control over the processes implemented when merging.
  3. Searching
    Finally, the databases that were created in the previous steps can be used for actual searches. Normally, searches will be invoked by a CGI (Common Gateway Interface; a program running on the webserver) which gets input from the user through an HTML form.

Discussion Topics:
(1) An overview of ht://dig
(2) an exploratory discussion of how machine learning might be used to configure ht://dig
Reading: Browse the ht://dig website, http://www.htdig.org/

Steve Durbin - January 20, 2005 - Bayesian bots

Teaching Bayesian Behaviours to Video Game Characters
Ronan Le Hy, Anthony Arrigonia, Pierre Bessiere, Olivier Lebeltel
INRIA, France

This article explores an application of Bayesian Programming to behaviours for synthetic video games characters. We address the problem of real-time reactive selection of elementary behaviours for an agent playing a first person shooter game. We show how Bayesian Programming can lead to condensed and easier formalisation of finite state machine-like behaviour selection, and lend itself to learning by imitation, in a fully transparent way for the player.

Steve Durbin - December 9, 2004 - Perceptions of randomness

Randomness and Coincidences: Reconciling Intuition and Probability Theory
Thomas L. Griffiths & Joshua B. Tenenbaum
Department of Psychology, Stanford University

We argue that the apparent inconsistency between people's intuitions about chance and the normative predictions of probability theory, as expressed in judgments about randomness and coincidences, can be resolved by focussing on the evidence observations provide about the processes that generated them, rather than their likelihood. This argument is supported by probabilistic modeling of sequence and number production, together with two experiments that examine people's judgments about coincidences.

Zuzana Gedeon - December 2, 2004 - Clustering impossibility theorem

An Impossibility Theorem for Clustering
Jon Kleinberg
Department of Computer Science, Cornell University

Although the study of clustering is centered around an intuitively compelling goal, it has been very difficult to develop a unified framework for reasoning about it at a technical level, and profoundly diverse approaches to clustering abound in the research community. Here we suggest a formal perspective on the difficulty in finding such a unification, in the form of an impossibility theorem: for a set of three simple properties, we show that there is no clustering function satisfying all three. Relaxations of these properties expose some of the interesting (and unavoidable) trade-offs at work in well-studied clustering techniques such as single-linkage, sum-of-pairs, k-means, and k-median.

Mike Emery and John Paxton - November 18, 2004 - Structure in multi-agent systems

Cougarr Agent Communities
Ronald D. Snyder and Douglas C. MacKenzie
Mobile Intelligence Corporation

The ability to organize agents into abstract groups provides a powerful tool for agent organization and communication in large multi-agent systems. In this paper we outline the fundamental characteristics required of a robust and scalable agent grouping mechanism. These characteristics are then discussed in the context of the Cougarr community infrastructure. This implementation of agent communities illustrates the use of distributed state management, distributed event propagation and abstract messaging in a high-performance agent architecture designed for robustness and scalability.

Steve Durbin - November 4, 2004 - Question-answering with soft text patterns

Unsupervised Learning of Soft Patterns for Generating Definitions from Online News
Hang Cui, Min-Yen Kan, and Tat-Seng Chua
Department of Computer Science, School of Computing, National University of Singapore

Breaking news often contains timely definitions and descriptions of current terms, organizations and personalities. We utilize such web sources to construct definitions for such terms. Previous work has identified definitions using hand-crafted rules or supervised learning that constructs rigid, hard text patterns. In contrast, we demonstrate a new approach that uses flexible, soft matching patterns to characterize definition sentences. Our soft patterns are able to effectively accommodate the diversity of definition sentence structure exhibited in news. We use pseudo-relevance feedback to automatically label sentences for use in soft pattern generation. The application of our unsupervised method significantly improves baseline systems on both the standardized TREC corpus as well as crawled online news articles by 27% and 30%, respectively, in terms of F measure. When applied to a state-of-art definition generation system recently fielded in the TREC 2003 definitional question answering task, it improves the performance by 14%.

Binhai Zhu - October 28, 2004 - Datamining in sparse databases

Efficient discovery of error-tolerant frequent itemsets in high dimensions
Cheng Yang
Dept. of Computer Science, Stanford University
Usama Fayyad and Paul S. Bradley
digiMine, Inc.

We present a generalization of frequent itemsets allowing for the notion of errors in the itemset definition. We motivate the problem and present an efficient algorithm that identifies error-tolerant frequent clusters of items in transactional data (customer purchase data, web browsing data, text, etc.). The algorithm exploits sparseness of the underlying data to find large groups of items that are correlated over database records (rows). The notion of transaction coverage allows us to extend the algorithm and view it as a fast clustering algorithm for discovering segments of similar transactions in binary sparse data. We evaluate the new algorithm on three real-world applications: clustering high-dimensional data, query selectivity estimation and collaborative filtering. Results show that the algorithm consistently uncovers structure in large sparse databases that other traditional clustering algorithms fail to find.

Rafal Angryk, Steve Durbin - October 21, 2004 - Uncertain knowledge in databases and multi-agent communication

Current approaches to handling imperfect information in data and knowledge bases
Simon Parsons
Advanced Computation Laboratory, Imperial Cancer Research Fund

This paper surveys methods for representing and reasoning with imperfect information. It opens with an attempt to classify the different types of imperfection that may pervade data, and a discussion of the sources of such imperfections. The classification is then used as a framework for considering work that explicitly concerns the representation of imperfect information, and related work on how imperfect information may be used as a basis for reasoning. The work that is surveyed is drawn from both the field of databases and the field of artificial intelligence. Both of these areas have long been concerned with the problems caused by imperfect information, and this paper stresses the relationships between the approaches developed in each.

The Byzantine generals problem
Leslie Lamport, Robert Shostak, and Marshall Pease
SRI International

Reliable computer systems must handle malfunctioning components that give conflicting information to different parts of the system. This situation can be expressed abstractly in terms of a group of generals of the Byzantine army camped with their troops around an enemy city. Communicating only by messenger, the generals must agree upon a common battle plan. However, one or more of them may be traitors who will try to confuse the others. The problem is to find an algorithm to ensure that the loyal generals will reach agreement. It is shown that, using only oral messages, this problem is solvable if and only if more that two-thirds of the generals are loyal; so a single traitor can confound two loyal generals. With unforgeable written messages, the problem is solvable for any number of generals and possible traitors. Applications of the solutions to reliable computer systems are then discussed.

Doug Warner - October 7, 2004 - Processing uncertain knowledge

How Humans Process Uncertain Knowledge: An Introduction for Knowledge Engineers
Robert F. Hink and David L. Woods

The question of how humans process uncertain information is important to the development of knowledge-based systems in terms of both knowledge acquisition and knowledge representation. This article reviews three bodies of psychological research that address this question: human perception, human probabilistic and statistical judgement, and human choice behavior. The general conclusion is that human behavior under uncertainty is often suboptimal and sometimes even fallacious. Suggestions for knowledge engineers in detecting and obviating such errors are discussed. The requirements for a system designed to reduce the effects of human factors in the processing of uncertain knowledge are introduced.

Steve Durbin - September 30, 2004 - Cognitive biases and bounded rationality

Chapter from the very readable CIA publication Psychology of Intelligence Analysis
Biases in Estimating Probabilities

Daniel Kahneman Nobel Prize lecture
Maps of Bounded Rationality: A Perspective on Intuitive Judgement and Choice

Multiple - September 23, 2004 - Evaluation methods

This week, multiple presenters will give short presentations on evaluation methods of their choice. These may be methods of evaluating competing algorithms, or methods of evaluating data so as to select the most appropriate algorithm for a particular task. One method, the ROC curve, is described at the sites below:

The Magnificent ROC
Interpreting Diagnostic Tests

Steve Durbin - September 16, 2004 - Machine learning for computer security

Learning to Detect Malicious Executables in the Wild
Jeremy Z. Kolter and Marcus A. Maloof
Department of Computer Science, Georgetown University

In this paper, we describe the development of a fielded application for detecting malicious executables in the wild. We gathered 1971 benign and 1651 malicious executables and encoded each as a training example using n-grams of byte codes as features. Such processing resulted in more than 255 million distinct n-grams. After selecting the most relevant n-grams for prediction, we evaluated a variety of inductive methods, including naive Bayes, decision trees, support vector machines, and boosting. Ultimately, boosted decision trees outperformed other methods with an area under the ROC curve of 0.996. Results also suggest that our methodology will scale to larger collections of executables. To the best of our knowledge, ours is the only fielded application for this task developed using techniques from machine learning and data mining.

Lou Glassy - May 6, 2004 - Missing Text Reconstruction

Public PhD Dissertation Defense, Computer Science Dept, Montana State University.
2:00pm in EPS 108


Neal Richter - April 22, 2004 - Expectation Maximization Algorithm

The Expectation Maximization Algorithm
Frank Dellaert
College of Computing, Georgia Institute of Technology

An intuitive tutorial of the "Expectation Maximization Algorithm". The EM algorithm is general method for finding a maximum-likelihood estimate of the parameters of a probability distribution from a data set with incomplete/missing values. The general EM is a description of a meta-algorithm, which is used to design a particular algorithm. It is used in pattern recognition and information retrieval communities for both unsupervised and semi-supervised learning algorithms as well as other research areas.

Steve Durbin - April 15, 2004 - Natural language interface

A Reliable Natural Language Interface to Household Appliances
Alexander Yates, Oren Etzioni, Daniel Weld
University of Washington, Computer Science

As household appliances grow in complexity and sophistication, they become harder and harder to use, particularly because of their tiny display screens and limited keyboards. This paper describes a strategy for building natural language interfaces to appliances that circumvents these problems. Our approach leverages decades of research on planning and natural language interfaces to databases by reducing the appliance problem to the database problem; the reduction provably maintains desirable properties of the database interface. The paper goes on to describe the implementation and evaluation of the EXACT interface to appliances, which is based on this reduction. EXACT maps each English user request to an SQL query, which is transformed to create a PDDL goal, and uses the Blackbox planner [13] to map the planning problem to a sequence of appliance commands that satisfy the original request. Both theoretical arguments and experimental evaluation show that EXACT is highly reliable.

Steve Durbin - April 8, 2004 - Affect sensing

What Would They Think? A Computational Model of Attitudes
Hugo Liu and Patti Maes
MIT Media Laboratory

A key to improving at any task is frequent feedback from people whose opinions we care about: our family, friends, mentors, and the experts. However, such input is not usually available from the right people at the time it is needed most, and attaining a deep understanding of someone else's perspective requires immense effort. This paper introduces a technological solution. We present a novel method for automatically modeling a person's attitudes and opinions, and a proactive interface called "What Would They Think?" which offers the just-in-time perspectives of people whose opinions we care about, based on whatever the user happens to be reading or writing. In the application, each person is represented by a "digital persona," generated from an automated analysis of personal texts ( e.g. weblogs and papers written by the person being modeled) using natural language processing and commonsense-based textual-affect sensing. In user studies, participants using our application were able to grasp the personalities and opinions of a panel of strangers more quickly and deeply than with either of two baseline methods. We discuss the theoretical and pragmatic implications of this research to intelligent user interfaces.

Neal Richter and Doug Warner - April 1, 2004 - Automatic Classification of Photographic Images

Automatic Classification of Photographic Images
Loof Lirpa and Sourd-muet D'âne
University of DummköpfeeFrühling and Université Polytechniques des Iimbécil

This paper demonstrates an automatic system for classifying photographic images according to genre. The system marks pixels using combined color, texture and temperature properties. These marked regions are then fed to a specialized classifier, which attempts to group similar objects using geometric constraints. If the classifier finds a sufficiently complex structure,the system decides an object type is present and labels the genre appropriately. The system shows excellent performance on a test set of 565 uncontrolled images of art, mostly obtained from the internet, and 4289 assorted control images, drawn from a wide variety of sources.

Other Reading

Bob Wall, Jeff Elser, ... - March 25, 2004 - Tutorial on Support Vector Machines

A Tutorial on nu-Support Vector Machines
Pai-Hsuen Chen*, Chih-Jen Lin*, and Bernhard Schölkopf**
*National Taiwan University and **Max Planck Institute for Biological Cybernetics

We briefly describe the main ideas of statistical learning theory, support vector machines (SVMs), and kernel feature spaces. We place particular emphasis on a description of the so-called nu-SVM, including details of the algorithm and its implementation, theoretical results, and practical applications.

Ken Dyke - March 18, 2004 - Stupid vs. Intelligent Networks

End-to-End Arguments in System Design
J.H. Saltzer, D.P. Reed and D.D. Clark
M.I.T. Laboratory for Computer Science

This paper presents a design principle that helps guide placement of functions among the modules of a distributed computer system. The principle, called the end-to-end argument, suggests that functions placed at low levels of a system may be redundant or of little value when compared with the cost of providing them at that low level. Examples discussed in the paper include bit error recovery, security using encryption, duplicate message suppression, recovery from system crashes, and delivery acknowledgement. Low level mechanisms to support these functions are justified only as performance enhancements.

Additional article online
Rise of the Stupid Network

Steve Durbin - March 11, 2004 - Faceted Metadata for Search

Flexible Search and Navigation using Faceted Metadata
Jennifer English, Marti Hearst, Rashmi Sinha, Kirsten Swearingen, Ka-Ping Yee
School of Information Management & Systems, University of California, Berkeley

We have developed an innovative search interface that allows non-expert users to move through large information spaces in a flexible manner without feeling lost. The design goal was to offer users a "browsing the shelves" experience seamlessly integrated with focused search. Key to achieving our goal is the explicit exposure of hierarchical faceted metadata in a manner that is intuitive and inviting to users. After several iterations of design and testing, the usability results are strikingly positive. We believe our approach marks a major step forward in search user interfaces and can serve as a model for web-based collections of up to 100,000 items.

Demo Search site
Fine Arts Search

Software demo and download site

Jeff Elser - March 4, 2004 - Classification with Support Vector Machines

Severe Storm Cell Classification Using Support Vector Machines and Radial Basis Function Approaches
L. Ramirez ^, W. Pedrycz ^, N. Pizzi ^^
^Dept. of Electrical and Computer Engineering, University of Alberta
^^Institute for Biodiagnostics, National Research Council, Canada

Meteorological volumetric data are used to detect thunderstorms that are the cause of most of the summer severe weathers. There are systems that may convert the volumetric data into a set of derived products. Based on these derived features, this work compares three classifiers to determine which approach will best classify a storm cell data set coming from Environment Canada. The criterion for comparison is the accuracy in the classification over a testing set. The three approaches compared are the support vector machine (SVM) classifier, with radial basis function (RBF) kernel; the classic RBF classifier, with the centres found using the orthogonal least squares approach; and the hybrid RBF, with the centres corresponding to the support vectors found using the SVM approach. The results show that the SVM approach is the best of these approaches, in terms of accuracy, for the storm cell classification.

Background paper on support vector machines
Support-Vector Networks
Corinna Cortes and Vladimir Vapnik
AT&T Research Labs

Chad Nybo - February 26, 2004 - Zero Intelligence Agents in Economics

The Predictive Power of Zero Intelligence in Financial Markets
J. Doyne Farmer [1], Paolo Patelli [1,2], Ilija I. Zovko [1,3]
[1] Santa Fe Institute; [2] Sant'Anna School of Advanced Studies, Pisa; [3] CENDEF, University of Amsterdam

Using data from the London Stock Exchange, we test a model that treats the statistical mechanics of price formation and the accumulation of stored supply and demand under the simple assumption that people place orders to trade at random. The model makes excellent predictions for transaction costs, price diffusion rates, and a quantity closely related to supply and demand. Thus, it appears that the price formation mechanism strongly constrains the market, playing a more important role than the strategic behavior of agents. The remarkable success of this approach suggests a new and unorthodox approach to economics.

Brief article in Nature Science Update
Stock market traders show signs of zero intelligence

Extra credit paper (4 pages)
High-Performance Bidding Agents for the Continuous Double Auction
Gerald Tesauro and Rajarshi Das
IBM T.J. Watson Research Center

Steve Durbin - February 19, 2004 - Using Expectation to Select Features

Using Expectation to Guide Processing: A Study of Three Real-World Applications
Shumeet Baluja
Justsystem Pittsburgh Research Center and School of Computer Science, Carnegie Mellon

In many real world tasks, only a small fraction of the available inputs are important at any particular time. This paper presents a method for ascertaining the relevance of inputs by exploiting temporal coherence and predictability. The method proposed in this paper dynamically allocates relevance to inputs by using expectations of their future values. As a model of the task is learned, the model is simultaneously extended to create task-specific predictions of the future values of inputs. Inputs which are either not relevant, and therefore not accounted for in the model, or those which contain noise, will not be predicted accurately. These inputs can be deemphasized, and, in turn, a new, improved, model of the task created. The techniques presented in this paper have yielded significant improvements for the vision-based autonomous control of a land vehicle, vision-based hand tracking in cluttered scenes, and the detection of faults in the etching of semiconductor wafers.

Doug Warner - February 12, 2004 - Link-Based Recommendation

On the Recommending of Citations for Research Papers
Sean M. McNee, Istvan Albert, Dan Cosley, Prateep Gopalkrishnan, Shyong K. Lam, Al Mamunur Rashid, Joseph A. Konstan, John Riedl
GroupLens Research Project, Dept. of Computer Science and Engineering, University of Minnesota

Collaborative filtering has proven to be valuable for recommending items in many different domains. In this paper, we explore the use of collaborative filtering to recommend research papers, using the citation web between papers to create the ratings matrix. Specifically, we tested the ability of collaborative filtering to recommend citations that would be suitable additional references for a target research paper. We investigated six algorithms for selecting citations, evaluating them through offline experiments against a database of over 186,000 research papers contained in ResearchIndex. We also performed an online experiment with over 120 users to gauge user opinion of the effectiveness of the algorithms and of the utility of such recommendations for common research tasks. We found large differences in the accuracy of the algorithms in the offline experiment, especially when balanced for coverage. In the online experiment, users felt they received quality recommendations, and were enthusiastic about the idea of receiving recommendations in this domain.

Steve Durbin - February 5, 2004 - Dynamical Systems and Cognition

Toward the Evolution of Dynamical Neural Networks for Minimally Cognitive Behavior
Randall D. Beer
Dept. of Computer Engineering and Science and Dept.of Biology, Case Western Reserve University

Current debates regarding the possible cognitive implications of ideas from adaptive behavior research and dynamical systems theory would benefit greatly from a careful study of simple model agents that exhibit minimally cognitive behavior. This paper sketches one such agent, and presents the results of preliminary experiments on the evolution of dynamical neural networks for visually-guided orientation, object discrimination and accurate pointing with a simple manipulator to objects appearing in its field of view.

Neal Richter - January 29, 2004 - Personal Search Engines

Stuff I've Seen: A System for Personal Information Retrieval and Re-Use
Susan Dumais, Edward Cutrell, JJ Cadiz, Gavin Jancke, Raman Sarin, Daniel C. Robbins, Microsoft Research.

Most information retrieval technologies are designed to facilitate information discovery. However, much knowledge work involves finding and re-using previously seen information. We describe the design and evaluation of a system, called Stuff I've Seen (SIS) , that facilitates information re-use. This is accomplished in two ways. First, the system provides a unified index of information that a person has seen, whether it was seen as email, web page, document, appointment, etc. Second, because the information has been seen before, rich contextual cues can be used in the search interface. The system has been used internally by more than 230 employees. We report on both qualitative and quantitative aspects of system use. Initial findings show that time and people are important retrieval cues. Users find information more easily using SIS, and use other search tools less frequently after installation.

Optional Short Articles

Neal Richter - January 22, 2004 - Topic Segmentation & Search Engines

SETS: Search Enhanced by Topic Segmentation
Mayank Bawa, Gurmeet Singh Manku, & Prabhakar Raghavan

We present SETS, an architecture for efficient search in peer-to-peer networks, building upon ideas drawn from machine learning and social network theory. The key idea is to arrange participating sites in a topic-segmented overlay topology in which most connections are short-distance, connecting pairs of sites with similar content. Topically focused sets of sites are then joined together into a single network by long-distance links. Queries are matched and routed to only the topically closest regions. We discuss a variety of design issues and tradeoffs that an implementor of SETS would face. We show that SETS is efficient in network traffic and query processing load.

Steve Durbin - January 15, 2004 - Intelligent interfaces

Automatically Personalizing User Interfaces
Daniel S. Weld, Corin Anderson, Pedro Domingos, Oren Etzioni, Krzysztof Gajos, Tessa Lau, and Steve Wolfman
Dept. of Computer Science and Engineering, University of Washington

Today's computer interfaces are one-size-fits-all. Users with little programming experience have very limited opportunities to customize an interface to their task and work habits. Furthermore, the overhead induced by generic interfaces will be proportionately greater on small form-factor PDAs, embedded applications and wearable devices. Automatic personalization may greatly enhance user productivity, but it requires advances in customization (explicit, user-initiated change) and adaptation (interface-initiated change in response to routine user behavior). In order to improve customization, we must make it easier for users to direct these changes. In order to improve adaptation, we must better predict user behavior and navigate the inherent tension between the dynamism of automatic adaptation and the stability required in order for theuser to predict the computers behavior and maintain control. This paper surveys a decade's work on customization and adaptation at the University of Washington, distilling the lessons we have learned.

Steve Durbin - December 11, 2003 - Choices in Natural Language Generation

Experiments with discourse-level choices and readability
Sandra Williams, Ehud Reiter and Liesl Osman
Depts. of Computing Science and Medicine and Therapeutics, University of Aberdeen

This paper reports on pilot experiments that are being used, together with corpus analysis, in the development of a Natural Language Generation (NLG) system, GIRL (Generator for Individual Reading Levels). GIRL generates reports for individuals after a literacy assessment. We tested GIRL's output on adult learner readers and good readers. Our aim was to find out if choices the system makes at the discourse level have an impact on readability. Our preliminary results indicate that such choices do indeed appear to be important for learner readers. These will be investigated further in future larger-scale experiments. Ultimately we intend to use the results to develop a mechanism that makes discourse-level choices that are appropriate for individuals' reading skills.

Additional paper (easy read!):
Human Variation and Lexical Choice
Ehud Reiter and Somayaljulu Sripada

Jeff Elser, Leif Wickland, and Ashish Kapoor - December 4, 2003 - Beyond Turing Machines

Turing's Ideas and Models of Computation
Eugene Eberbach, Dina Goldin, and Peter Wegner
Computer Science Depts. of UMass, Univ. of Connecticut, and Brown University

The theory of computation that we have inherited from the 1960's focuses on algorithmic computation as embodied in the Turing Machine to the exclusion of other types of computation that Turing had considered. In this chapter we present new models of computation, inspired by Turing's ideas, that are more appropriate for today's interactive, networked, and embedded computing systems. These models represent super-Turing computation, going beyond Turing Machines and algorithms. We identify three principles underlying super-Turing computation (interaction with the world, infinity of resources, and evolution of system) and apply these principles in our discussion of the implications of super-Turing computation for the future of computer science.

Additional reference (~200 pages):
Study, Implementation, and Evolution of the Artificial Neural Networks Proposed by Alan M. Turing

Steve Durbin - November 20, 2003 - Context-sensitive computing

Out of context: Computer systems that adapt to, and learn from, context
Henry Lieberman and Ted Selker
MIT Media Lab

There is a growing realization that computer systems will need to be increasingly sensitive to their context. Traditionally, hardware and software were conceptualized as input/output systems: systems that took input, explicitly given to them by a human, and acted upon that input alone to produce an explicit output. Now, this view is seen as being too restrictive. Smart computers, intelligent agent software, and digital devices of the future will have to operate on data that are not explicitly given to them, data that they observe or gather for themselves. These operations may be dependent on time, place, weather, user preferences, or the history of interaction. In other words, context. But what, exactly, is context? We look at perspectives from software agents, sensors, and embedded devices, and also contrast traditional mathematical and formal approaches. We see how each treats the problem of context and discuss the implications for design of context-sensitive hardware and software.

Neal Richter - November 13, 2003 - Information Retrieval and Language Modeling

Challenges in Information Retrieval and Language Modeling
James Allan (editor), Jay Aslam, Nicholas Belkin, Chris Buckley, Jamie Callan, Bruce Croft (editor), Sue Dumais, Norbert Fuhr, Donna Harman, David J. Harper, Djoerd Hiemstra, Thomas Hofmann, Eduard Hovy, Wessel Kraaij, John Lafferty, Victor Lavrenko, David Lewis, Liz Liddy, R. Manmatha, Andrew McCallum, Jay Ponte, John Prager, Dragomir Radev, Philip Resnik, Stephen Robertson, Roni Rosenfeld, Salim Roukos, Mark Sanderson, Rich Schwartz, Amit Singhal, Alan Smeaton, Howard Turtle, Ellen Voorhees, Ralph Weischedel, Jinxi Xu, ChengXiang Zhai

Information retrieval (IR) research has reached a point where it is appropriate to assess progress and to define a research agenda for the next five to ten years. This report summarizes a discussion of IR research challenges that took place at a recent workshop. The attendees of the workshop considered information retrieval research in a range of areas chosen to give broad coverage of topic areas that engage information retrieval researchers. Those areas are retrieval models, cross-lingual retrieval, Web search, user modeling, filtering, topic detection and tracking, classification, summarization, question answering, metasearch, distributed retrieval, multimedia retrieval, information extraction, as well as testbed requirements for future work. The potential use of language modeling techniques in these areas was also discussed. The workshop identified major challenges within each of those areas.

Steve Durbin - November 6, 2003 - Text categorization using decision trees and rules

A Decision-Tree-Based Symbolic Rule Induction System for Text Categorization
David E. Johnson, Frank J. Oles, Tong Zhang, and Thilo Goetz
IBM T. J. Watson Research Center

We present a decison-tree-based symbolic rule induction system whose purpose is to categorize text documents automatically. Our method for rule induction involves the novel combination of (1) a fast decision tree induction algorithm especially suited to text data and (2) a new method for converting a decision tree to a rule set that is simplified, but still logically equivalent to the original tree. We report experimental results on the use of this system on some practical problems.

Steve Durbin - October 30, 2003 - Search and prediction of social networks

Identity and Search in Social Networks
Duncan J. Watts*#$, Peter Sheridan Dodds#, M. E. J. Newman$
*Department of Sociology, Columbia University; #Columbia Earth Institute, Columbia University; $Santa Fe Institute

Social networks have the surprising property of being "searchable": Ordinary people are capable of directing messages through their network of acquaintances to reach a specific but distant target person in only a few steps. We present a model that offers an explanation of social network searchability in terms of recognizable personal identities: sets of characteristics measured along a number of social dimensions. Our model defines a class of searchable networks and a method for searching them that may be applicable to many network search problems, including the location of data files in peer-to-peer networks, pages on the World Wide Web, and information in distributed databases.

The Link Prediction Problem for Social Networks
David Liben-Nowell* and Jon Kleinberg#
*Laboratory for Computer Science, MIT; #Department of Computer Science, Cornell University

Given a snapshot of a social network, can we infer which new interactions among its members are likely to occur in the near future? We formalize this question as the link prediction problem, and develop approaches to link prediction based on measures of the "proximity" of nodes in a network. Experiments on large co-authorship networks suggest that information about future interactions can be extracted from network topology alone, and that fairly subtle measures for detecting node proximity can outperform more direct measures.

See also previous papers related to small-world networks:
Introduction to small-world networks

Zuzana Gedeon - October 23, 2003 - Recursive Partitioning with Multiple Trees

Extracting Representative Tree Models
Hugh A. Chipman, Edward I. George and Robert E. McCulloch
Department of Statistics and Actuarial Science, University of Waterloo

A common criticism of many methods for constructing tree models is that a single tree or nested sequence of trees is produced, and that much uncertainty about the tree structure is ignored. Recent search algorithms (bumping, boosting, simulated annealing, MCMC) address this problem by finding a much richer collection of trees. They lead to an embarrassment of riches, in that it may be difficult to make sense of the resultant forest. Quite often, the problem may not be as bad as it seems: although hundreds of distinct trees are identified, many will differ only at a few nodes. Other trees may have different topologies, but produce similar partitions of the predictor space. By defining several distance metrics on trees, we summarize a forest of trees by several representative trees and associated clusters. A new plot, the added tree plot is introduced as a means to decide how many trees to examine while simultaneously adjusting for the goodness-of-fit of the trees considered.

There's also a useful description of the method in the User's Manual for
FIRM (Formal Inference-based Recursive Modeling) package for recursive partitioning

Ashish Kapoor - October 16, 2003 - Genetic Algorithm for VLSI Layout

A Genetic Algorithm for Macro Cell Placement
Henrik Esbensen
Computer Science Department, Aarhus University

A new genetic algorithm for the macro cell placement problem is presented. The algorithm is based on a generalization of the two-dimensional bin packing problem. The genetic encoding of a macro cell placement and the corresponding genetic operators are described. The algorithm has been tested on MCNC benchmarks, and the quality of the produced placements are comparable to the best published results.

Longer background paper on VLSI design:
VLSI Cell Placement Techniques
K. Shahookar and P. Mazumder
Department of Electrical Engineering and Computer Sc~ence
University of Michigan

VLSI cell placement problem is known to be NP complete. A wide repertoire of heuristic algorithms exists in the literature for efficiently arranging the logic cells on a VLSI chip. The objective of this paper is to present a comprehensive survey of the various cell placement techniques, with emphasis on standard ce11and macro placement. Five major algorithms for placement are discussed: simulated annealing, force-directed placement, rein-cut placement, placement by numerical optimization, and evolution-based placement. The first two classes of algorithms owe their origin to physical laws, the third and fourth are analytical techniques, and the fifth class of algorithms is derived from biological phenomena. In each category, the basic algorithm is explained with appropriate examples. Also discussed are the different implementations done by researchers.

Neal Richter - October 9, 2003 - Machine Learning: Rule Learning and Extraction

Rule Learning and Extraction (Lecture Slides)
William H Hsu
Department of Computing and Information Sciences, Kansas State Univ.

This week's Colloquium will cover material out of Chapter 10 of Tom Mitchell's Machine Learning book. The slides linked are from a course based upon this book at Kansas State University.

All Lecture Slides from Machine Learning and Pattern Recognition, Fall 2001, Prof. William H. Hsu

Text Categorization and Relational Learning (1995)
William W. Cohen

We evaluate the first order learning system FOIL on a series of text categorization problems. It is shown that FOIL usually forms classifiers with lower error rates and higher rates of precision and recall with a relational encoding than with a propositional encoding. We show that FOIL's performance can be improved by relation selection, a first order analog of feature selection. Relation selection improves FOIL's performance as measured by any of recall, precision, F-measure, or error rate. With an appropriate level of relational selection, FOIL appears to be compedative with or superior to existing propositional techniques.

Jeff Elser - October 2, 2003 - Fuzzy controllers

Some Crisp Thoughts on Fuzzy Logic
Daniel Abramovitch
Hewlett-Packard Laboratories

In the past year I have been inundated with articles on fuzzy logic as well as encouraged to use it for control systems. After reading some articles on fuzzy logic control, listening to a seminar by Zadeh, and attending a one day course on Intelligent Control, I started forming an opinion about how fuzzy logic control works. I believe that there are some fundamental pieces of information not provided in most fuzzy logic control papers. When one realizes what those pieces of information are, one gets a different opinion about how and when fuzzy logic control works and when it is more practical than conventional control. I will first state some opinions on fuzzy logic and try to justify them. Once this is done, I will return to some of the articles written by proponents of fuzzy logic and use the previous understanding to shed some light on what is really responsible for the improved system performance.

Background reading on fuzzy logic:
Introduction to Fuzzy Control

See also an earlier paper about fuzzy clustering:
Mining Web Access Logs Using Relational Competitive Fuzzy Clustering

Zuzana Gedeon - September 25, 2003 - Improving PageRank

Topic-Sensitive PageRank
Taher H. Haveliwala
Computer Science Department, Stanford University

In the original PageRank algorithm for improving the ranking of search-query results, a single PageRank vector is computed, using the link structure of the Web, to capture the relative "importance" of Web pages, independent of any particular search query. To yield more accurate search results, we propose computing a set of PageRank vectors, biased using a set of representative topics, to capture more accurately the notion of importance with respect to a particular topic. By using these (precomputed) biased PageRank vectors to generate query-specific importance scores for pages at query time, we show that we can generate more accurate rankings than with a single, generic PageRank vector. For ordinary keyword search queries, we compute the topic-sensitive PageRank scores for pages satisfying the query using the topic of the query keywords. For searches done in context (e.g., when the search query is performed by highlighting words in a Web page), we compute the topic-sensitive PageRank scores using the topic of the context in which the query appeared.

See also last week's paper for discussion related to PageRank..

Steve Durbin - September 18, 2003 - Link analysis to improve searching

Implicit Link Analysis for Small Web Search
G.-R. Xue et al
Computer Science and Engineering, Shanghai Jiao-Tong University
and Microsoft Research Asia

Current Web search engines generally impose link analysis-based re-ranking on web-page retrieval. However, the same techniques, when applied directly to small web search such as intranet and site search, cannot achieve the same performance because their link structures are different from the global Web. In this paper, we propose an approach to constructing implicit links by mining users' access patterns, and then apply a modified PageRank algorithm to re-rank web-pages for small web search. Our experimental results indicate that the proposed method outperforms content-based method by 16%, explicit link-based PageRank by 20% and DirectHit by 14%, respectively.

See also an earlier paper about Google and its PageRank algorithm
The Anatomy of a Large-Scale Hypertextual Web Search Engine

Dustin Lee - September 11, 2003 - Information markets

Avoiding Moral Hazards in Organizational Forecasting
Tad Hogg and Bernardo A. Huberman
Hewlett-Packard Labs

We describe a new mechanism that induces accurate forecasts within an organization while reducing moral hazards and the stigma associated with negative opinions. It is based on the notion of identity escrow, whereby the identity of a forecaster is kept anonymous and only revealed when a number of his subordinates detect an attitude that is contrary to the interests of the organization. An analysis of the relative payoffs between forecasting and production shows that through the use of identity escrows one can adjust the size of the prediction group so as to ensure both production and accurate forecasts.

See also a Kuro5hin op-ed piece about the recently defunct Policy Analysis Market
The Policy Analysis Market: Why It is a Great Idea

Louis Glassy - August 21, 2003 - Missing text reconstruction

Missing-Text Reconstruction
Louis Glassy
Department of Computer Science, Montana Tech

Missing-text reconstruction (MTR) is a new application of text-oriented pattern recognition. The goal of MTR is to fully reconstruct documents in which fragments of original text are missing. Using context stored as n-gram models of the document's source language, the current MTR system makes sets of hypotheses of the missing text, and combines these sets, using Dempster's Rule of Combination, to form the best-supported reconstruction of the missing text. A software system (MITRE) was developed to act as a proof-of-concept for the MTR techniques discussed.

See also a previous colloquium Missing text reconstruction

Steve Durbin - July 31, 2003 - Machine translation

Statistical Phrase-Based Translation
Philipp Koehn, Franz Josef Och, Daniel Marcu
Department of Computer Science, University of Southern California

We propose a new phrase-based translation model and decoding algorithm that enables us to evaluate and compare several, previously proposed phrase-based translation models. Within our framework, we carry out a large number of experiments to understand better and explain why phrase-based models out-perform word-based models. Our empirical results, which hold for all examined language pairs, suggest that the highest levels of performance can be obtained through relatively sim-ple means: heuristic learning of phrase translations from word-based alignments and lexical weighting of phrase translations. Surprisingly, learning phrases longer than three words and learning phrases from high-accuracy wordlevel alignment models does not have a strong impact on performance. Learning only syntactically motivated phrases degrades the performance of our systems.

See also the recent Slashdot story Romancing The Rosetta Stone

Zuzana Gedeon - July 24, 2003 - STL

To go back to more of a tutorial, than regular colloquium next week we will cover Standard Template Library. We will should cover some basics and discuss advantages/disadvantages of using STL in production code. If you have any insights into this area please come and share. Some recommended reading/tutorials can be found at:

SGI's very own Standard Template Library Programmer's Guide

Doug Warner - July 10, 2003 - Software patents

Patents for Software-Related Inventions
Jeffrey R. Kuester and Ann K. Moceyunas

This paper was written in March of 1995 by Jeffrey R. Kuester and Ann K. Moceyunas for the purpose of Jeff teaching a college class on software patents. Jeff is a partner with Thomas, Kayden, Horstemeyer & Risley, L.L.P., a patent, copyright, and trademark law firm in Atlanta, Georgia, and his practice focuses on software, electrical, and telecommunication patent law.

Patently Absurd
James Gleick

Once the province of a nuts-and-bolts world, patents are now being applied to thoughts and ideas in cyberspace. It's a ridiculous phenomenon and a nightmare for e-commerce.

Steve Durbin - July 3, 2003 - Mining user behavior to improve searching

Optimizing Search Engines using Clickthrough Data
Thorsten Joachims
Department of Computer Science, Cornell University

This paper presents an approach to automatically optimizing the retrieval quality of search engines using clickthrough data. Intuitively, a good information retrieval system should present relevant documents high in the ranking, with less relevant documents following below. While previous approaches to learning retrieval functions from examples exist, they typically require training data generated from relevance judgments by experts. This makes them difficult and expensive to apply. The goal of this paper is to develop a method that utilizes clickthrough data for training, namely the query-log of the search engine in connection with the log of links the users clicked on in the presented ranking. Such clickthrough data is available in abundance and can be recorded at very low cost. Taking a Support Vector Machine (SVM) approach, this paper presents a method for learning retrieval functions. From a theoretical perspective, this method is shown to be well-founded in a risk minimization framework. Furthermore, it is shown to be feasible even for large sets of queries and features. The theoretical results are verified in a controlled experiment. It shows that the method can effectively adapt the retrieval function of a meta-search engine to a particular group of users, outperforming Google in terms of retrieval quality after only a couple of hundred training examples.

Collaborative_Crawling: Mining User Experiences for Topical Resource Discovery
Charu Aggarwal
IBM Research

The rapid growth of the world wide web had made the problem of topic specific resource discovery an important one in recentyears. In this problem, it is desired to find web pages which satisfy a predicate specified by the user. Such a predicate could be a keyword query, a topical query, or some arbitrary contraint. Several techniques such as focussed crawling and intelligent crawling have recently been proposed for topic specific resource discovery. All these crawlers are linkage based, since they use the hyperlink behavior in order to perform resource discovery. Recent studies have shown that the topical correlations in hyperlinks are quite noisy and may not always show the consistency necessary for a reliable resource discovery process. In this paper, we will approach the problem of resource discovery from an entirely different perspective; we will mine the significant browsing patterns of world wide web users in order to model the likelihood of web pages belonging to a specified predicate. This user behavior can be mined from the freely available traces of large public domain proxies on the world wide web. We refer to this technique as collaborative crawling because it mines the collective user experiences in order to find topical resources. Such a strategy is extremely effective because the topical consistency in world wide web browsing patterns turns out to very reliable. In addition, the user-centered crawling system can be combined with linkage based systems to create an overall system whichworks more effectively than a system based purely on either user behavior or hyperlinks.

Steve Durbin - June 26, 2003 - Content-based recommender system

Building Recommender Systems using a Knowledge Base of Product Semantics
Rayid Ghani and Andrew Fano
Accenture Technology Labs

Online retailers have access to large amounts of transactional data but current recommender systems tend to be short-sighted in nature and usually focus on the narrow problem of pushing a set of closely related products that try to satisfy the user's current need. Most ecommerce recommender systems analyze a large amount of transactional data without actually having any idea of what the items in the transactions mean or what they say about the customers who purchased or browsed those items. In this paper, we present a case study of a system that recommends items based on a custom-built knowledge base that consists of products and associated semantic attributes. Our system first extracts semantic features that characterize the domain of interest, apparel products in our case, using text learning techniques and populates a knowledge base with these products and features. The recommender system analyzes descriptions of products that the user browses or buys and automatically infers these semantic attributes to build a model of the user. This abstraction allows us to not only recommend other items in the same class of products that "match" the user model but also gives us the ability to understand the customer's "tastes" and recommend items across categories for which traditional collaborative filtering and contentbased systems are unsuitable. Our approach also allows us to "explain" the recommendations in terms of qualitative features which, we believe, enhances the user experience and helps build the user's confidence in the recommendations.

Steve Durbin - June 19, 2003 - Data mining challenges and cautions

CoIL Challenge 2000 Tasks and Results: Predicting and Explaining Caravan Policy Ownership
Peter van der Putten (1,2), Michel de Ruiter (1), Maarten van Someren (3)(1)
Sentient Machine Research, Amsterdam
Leiden Institute of Advanced Computer Science, Leiden University
Sociaal Wetenschappelijke Informatica, Universiteit van Amsterdamm

We present the problem tasks of the CoIL Challenge as they were explained to the participants. Furthermore, a general overview is given of the Challenge results.

COIL CHALLENGE 2000 ENTRY (prediction winner)
Charles Elkan
Department of Computer Science and Engineering
University of California, San Diego

Magical Thinking in Data Mining: Lessons From CoIL Challenge 2000
Charles Elkan
Department of Computer Science and Engineering
University of California, San Diego

CoIL challenge 2000 was a supervised learning contest that attracted 43 entries. The authors of 29 entries later wrote explanations of their work. This paper discusses these reports and reaches three main conclusions. First, naive Bayesian classifiers remain competitive in practice: they were used by both the winning entry and the next best entry. Second, identifying feature interactions correctly is important for maximizing predictive accuracy: this was the difference between the winning classifier and all others. Third and most important, too many researchers and practitioners in datamining do not appreciate properly the issue of statistical significance and the danger of overfitting. Given a dataset such as the one for the CoIL contest, it is pointless to apply a very complicated learning algorithm, or to perform a very time-consuming model search. In either case, one is likely to overfit the training data and to fool oneself in estimating predictive accuracy and in discovering useful correlations.

Business-oriented data mining competitions:
CoIL Challenge 2000
KDD Cup 1998
KDD Cup 2000

Lou Glassy - May 1, 2003 - Missing text reconstruction

The basic idea in missing text reconstruction is simple: Based on a corpus of training text, construct n-gram tables which will represent a statistical model of the natural language of the corpus. Given a document with "holes" -- missing text -- use the statistical model represented by the n-grams to reconstruct ("fill in") the holes in the document.

Links on Dempster-Shafer theory:
Dempster-Shafer Theory of Evidence
Dempster-Shafer Theory

Link on information theory:
Introduction to the Information Theory

Online book on information theory:
A Short Course in Information Theory: 8 lectures by David J.C. MacKay

Neal Richter - April 24, 2003 - Genetic Algorithms

Adaptive and Self-adaptive Evolutionary Computations
Peter J. Angeline
Loral Federal Systems

This paper reviews the various studies that have introduced adaptive and self-adaptive parameters into Evolutionary Computations. A formal definition of an adaptive evolutionary computation is provided with an analysis of the types of adaptive and self-adaptive parameter update rules currently in use. Previous studies are reviewed and placed into a categorization that helps to illustrate their similarities and differences.

Theory of Evolutionary Algorithms: A Birds Eye View
A. E. Eiben and G. Rudolph
Leiden Institute of Advanced Computer Science, Leiden University, The Netherlands
Department of Computer Science, Dortmund University, Germany

In this paper we consider the most important questions, research topics and technical tools used in various branches of evolutionary algorithms. The road map we give is to facilitate the readers' orientation in evolutionary computation theory. In the meanwhile, this survey provides key references for further study and evidence that the field of evolutionary computation is maturing rapidly, having many important results and even more interesting challenges.

Steve Durbin - April 10, 2003 - Mental Models of Search Engines

Transparent queries: Investigating Users' Mental Models of Search Engines
Jack Muramatsu and Wanda Pratt
Information and Computer Science, Univ. of California, Irvine

Typically, commercial Web search engines provide very little feedback to the user concerning how a particular query is processed and interpreted. Specifically, they apply key query transformations without the user's knowledge. Although these transformations have a pronounced effect on query results, users have very few resources for recognizing their existence and understanding their practical importance. We conducted a user study to gain a better understanding of users' knowledge of and reactions to the operation of several query transformations that web search engines automatically employ. Additionally, we developed and evaluated Transparent Queries, a software system designed to provide users with lightweight feedback about opaque query transformations. The results of the study suggest that users do indeed have difficulties understanding the operation of query transformations without additional assistance. Finally, although transparency is helpful and valuable, interfaces that allow direct control of query transformations might ultimately be more helpful for end-users.

Essay by Don Norman on "real world" design in industry:
Applying the Behavioral, Cognitive, and Social Sciences to Products

Ashish Kapoor - April 3, 2003 - Chatbots

Agents with Faces: The Effects of Personification of Agents
Tomoko Koda and Pattie Maes
MIT Media Laboratory

It is still an open question whether software agents should be personified in the interface. How should faces be used to make the agent more usable and likable? In order to study the effect of faces and facial expressions, we compared subjects' responses to and evaluation of different faces and facial expressions of entertaining agents. The result shows that having faces and facial expressions is likable and engaging. It takes users' effort to interpret the meaning the faces. In terms of facial features, realistic faces are better liked and rated as more intelligent than abstract faces.

Supplementary papers:
Noah and the Turing Test
A tutorial on natural language processing

Introduction with links:
Intelligent Agents: A Primer

Steve Durbin - March 27, 2003 - Learning agents in computational game theory

Learning dynamics in social dilemmas
Michael W. Macy*# and Andreas Flache##
*Department of Sociology, Cornell University
##Interuniversity Center for Social Science Theory and Methodology, University of Groningen, The Netherlands

The Nash equilibrium, the main solution concept in analytical game theory, cannot make precise predictions about the outcome of repeated mixed-motive games. Nor can it tell us much about the dynamics by which a population of players moves from one equilibrium to another. These limitations, along with concerns about the cognitive demands of forward-looking rationality, have motivated efforts to explore backward-looking alternatives to analytical game theory. Most of the effort has been invested in evolutionary models of population dynamics. We shift attention to a learning-theoretic alternative. Computational experiments with adaptive agents identify a fundamental solution concept for social dilemmas—stochastic collusion—based on a random walk from a self-limiting noncooperative equilibrium into a self-reinforcing cooperative equilibrium. However, we show that this solution is viable only within a narrow range of aspiration levels. Below the lower threshold, agents are pulled into a deficient equilibrium that is a stronger attractor than mutual cooperation. Above the upper threshold, agents are dissatisfied with mutual cooperation. Aspirations that adapt with experience (producing habituation to stimuli) do not gravitate into the window of viability; rather, they are the worst of both worlds. Habituation destabilizes cooperation and stabilizes defection. Results from the two-person problem suggest that applications to multiplex and embedded relationships will yield unexpected insights into the global dynamics of cooperation in social dilemmas.

Additional papers:
Home page of Michael Macy

An introduction to game theory:
Game Theory

Dustin Lee - March 20, 2003 - Hidden Markov models for sequence analysis

Hidden Markov models for sequence analysis: extension and analysis of the basic method
Richard Hughey and Anders Krogh
Computer Engineering, Univ. of Calif., Santa Cruz and NORDITA, Denmark

Abstract Hidden Markov models (HMMs) are a highly effective means of modeling a family of unaligned sequences or a common motif within a set of unaligned sequences. The trained HMM can then be used for discrimination or multiple alignment. The basic mathematical description of an HMM and its expectation-maximization training procedure is relatively straight-forward. In this paper, we review the mathematical extensions and heuristics that move the method from the theoretical to the practical. Then, we experimentally analyze the effectiveness of model regularization, dynamic model modification, and optimization strategies. Finally it is demonstrated on the SH2 domain how a domain can be found from unaligned sequences using a special model type. The experimental work was completed with the aid of the Sequence Alignment and Modeling software suite.

Additional papers:
UCSC's Sequence Alignment and Modeling System

Tutorial on hidden Markov models:
Introduction to Hidden Markov Models

Steve Durbin - March 6, 2003 - Dynamic Bayesian networks for an adaptive user interface

Empirically Grounded Decision-Theoretic Adaptation to Situation-Dependent Resource Limitations
Thorsten Bohnenberger, Boris Brandherm, Barbara Grossmann-Hutter, Dominik Heckmann, Frank Wittig
Department of Computer Science, Saarland University, Germany

This article summarizes research on several interrelated general issues that can arise in the design and development of user modeling systems: the learning and subsequent adaptation of general user models on the basis of empirical data; the modeling of temporally variable properties of users, in particular time pressure and cognitive load; and the user-adaptive planning of interactions under uncertainty. The methods and results are integrated and illustrated with a prototype of a mobile assistance system for travelers in an airport.

Additional papers:
READY project publications

Tutorial on Bayesian networks:
A Brief Introduction to Graphical Models and Bayesian Networks

Neal Richter - February 20, 2003 - Web Search Engines

A Taxonomy of Web Search
Andrei Broder
IBM Research

Classic IR (information retrieval) is inherently predicated on users searching for information, the so-called "information need". But the need behind a web search is often not informational—it might be navigational (give me the url of the site I want to reach) or transactional (show me sites where I can perform a certain transaction, e.g. shop, download a file, or find a map). We explore this taxonomy of web searches and discuss how global search engines evolved to deal with web-specific needs.

Challenges in Web Search Engines
Monika R. Henzinger, Rajeev Motwani, and Craig Silverstein
Google & Stanford University

This article presents a high-level discussion of some problems in information retrieval that are unique to web search engines. The goal is to raise awareness and stimulate research in these areas.

John Paxton - February 20, 2003 - Machine Learning for Sequential Data

Machine Learning for Sequential Data: : A Review
Thomas G. Dietterich
Oregon State University

Statistical learning problems in many elds involve sequential data. This paper formalizes the principal learning tasks and describes the methods that have been developed within the machine learning research community for addressing these problems. These methods include sliding window methods, recurrent sliding windows, hidden Markov models, conditional random elds, and graph transformer networks. The paper also discusses some open research issues.

Steve Durbin - February 13, 2003 - Reconciling ontologies

Learning to map between ontologies on the Semantic Web
AnHai Doan, Jayant Madhavan, Pedro Domingos, and Alon Halevy
Computer Science and Engineering, University of Washington

Ontologies play a prominent role on the Semantic Web. They make possible the widespread publication of machine understandable data, opening myriad opportunities for automated information processing. However, because of the Semantic Web's distributed nature, data on it will inevitably come from many different ontologies. Information processing across ontologies is not possible without knowing the semantic mappings between their elements. Manually finding such mappings is tedious, error-prone, and clearly not possible at the Web scale. Hence, the development of tools to assist in the ontology mapping process is crucial to the success of the Semantic Web.

Zuzana Gedeon - February 6, 2003 - Association rules in Recommender Algorithms

Evaluation of Recommender Algorithms for an Internet Information Broker Based on Simple Association Rules and on the Repeat-Buying Theory
Andreas Geyer-Schulz, Michael Hahsler

Association rules are a widely used technique to generate recommendations in commercial and research recommender systems. Since more and more Web sites, especially of retailers, offer automatic recommender services using Web usage mining, evaluation of recommender algorithms becomes increasingly important. In this paper we first present a framework for the evaluation of different aspects of recommender systems based on the process of discovering knowledge in databases of Fayyad et al. and then...

Neal Richter - January 23, 2003 - Language Trees and Zipping

Language Trees and Zipping
Dario Benedetto, Emanuele Caglioti, Vittorio Loreto
Physics & Mathemetics Departments, La Sapienza University, Rome, Italy

In this letter we present a very general method to extract information from a generic string of characters, e.g. a text, a DNA sequence or a time series. Based on data-compression techniques, its key point is the computation of a suitable measure of the remoteness of two bodies of knowledge. We present the implementation of the method to linguistic motivated problems, featuring highly accurate results for language recognition, authorship attribution and language classification.

Extended Comment on Language Trees and Zipping
Joshua Goodman
Microsoft Research

This is the extended version of a Comment submitted to Physical Review Letters. I first point out the inappropriateness of publishing a Letter unrelated to physics. Next, I give experimental results showing that the technique used in the Letter is 3 times worse and 17 times slower than a simple baseline. And finally, I review the literature, showing that the ideas of the Letter are not novel. I conclude by suggesting that Physical Review Letters should not publish Letters unrelated to physics.

On J. Goodman's comment to Language Trees and Zipping
Dario Benedetto, Emanuele Caglioti, Vittorio Loreto
Physics & Mathemetics Departments, La Sapienza University, Rome, Italy

Motivated by the recent submission to cond-mat archives by J. Goodman whose results apparently discredit the approach we have proposed in a recent paper (Phys. Rev. Lett. 88, 048702 (2002)), we report the results of the same experiment performed by Goodman using three different data compression schemes. As a matter of fact the three zippers display the same efficiency Goodman obtained using Naive Bayesian Methods and not, as Goodman claimed, an efficiency three times smaller. We point out the question of the extreme generality of approaches based on data compression techniques and we list a large range of potential applications, including those of interest for the physics community.

Steve Durbin - January 23, 2003 - Rule extraction from neural nets

Understanding Time-Series Networks: A Case Study in Rule Extraction
Mark W. Craven and Jude W. Shavlik
Computer Sciences Department, University of Wisconsin-Madison

A significant limitation of neural networks is that the representations they learn are usually incomprehensible to humans. We have developed an algorithm, called Trepan, for extracting comprehensible, symbolic representations from trained neural networks. Given a trained network, Trepan produces a decision tree that approximates the concept represented by the network. In this article, we discuss the application of Trepan to a neural network trained on a noisy time series task: predicting the Dollar-Mark exchange rate. We present experiments that show that Trepan is able to extract a decision tree from this network that equals the network in terms of predictive accuracy, yet provides a comprehensible concept representation. Moreover, our experiments indicate that decision trees induced directly from the training data using conventional algorithms do not match the accuracy nor the comprehensibility of the tree extracted by Trepan.

Steve Durbin - December 13, 2002 - Data cleansing

Real-world data is dirty: Data cleansing and the merge/purge problem
Mauricio A. Hernandez and Salvatore J. Stolfo
Department of Computer Science, Columbia University

The problem of merging multiple databases of information about common entities is frequently encountered in KDD and decision support applications in large commercial and government organizations. The problem we study is often called the Merge/Purge problem and is difficult to solve both in scale and accuracy. Large repositories of data typically have numerous duplicate information entries about the same entities that are difficult to cull together without an intelligent "equational theory" that identifies equivalent items by a complex, domain-dependent matching process. We have developed a system for accomplishing thes Data Cleansing task and demonstrate its use for cleansing lists of names of potential customers in a direct marketing-type application. Our results for statistically generated data are shown to be accurate and effective when processing the data multiple times using different keys for sorting on each successive pass. Combing results of individual passes using transitive closure over the independent results, produces far more accurate results at lower cost. The system provides a rule programming module that is easy to program and quite good at finding duplicates especially in an environment with massive amounts of data. This paper details improvements in our system, and reports on the successful implementation for a "real-world" database that conclusively validates our results previously achieved for statistically generated data.

Steve Durbin - November 22, 2002 - Improved information retrieval for short queries

Impact Transformation: Effective and Efficient Web Retrieval
Vo Ngoc Anh and Alistair Moffat
Department of Computer Science and Software Engineering, University of Melbourne

We extend the applicability of impact transformation, which is a technique for adjusting the term weights assigned to documents so as to boost the effectiveness of retrieval when short queries are applied to large document collections. In conjunction with techniques called quantization and thresholding, impact transformation allows improved query execution rates compared to traditional vector-space similarity computations, as the number of arithmetic operations can be reduced. The transformation also facilitates a new dynamic query pruning heuristic. We give results based upon the TREC web data that show the combination of these various techniques to yield highly competitive retrieval, in terms of both effectiveness and efficiency, for both short and long queries.

Higher Precision for Two-Word Queries
K. L. Kwok, C. S. Dept., Queens College, CUNY

Queries have specific properties, and may need individualized methods and parameters to optimize retrieval. Length is one property. We look at how two-word queries may attain higher precision by re-ranking using word co-occurrence evidence in retrieved documents. Co-occurrence within document context is not sufficient, but window context including sentence context evidence can provide precision improvements at low recall region of 4 to 10% using initial retrieval results, and positively affects pseudo-relevance feedback.

Zuzana Gedeon - November 1, 2002 - Cluster validity methods

Cluster validity methods: part I
Maria Halkidi, Yannis Batistakis, Michalis Vazirgiannis & Athens University of Economics and Business

Clustering is an unsupervised process since there are no predefined classes and no examples that would indicate grouping properties in the data set. The majority of the clustering algorithms behave differently depending on the features of the data set and the initial assumptions for defining groups. Therefore, in most applications the resulting clustering scheme requires some sort of evaluation as regards its validity. Evaluating and assessing the results of a clustering algorithm is the main subject of cluster validity. In this paper we present a review of the clustering validity and methods. More specifically, Part I of the paper discusses the cluster validity approaches based on external and internal criteria.

Cluster validity methods: part II (handed out: no online version yet)
Maria Halkidi, Yannis Batistakis, Michalis Vazirgiannis & Athens University of Economics and Business

Clustering results validation is an important topic in the context of pattern recognition. We review approaches and systems in this context. In the current part, we present a review of clustering validity approaches based on relative criteria. Also we discuss the results of experimental study based on widely known validity indices. Finally th epaper illustrates issues that are under-addressed by th erecent approaches and proposes teh research directions in the field.

Neal Richter - October 25, 2002 - Random Decision Forests

Random Decision Forests
Tin Kam Ho AT&T Bell Laboratories

Decision trees are attractive classifiers due to their high execution speed. But trees derived with traditional methods often cannot be grown to arbitrary complexity for possible loss of generalization accuracy on unseen data. The limitation on complexity usually means suboptimal accuracy on training data. Following the principles of stochastic modeling, we propose a method to construct tree-based classifiers whose capacity can be arbitrarily expanded for increases in accuracy for both training and unseen data. The essence of the method is to build multiple trees in randomly selected subspaces of the feature space. Trees in different subspaces generalize their classification in complementary ways, and their combined classification can be monotonically improved. The validity of the method is demonstrated through experiments on the recognition of handwritten digits.

C4.5 Decision Forests
Tin Kam Ho AT&T Bell Laboratories

Much of previous attention on decision trees focuses on the splitting criteria and optimization of tree sizes. The dilemma between overfitting and achievingmaximum accuracy is seldom resolved. We propose a method to construct a decision tree based classi-fier that maintains highest accuracy on training data and improves on generalization accuracy as it grows incomplexity. Trees are generated using the well-known C4.5 algorithm, and the classifier consists of multiple trees constructed in pseudo-randomly selected subspaces of the given feature space. We compare themethod to single-tree classifiers and other forest construction methods by experiments on four public datasets, where the method's superiority is demonstrated. A measure is given to describe the similarity betweentrees in a forest, and is related to the combined classification accuracy.

Steve Durbin - August 16, 2002 - Time series prediction

Noisy time series prediction using a recurrent neural network and grammatical inference
C. Lee Giles, S. Lawrence, A. C. Tsoi*
NEC Research Institute and *University of Wollongong, Australia

Financial forecasting is an example of a signal processing problem which is challenging due to small sample sizes, high noise, non-stationarity, and non-linearity. Neural networks have been very successful in a number of signal processing applications. We discuss fundamental limitations and inherent difficulties when using neural networks for the processing of high noise, small sample size signals. We introduce a new intelligent signal processing method which addresses the difficulties. The method proposed uses conversion into a symbolic representation with a self-organizing map, and grammatical inference with recurrent neural networks. We apply the method to the prediction of daily foreign exchange rates, addressing difficulties with non-stationarity, overfitting, and unequal a priori class probabilities, and we find significant predictability in comprehensive experiments covering 5 different foreign exchange rates. The method correctly predicts the direction of change for the next day with an error rate of 47.1%. The error rate reduces to around 40% when rejecting examples where the system has low confidence in its prediction. We show that the symbolic representation aids the extraction of symbolic knowledge from the trained recurrent neural networks in the form of deterministic finite state automata. These automata explain the operation of the system and are often relatively simple. Automata rules related to well-known behavior such as trend following and mean reversal are extracted.

Steve Durbin - July 26, 2002 - Automatic thesaurus construction

Induction of Semantic Classes from Natural Language Text
Dekang Lin and Patrick Pantel
Dept. of Computing Science, University of Alberta

Many applications dealing with textual information require classification of words into semantic classes (or concepts). However, manually constructing semantic classes is a tedious task. In this paper, we present an algorithm, UNICON, for UNsupervised Induction of CONcepts. Some advantages of UNICON over previous approaches include the ability to classify words with low frequency counts, the ability to cluster a large number of elements in a high-dimensional space, and the ability to classify previously unknown words into existing clusters. Furthermore, since the algorithm is unsupervised, a set of concepts may be constructed for any corpus.

Steve Durbin - July 19, 2002 - Innovative Applications of Artificial Intelligence 2002 conference presentation

RightNow eService Center: Internet customer service using a self-learning knowledge base
Stephen D. Durbin, Doug Warner, J. Neal Richter, Zuzana Gedeon
RightNow Technologies

Delivering effective customer service via the Internet requires attention to many aspects of information management if it is to be convenient and satisfying for customers, while at the same time efficient and low in cost for the company or other organization. Built around a core FAQ document knowledge base, RightNow Web eService Center uses a variety of AI techniques to facilitate the construction, maintenance, and navigation of the knowledge base. These include collaborative filtering, swarm intelligence, natural language processing, text clustering, and classification rule learning. Customers using RightNow Web report dramatic decreases in support costs and increases in customer satisfaction due to the ease of use provided by the self-learning features of the knowledge base.

Steve Durbin - June 21, 2002 - Clustering: shared nearest-neighbor algorithm

Finding Topics in Collections of Documents: A Shared Nearest Neighbor Approach
Levent Erto"z, Michael Steinbach, Vipin Kumar
Department of Computer Science and Engineering, University of Minnesota

An important task in text mining is finding the dominant topics (and their associated documents) in a collection of documents. While traditional clustering techniques, e.g., hierarchical clustering and K-means, are often used for this task, this paper explores a new clustering algorithm which is based on a shared nearest neighbor approach. Unlike traditional clustering algorithms, not all the data is clustered, but instead, our algorithm discovers groups of documents representing strong topics and discards all documents that are not part of a strong topic. We evaluate the performance of our algorithm on real and synthetic data sets and show that the groups of documents found by our technique often represent more coherent topics than those produced by K-means. Additionally, we provide a theoretical explanation for this behavior.

Related papers from the same group:
A New Shared Nearest Neighbor Clustering Algorithm and its Applications
Has some helpful figures and applications to other data
The Challenges of Clustering High Dimensional Data
Longer paper reviewing various techniques

Steve Durbin - May 17, 2002 - Expertise locating

Expert finding for collaborative virtual environments
Mark Maybury, Ray D'Amore, and David House
MITRE Corporation

This short paper describes "Expert Finder and XperNet, two software programs that automatically profile topics of interest and expertise associated with employees based on employees' tool use, publications, project roles, and written communication with others."

Expertise Recommender: a flexible recommendation system and architecture
David W. McDonald and Mark S. Ackerman
Dept. of Information and Computer Science, Univ. of Calif., Irvine

Locating the expertise necessary to solve difficult problems is a nuanced social and collaborative problem. In organizations, some people assist others in locating expertise by making referrals. People who make referrals fill key organizational roles that have been identified by CSCW and affiliated research. Expertise locating systems are not designed to replace people who fill these key organizational roles. Instead, expertise locating systems attempt to decrease workload and support people who have no other options. Recommendation systems are collaborative software that can be applied to expertise locating. This work describes a general recommendation architecture that is grounded in a field study of expertise locating. Our expertise recommendation system details the work necessary to fit expertise recommendation to a work setting. The architecture and implementation begin to tease apart the technical aspects of providing good recommendations from social and collaborative concerns.

Doug Warner - May 3, 2002 - Wavelets

Approximate Query Processing Using Wavelets
Kaushik Chakrabarti, Minos Garofalakis, Rajeev Rastogi, Kyuseok Shim

Approximate query processing has emerged as a cost-effective approach for dealing with the huge data volumes and stringent responsetime requirements of today's Decision Support Systems (DSS). Most work in this area, however, has so far been limited in its query processing scope, typically focusing on specific forms of aggregate queries. Furthermore, conventional approaches based on sampling or histograms appear to be inherently limited when it comes to approximating the results of complex queries over high-dimensional DSS data sets. In this paper, we propose the use of multi-dimensional wavelets as an effective tool for general-purpose approximate query processing in modern, high-dimensional applications. Our approach is based on building wavelet-coefficient synopses of the data and using these synopses to provide approximate answers to queries. We develop novel query processing algorithms that operate directly on thewavelet-coefficient synopses of relational tables, allowing us to process arbitrarily complex queries entirely in the wavelet-coefficient domain. This guarantees extremely fast response times since our approximate query execution engine can dothe bulk of its processing over compact sets of wavelet coefficients, essentially postponing the expansion into relational tuples until the end-result of the query. We also propose a novel wavelet decomposition algorithm that can build these synopses in an I/O-efficient manner. Finally, we conduct an extensive experimental study with synthetic as well as reallife data sets to determine the effectiveness of our wavelet-based approach compared to sampling and histograms. Our results demonstrate that our techniques (1) provide approximate answers of better quality than either sampling or histograms, (2) offer query execution-time speedups of more than two orders of magnitude, and (3) guarantee extremely fast synopsisconstruction times that scale linearly with the size of the data.

TOPIC ISLANDS - A Wavelet Based Text Visualization System.
Nancy E. Miller, Pak Chung Wong, Mary Brewster, and Harlan Foote.
To appear in IEEE Visualization '98.

We present a novel approach to visualize and explore unstructured text. The underlying technology, called TOPIC-O-GRAPHY [TM], applies wavelet transforms to a custom digital signal constructed from words within a document. The resultant multiresolution wavelet energy is used to analyze the characteristics of the narrative flow in the frequency domain, such as theme changes, which is then to the overall thematic content of the text document using statistical methods. The thematic characteristics of a document can be analyzed at varying degrees of detail, ranging from section-sized text partitions to partitions consisting of a few words. Using this technology, we are developing a visualization system prototype known as TOPIC ISLANDS [TM] to browse a document, generate fuzzy document outlines, summarize text by levels of detail and according to user interests, define meaningful subdocuments, query text content, and provide summaries of topic evolution.

Neal Richter - March 22, 2002 - Machine Learning and Natural Language Processing

Machine Learning and Natural Language Processing
Lluís Marquez

In this report, some collaborative work between the fields of Machine Learning (ML) and Natural Language Processing (NLP) is presented. The document is structured in two parts. The first part includes a superficial but comprehensive survey covering the state-of-the-art of machine learning techniques applied to natural language learning tasks. In the second part, a particular problem, namely Word Sense Disambiguation (WSD), is studied in more detail. In doing so, four algorithms for supervised learning, which belong to different families, are compared in a benchmark corpus for the WSD task. Both qualitative and quantitative conclusions are drawn.

A Machine Learning Approach to POS Tagging
Lluís Marquez

We have applied the inductive learning of statistical decision trees and relaxation labelling to the Natural Language Processing (nlp) task of morphosyntactic disambiguation (Part Of Speech Tagging). The learning process is supervised and obtains a language model oriented to resolve pos ambiguities, consisting of a set of statistical decision trees expressing distribution of tags and words in some relevant contexts. The acquired decision trees have been directly used in a tagger that is both relatively simple and fast, and which has been tested and evaluated on the Wall Street Journal (wsj) corpus with remarkable accuracy. However, better results can be obtained by translating the trees into rules to feed a flexible relaxation labelling based tagger. In this direction we describe a tagger which is able to use information of any kind (n-grams, automatically acquired constraints, linguistically motivated manually written constraints, etc.), and in particular to incorporate the machine-learned decision trees. Simultaneously, we address the problem of tagging when only limited training material is available, which is crucial in any process of constructing, from the scratch, an annotated corpus. We show that high levels of accuracy can be achieved with our system in this situation, and report some results obtained when using it to develop a 5.5 million words Spanish corpus from scratch.

Christina Hayes - February 22, 2002 - Question answering

Christina Hayes and Chris Yahner
Math, Computer Science, MSU

Christina and Chris will give a demonstration of and discuss a question answering system they built to take natural language questions about dinosaurs and find the answers in a large text database. The Q/A system is still under development, but the goal of this colloquium is to learn about particular problems and discuss some solutions in this area.

Steve Durbin - February 15, 2002 - Ontologies in practice

Making Ontologies Work for Resolving Redundancies across Documents
J. O. Everett et al.
Xerox PARC and Fuji-Xerox Palo Alto Laboratory

Knowledge management efforts over the apst decade have produced many document collections focused on particular domains. As such systems scale up, they become unwieldy and ultimately unusable if obsolete and redundant content is not continually identified and removed.

Discovery of Ontologies from Knowledge Bases
Hendra Suryanto and Paul Compton
School of Computer Science and Engineering, University of New South Wales

Current approacheds to building knowledge-based systems propose the development of an ontology as a precursor to building the problem-solver. This paper outlines an attempt to do the reverse and discover interesting ontologies from systems built without the ontology being explicit. [...]

See also previous papers relating to the Xerox Eureka system and to use of WordNet as an ontology.

Alex Dimitrov - February 8, 2002 - Neural Coding and Information Distortion

MSU campus, Room 108 Ag. Sci., 4:00 pm

This is a talk at MSU in the Complex Biological Systems seminar series. The technique discussed is a general method of extracting meaningful information from noisy signals. The following paper describes the algorithm.

Information Distortion and Neural Coding
Tomas Gedeon, Albert E. Parker, and Alexander G. Dimitrov
Dept. of Mathematical Sciences and Center for Computational Biology, MSU

Our main interest is the question of how neural ensemble activity represents sensory stimuli. In this paper we discuss a new approach to characterizing neural coding schemes. It attempts to describe the specific stimulus parameters encoded in the neural ensemble activity and at the same time determines the nature of the neural symbols with which that information is encoded. This recently developed approach for the analysis of neural coding minimizes an intrinsic information-theoretic cost function (the information distortion) to produce a simple approximation of a coding scheme, which can be refined as more data becomes available. We study this optimization problem. The admissible region is a direct product of simplices. We show that the optimal solution always occurs at a vertex of the admissible region. This allows us to reformulate the problem as a maximization problem on the set of vertices and develop an algorithm, which, under mild conditions, always finds a local extremum. We compare the performance of this new algorithm to standard optimization schemes on synthetic cases and on physiological recordings from the cricket cercal sensory system.

Steve Durbin - January 25, 2002 - Collaborative Knowledge Building with RNW-like Software

Answer Garden 2: Merging Organizational Memory with Collaborative Help
Mark S. Ackerman, David W. McDonald
Department of Information and Computer Science, University of California, Irvine
Proceedings of the ACM Conference on Computer-Supported Cooperative Work (CSCW'96)

This research examines a collaborative solution to a common problem, that of providing help to distributed users. The Answer Garden 2 system provides a second generation architecture for organizational and communitymemory applications. After describing the need for Answer Garden 2's functionality, we describe the architecture of the system and two underlying systems, the Cafe Construction Kit and Collaborative Refinery. We also present detailed descriptions of the collaborative help and collaborative refining facilities in the Answer Garden 2 system.

Evolution of Contact Point: A Case Study of a Help Desk and Its Users
Lena Mamykina, Everyday Computing Lab, GVU Center, Georgia Institute of Technology
Catherine G. Wolf, IBM Thomas J. Watson IBM Research Center
Proceedings of the ACM 2000 Conference on Computer Supported Cooperative Work, pp. 41-48

This paper describes the evolution of a concept, Contact Point, the research process through which it evolved, and the work context and practices which drove its evolution. Contact Point is a web-based application that helps a business manage its relationships with its customers. It can also be used within a business as a means for managing the relationship between parts of the business. In this paper we describe a study of the applicability of Contact Point to the technical services organization and field personnel of a medical device manufacturer. We found that there were opportunities to potentially reduce call volume through Contact Point. We discovered, however, that the technical service representatives sometimes filled roles other than providing information in their telephone conversations with field personnel. These functions included reassuring callers that the callers' answers to questions were correct, providing a rationale for information, and redirecting calls to other departments. The ability to share a document and collaborate in real time was viewed as very valuable. We also discovered that the field personnel need information from a variety of other people in order to do their jobs. These observations were used to enhance the next iteration of Contact Point and to develop strategies for the introduction of Contact Point to users.

Xerox Creates a Knowledge-Sharing Culture Through Grassroots Efforts
Vicki J. Powers, American Productivity & Quality Center

This article, which is a Case Study of the APQC, contains a description of the much heralded Eureka system used by Xerox service reps. Eureka is not available as commercial software, but would be almost trivial to implement with RightNow Web. Also notable in the article is the list of domains of knowledge management developed at Xerox, which brings some definition to the amorphous term "knowledge management."

Experts Exchange
Check out the above website for an innovative, commercially viable, worldwide cooperative learning activity that looks like it could be built with RNW.

Steve Durbin - January 18, 2002 - Natural Language Dialog: Chatbots and Assistants

The unfriendly user: exploring social reactions to chatterbots
Antonella De Angeli, Graham I. Johnson, and Lynne Coventry
NCR Self-Service, Advanced Technology and Research
Proceedings of the International Conference on Affective Human Factors Design
Helander, Khalid and Tham, Editors
Asean Academic Press, London, 2001

This paper presents a preliminary evaluation of Alice, a chatterbot designed in order to elicit anthropomorphic attributions and emotional reactions from those who chat to 'her'. The analysis is based on both transcripts of the interaction and user comments collected in a focus group. Results suggest that the introduction of explicit anthropomorphism in Human-Computer Interaction (HCI) is a complex phenomenon, which could generate strong negative reactions from the part of the user. The finding also demonstrates the importance of placing the development of user interfaces within a social framework as the technology tends to establish relationships with users.
Note: You can chat with Alice and read about her at A.L.I.C.E. AI Foundation. Both software and the AI Markup Language basis for Alice is available there under GPL.

The Role of a Natural Language Conversational Interface in Online Sales: A Case Study
Joyce Chai, et al.
IBM T. J. Watson Research Center and MIT Artificial Intelligence Lab
International Journal of Speech Technology 4, 285-295, 2001

This paper describes the evaluation of a natural language dialog-based navigation system (HappyAssistant) that helps users access e-commerce sites to find relevant information about products and services. The prototype system leverages technologies in natural language processing and human-computer interaction to create a faster and more intuitive way of interacting with websites, especially for less experienced users. The result of a comparative study shows that users prefer the natural language-enabled navigation two to one over the menu driven navigation. In addition, the study confirmed the efficiency of using natural language dialog in terms of the number of clicks and the amount of time required to obtain the relevant information. In the case study, as compared to the menu driven system, the average number of clicks used in the natural language system was reduced by 63.2% and the average time was reduced by 33.3%.

Doug Warner - January 11, 2002 - Knowledge Gap Analysis.

Review: Knowledge Base Maintenance Using Knowledge Gap Analysis
Scott Spangler, Jeffrey Kreulen, IBM Almaden Research Center
Proceedings, KDD 2001, p462-466.

Tangential, skim only: Focused Knowledge Management
John L. Gordon, Michael Edge, NWAIAG, Blackburn College
Applications and Innovations in Intelligent Systems V. Pages 207-219. By SGES Publications. ISBN: 1-899621-19-9

Knowledge Management has been an issue within the NWAIAG since 1993 when several problem cases were identified in organisations. These cases concerned some sort of breakdown in the organisations knowledge asset. A period of research and analysis of the problems preceded the formation of a collaborative group of companies intending to develop tools and systems which support the strategic decision making process. The project, lead by the NWAIAG aims to raise the profile of a company's Knowledge Asset, particularly during the planning and management of change.

In order to help focus the views of a broad range of opinion and experience, the team decided to create a prototype Knowledge Management Tool. This tool, although not intended to be a commercial product, helped to clarify the ideas and opinions of the group and allowed them to develop a common approach. The tool quickly emerged as a bottom up method of implementation for this particular aspect of knowledge asset management. A more complete perspective was developed by refocusing on the demand for information about the knowledge asset, starting at the board room. Furthermore, considerable interest was found in the provision of a graphical visualisation of a companies knowledge asset (or at least part of it) and the dependent relationships which existed within it.

Zuzana Gedeon - December 21, 2001 - Information Bottleneck Method for Clustering.

The Power of Word Clustering for Text Classification
Noam Slonim and Naftali Tishby
Prepared: January 2001. In proceedings of the European Colloquium on IR Research, ECIR 2001.

The recently introduced Information bottleneck method provides an information theoretic framework for extracting features of one variable, that are relevant for the values of another variable. several previous works already suggested applying this method for document clustering, gene expression data analysis, spectral analysis and more. In this work we present a novel implementation of this method for supervised text classification. Specificially, we apply the information bottleneck method to find word-clusters that preserve the information about document categories and use these clusters as features for classification. Previous work used a simmilar clustering procedure to show thet word-clusters can significantly reduce the feature space dimensionality, with only a minor change in classification accuracy. In this work we reproduce these results and go further to show that when the training sample is small word clusters can yield significant improvement in classification accuracy (up to 18%) over the performance using th ewords directly.

(Optional)Multivariate Information Bottleneck
Noam Slonim and Naftali Tishby
Prepared: January 2001. In proceedings of the European Colloquium on IR Research, ECIR 2001.

The information bottleneck method is an unsupervised non-parametric data organization technique. Given a joint distribution P(A,B), this method constructs a new variable T that extracts partitions, or clusters, over the values of A that are informative to B. The information bottleneck has already been applied to document classification, gene expression, neural code, and spectral analysis. In this paper, we introduce a general principled framework for multiwariate extensions of the information bottleneck method

More complete list of papers with local gzipped copies are here.

Zuzana Gedeon - November 30, 2001

(First two papers are handouts they are in proceedings from ICKM 2001 and will be available from ACM library later)

Recent Developments in Text Summarization
Inderjeet Mani
Georgetown University and The MITRE Corporation

With the explosion in the quantity of on-line information in recent years, demand for text summarization technology is growing. Increased pressure for technology advances is coming from users of the web, on-line information sources, and new mobile devices, as well as from the need for corporate knowledge management. Commercial companies are increasingly starting to offer text summarization capabilities, often bundled with information retrieval tools. In this paper, I will discuss the significance of some recent developments summarization technology.

Summarization of Discussion Groups
R. Farrell, P. Fairweather, K. Snyder
IBM T.J.Watson Research Center

In this paper we describe an algorithm to generate textual summaries of discussion groups. Our system combines sentences extracted from individual postings into variable length summaries by utilizing the hierachical discourse context provided by discussion threads. We have incorporated this algorithm into a Web-based application caled IDS (Interactive discussion Summarizer)

Term Weighting Method based on Information Gain Ratio for Summarizing Documents retrieved by IR systems
Tatsunori Mori Miwa Kikuchi Kazufumi Yoshida
Div. of Electrical and Computer Eng., Yokohama National University

This paper proposes a new term weighting method for summarizing documents retrieved by IR systems. Unlike query-biased summarization methods, our method utilizes not the information of query, but the similarity information among original documents by hierarchical clustering. In order to map the similarity structure of the clusters into the weight of each word, we adopt the information gain ratio (IGR) of probabilistic distribution of each word as a term weight. If the amount of information of a word in a cluster increases after the cluster is partitioned into sub-clusters, we may consider that the word contributes to determine the structure of the sub-clusters. The IGR is a measure to express the degree of such contribution. we will show the effectiveness of our mathod based on th eIGR by comparison with other systems

Zuzana Gedeon - November 16, 2001

CIKM 2001
10th International Conference on Information and Knowledge Management
I will briefly summarize some of the talks, I think are interesting, and we can then decide on topics to cover in more detail later. Ones that I found interesting are listed below.

User navigation for web searching
Using navigation data to improve IR functions in the context of Web search
Mark H. Hansen, Elizabeth Shriver
Bell-labs, Lucent

As part of the process of delivering content, devices like proxies and gateways log valuable information about the activities and navigation patterns of users on the Web. In this study, we consider how this navigation data can be used to improve Web search. A query posted to a search engine together with the set of pages accessed during a search task is known as a search session. We develop a mixture model for the observed set of search sessions, and propose variants of the classical EM algorithm for training. The model itself yields a type of navigation-based query clustering. By implicitly borrowing strength between related queries, the mixture formulation allows us to identify the "highly relevant" URLs for each query cluster. Next, we explore methods for incorporating existing labeled data (the Yahoo! directory, for example) to speed convergence and help resolve low-traffic clusters. Finally, the mixture formulation also provides for a simple, hierarchical display of search results based on the query clusters. The effectiveness of our approach is evaluated using proxy access logs for the outgoing Lucent proxy.

Model-based Feedback in the Language Modeling Approach to Information Retrieval
Chengxiang Zhai
Carnegie Mellon University

This is just my personal summary: feedback into original search using statistical methods, instead of query expansion using keywords ..., use results of original query as a new feedback to the search and compare new results based on KL-divergence to to the results retrieved originally.

Steve Durbin - November 9, 2001

Introduction to small-world networks

The physics of the Web
Albert-László Barabási
Department of Physics, University of Notre Dame
The physics of the Web

Statistical mechanics is offering new insights into the structure and dynamics of the Internet, the World Wide Web and other complex interacting systems.

The Small-World Phenomenon: an Algorithmic Perspective
Jon Kleinberg
Department of Computer Science, Cornell University

Long a matter of folklore, the "small-world phenomenon"—the principle that we are all linked by short chains of acquaintances—was inaugurated as an area of experimental study in the social sciences through the pioneering work of Stanley Milgram in the 1960's. This work was among the first to make the phenomenon quantitative, allowing people to speak of the "six degrees of separation" between any two people in the United States. Since then, a number of network models have been proposed as frameworks in which to study the problem analytically. One of the most refined of these models was formulated in recent work of Watts and Strogatz; their framework provided compelling evidence that the small-world phenomenon is pervasive in a range of networks arising in nature and technology, and a fundamental ingredient in the evolution of the World Wide Web.

But existing models are insufficient to explain the striking algorithmic component of Milgram's original findings: that individuals using local information are collectively very effective at actually constructing short paths between two points in a social network. Although recently proposed network models are rich in short paths, we prove that no decentralized algorithm, operating with local information only, can construct short paths in these networks with non-negligible probability. We then define an infinite family of network models that naturally generalizes the Watts-Strogatz model, and show that for one of these models, there is a decentralized algorithm capable of finding short paths with high probability. More generally, we provide a strong characterization of this family of network models, showing that there is in fact a unique model within the family for which decentralized algorithms are effective.

J Kleinberg. The small-world phenomenon: An algorithmic perspective. http://www.cs.cornell.edu/home/kleinber/swn.ps

Doug Warner - November 2, 2001

A Fourier-Spectrum Based Approach to Aggregate and Visualize Decision Trees for Mobile Applications
Hillol Kargupta and Byung-Hoon Park

This paper presents a novel Fourier analysis-based technique to aggregate, transmit, and visualize decision trees in a mobile environment. Fourier representation of a decision tree has several useful properties that are particularly useful for mining continuous data streams from small mobile computing devices. This paper presents algorithms to compute the Fourier spectrum of a decision tree and the vice versa. It offers a framework to aggregate decision trees in their Fourier representations. It also describes a touch-pad/ticker-based approach to visualize decision trees using their Fourier spectrum and an implementation for PDAs.

Neal Richter - Oct 26, 2001

Self-Organizing Maps Optimizations & Insights
Read One of these Two:

Genetic Synthesis of Unsupervised Learning Algorithms

This paper presents new unsupervised learning algorithms that have been synthesized using a genetic approach. A set of such learning algorithms has been compared with the classical Kohonen's Algorithm on the Self-Organizing Map and has been found to provide a better performance measure. This study indicates that there exist many unsupervised learning algorithms that lead to an organization similar to that of Kohonen's Algorithm, and that genetic algorithms can be used to search for optimal algorithms and optimal architectures for the unsupervised learning.

@misc{ das-genetic,
  author = "Ali Dasdan",
  title = "Genetic Synthesis of Unsupervised Learning Algorithms",
  url = "citeseer.nj.nec.com/20882.html" }

GTM: A Principled Alternative to the Self-Organizing Map
Bishop, Svensen, & Williams

The Self-Organizing Map (SOM) algorithm has been extensively studied and has been applied with considerable success to a wide variety of problems. However, the algorithm is derived from heuristic ideas and this leads to a number of significant limitations. In this paper, we consider the problem of modelling the probability density of data in a space of several dimensions in terms of a smaller number of latent, or hidden, variables. We introduce a novel form of latent variable model, which we call the GTM algorithm (for Generative Topographic Map), which allows general non-linear transformations from latent space to data space, and which is trained using the EM (expectationmaximization). We demonstrate the performance of the GTM algorithm on simulated data from flow diagnostics for a multi-phase oil pipeline

Steve Durbin - Oct 19, 2001

Self-organizing maps of document collections

Self-Organizing Maps of Document Collections: A New Approach to Interactive Exploration
Krista Lagus, Timo Honkela, Samuel Kaski, and Teuvo Kohonen
Neural Networks Research Centre, Helsinki University of Technology

Powerful methods for interactive exploration and search from collections of free-form textual documents are needed to manage the ever-increasing flood of digital information. In this article we present a method, WEBSOM, for automatic organization of full-text document collections using the self-organizing map (SOM) algorithm. The document collection is ordered onto a map in an unsupervised manner utilizing statistical information of short word contexts. The resulting ordered map where similar documents lie near each other thus presents a general view of the document space. With the aid of a suitable (WWWbased) interface, documents in interesting areas of the map can be browsed. The browsing can also be interactively extended to related topics, which appear in nearby areas on the map. Along with the method we present a case study of its use.

K. Lagus, T. Honkela, S. Kaski, & T. Kohonen.
"Self-Organizing Maps of Document Collections: A New Approach to
   Interactive Exploration,"
In Proceedings of the International Conference on Knowledge Discovery
   and Data Mining, 1996.

WEBSOM demo site: WEBSOM - A novel SOM-based approach to free-text mining
Graphical views of medical and financial data, and the Internet: Visual Net

Python - Oct 12, 2001

Python: choice language for prototyping?

No official reading papers.
Suggested: Python book, by O'REILLY
We will be discussing the python features.
Come if you know, or would like to know about this programming language.
Other web sites of interest:
Python language website has a wealth of info
Link to Python Tutorial, Language/Library Reference and more from Python site
Python Cookbook
Python DevCenter

Zuzana Gedeon - Oct 5, 2001

Finding temporal topics based on graph theory.

Alexandrin Popescul, Gary Flake, Steve Lawrence, Lyle H. Ungar, C. Lee Giles
Clustering and Identifying Temporal Trends in Document Databases (2000)
DaimlerChrysler Research & Technology; Stanford University

We introduce a simple and efficient method for clustering and identifying temporal trends in hyper-linked document databases. Our method can scale to large datasets because it exploits the underlying regularity often found in hyper-linked document databases. Because of this scalability, we can use our method to study the temporal trends of individual clusters in a statistically meaningful manner. As an example of our approach, we give a summary of the temporal trends found in a scientific literature database with thousands of documents.

@inproceedings{ popescul00clustering,
  author = "Alexandrin Popescul and Gary Flake
     and Steve Lawrence and Lyle Ungar and C. Lee Giles",
  title = "Clustering and Identifying Temporal Trends in Document
  booktitle = "Advances in Digital Libraries, ADL 2000",
  address = "Washington, DC",
  pages = "173-182",
  year = "2000",
  url = "citeseer.nj.nec.com/popescul00clustering.html" }

Erez Hartuv, Ron Shamir
A Clustering Algorithm based on Graph Connectivity (1999)
Dept. of Computer Sciences, Tel-Aviv University

We have developed a novel algorithm for cluster analysis that is based on graph theoretic techniques. A similarity graph is defined and clusters in that graph correspond to highly connected subgraphs. A polynomial algorithm to compute them efficiently is presented. Our algorithm produces a solution with some provably good properties and performs well on simulated and real data. Keywords: Algorithms, Clustering, Minimum cut, Graph connectivity, diameter.


Steve Durbin - Sept 28, 2001 - Conversational access to information

Mehmet H. Goker, Cynthia A. Thompson
Personalized Conversational Case-Based Recommendation
Dept. of Computer Science, NEC Research Institute, Princeton

In this paper, we describe the Adaptive Place Advisor, a user adaptive, conversational recommendation system designed to help users decide on a destination, specifically a restaurant. We view the selection of destinations as an interactive, conversational process, with the advisory system inquiring about desired item characteristics and the human responding. The user model, which contains preferences regarding items, attributes, values, value combinations, and diversification, is also acquired during the conversation. The system enhances the user's requirements with the user model and retrieves suitable items from a case-base. If the number of items found by the system is unsuitable (too high, too low) the next attribute to be constrained or relaxed is selected based on the information gain associated with the attributes. We also describe the current status of the system and future work.

Neal Richter - Sept 21, 2001 - Feature Selection for Text

David D. Lewis
Feature Selection and Feature Extraction for Text Categorization (1992)

The effect of selecting varying numbers and kinds of features for use in predicting category membership was investigated on the Reuters and MUC-3 text categorization data sets. Good categorization performance was achieved using a statistical classifier and a proportional assignment strategy. The optimal feature set size for word-based indexing was found to be surprisingly low (10 to 15 features) despite the large training sets. The extraction of new text features by syntactic analysis and feature clustering was investigated on the Reuters data set. Syntactic indexing phrases, clusters of these phrases, and clusters of words were all found to provide less effective representations than individual words.

@inproceedings{ lewis92feature,
    author = "D. D. Lewis",
    title = "{Feature Selection and Feature Extraction for Text
    booktitle = "Proceedings of Speech and Natural Language Workshop",
    publisher = "Morgan Kaufmann",
    address = "San Mateo, California",
    pages = "212-217",
    year = "1992",
    url = "citeseer.nj.nec.com/lewis92feature.html" }

C. Cardie and D. Pierce.
Error-Driven Pruning of Treebank Grammars for Base Noun Phrase Identification. ACL/Coling-98, 1998, 218-224. Association for Computational Linguistics.

This paper addresses the problem of handling skewed class distributions within the case-based learning (CBL) framework. We first present as a baseline an informationgain-weighted CBL algorithm and apply it to three data sets from natural language processing (NLP) with skewed class distributions. Although overall performance of the baseline CBL algorithm is good, we show that the algorithm exhibits poor performance on minority class instances. We then present two CBL algorithms designed to improve the performance of minority class predictions. Each variation creates test-case-specific feature weights by first observing the path taken by the test case in a decision tree created for the learning task, and then using pathspecific information gain values to create an appropriate weight vector for use during case retrieval. When applied to the NLP data sets, the algorithms are shown to significantly increase the accuracy of minority class predictions while maintaining or improving overall classification accuracy.

Round-Robin DeathMatch! (1 paper each) - September 7, 2001 - Hierarchical Taxonomies using Divisive Partitioning.

Zuzana Gedeon
Hierarchical Taxonomies using Divisive Partitioning.(1998)
Daniel Boley

We propose an unsupervised divisive partitioning algorithm for document data sets which enjoys many favorable properties. In particular, the algorithm shows excellent scalability to large data collections and produces high quality clusters which are competitive with other clustering methods. The algorithm yields information on the significant and distinctive words within each cluster, and these words can be inserted into the naturally occuring hierarchical structure produced by the algorithm. The result is an automatically generated hierarchical topical taxonomy of a document set. In this paper, we show how the algorithm's cost scales up linearly with the size of the data, illustrate experimentally the quality of the clusters produced, and show how the algorithm can produce a hierarchical topical taxonomy. Keywords: unsupervised clustering, hierarchical taxonomy, scalability, text document data mining

@misc{ boley98hierarchical,
  author = "D. Boley",
  title = "Hierarchical taxonomies using divisive partitioning",
  text = "D. L. Boley. Hierarchical taxonomies using divisive
     partitioning. Technical Report TR-98-012, Department of Computer
     Science, University of Minnesota, Minneapolis, 1998.",
  year = "1998",
  url = "citeseer.nj.nec.com/boley98hierarchical.html" }

Steve Durbin
A Procedure for Multi-Class Discrimination and Some Linguistic Applications by V. Pericliev and R.E. Valdes-Perez
(COLING-ACL), Montreal, 1998

The paper describes a novel computational tool for multiple concept learning. Unlike previous approaches like ID3 or C4.5, whose major goal is prediction on unseen instances rathe r than the legibility of the output, our MPD (Maximally Parsimonious Discrimination) program emphasizes the conciseness and intelligibility of the resultant class descriptions, using three intuitive simplicity criteria to this end. We illustrate MPD with some additional application s than those commonly associated with the mentioned algorithms, such as learning verbal case f rames, translational correspondences or morphological rules. These include componential analys is (in lexicology and phonology), language typology, and speech pathology.

Neal Richter
A learner-independent evaluation of the usefulness of statistical phrases for automated text categorization.
by Maria Fernanda Caropreso, Stan Matwin, and Fabrizio Sebastiani
Text Databases and Document Management: Theory and Practice

In this work we investigate the usefulness of n-grams for document indexing in text categorization (TC). We call n-gram a set Gk of n word stems, and we say that Gk occurs in a document Dj when a sequence of words appears in Dj that, after stop word removal and stemming, consists exactly of the n stems in Gk, in some order. Previous researches have investigated the use of n-grams (or some variant of them) in the context of specific learning algorithms, and thus have not obtained general answers on their usefulness for TC. In this work we investigate the usefulness of n-grams in TC independently of any specific learning algorithm. We do so by applying feature selection to the pool of all k-grams (k <= n), and checking how many n-grams score high enough to be selected in the top sigma k-grams. We report the results of our experiments, using various feature selection measures and varying values of sigma, performed on the Reuters-21578 standard TC benchmark. We also report results of making actual use of the selected n-grams in the context of a linear classifier induced by means of the Rocchio method.

Round-Robin DeathMatch! (1 paper each) - August 31, 2001

Bikramjit Banerjee
Question Answering by Passage Selection by C.L.A Clarke, et al.
Proceedings of the TREC-9, 2000

Abstract: Not Availiable

Neal Richter
FALCON: Boosting Knowledge for Answer Engines by S. Harabagiu, et al.
Proceedings of the TREC-9, 2000

This paper presents the features of Falcon, an answer engine that integrates different forms of syntactic, semantic and pragmatic knowledge for the goal of achieving better performance. The answer engine handles question reformulations, finds the expected answer type from a large hierarchy that incorporates the WordNet semantic net and extracts answers after performing unifications on the semantic forms of the question and its candidate answers. To rule out erroneous answers, it provides a justification option, implemented as an abductive proof. In TREC-9, Falcon generated a score of 58% for short answers and 76% for long answers.

Zuzana Gedeon
Representing sentence structure in HMM for information extraction by S.Ray, M.Craven
IJCAI-01 ie.hmm.ijcai01.pdf

We study the application of Hidden Markov Models (HMMs) to learning information extractors for n-ary relations from free text. We propose an approach to representing the grammatical structure of sentences in the states of the model. We also investigate using an objective function during HMM training which maximizes the ability of the learned models to identify the phrases of interest. We evaluate our methods by deriving extractors for two binary relations in biomedical domains. Our experiments indicate that our approach learns more accurate models than several baseline approaches.

Steve Durbin
Scaling Question Answering to the Web by U of Washington.

The wealth of information on the web makes it an attractive resource for seeking quick answers to simple, factual questions such as ``who was the first American in space?'' or ``what is the second tallest mountain in the world?'' Yet today's most advanced web search services (e.g., Google and AskJeeves) make it surprisingly tedious to locate answers to such questions. In this paper, we extend question-answering techniques, first studied in the information retrieval literature, to the web and experimentally evaluate their performance.
First we introduce MULDER, which we believe to be the first general-purpose, fully-automated question-answering system available on the web. Second, we describe MULDER's architecture, which relies on multiple search-engine queries, natural-language parsing, and a novel voting procedure to yield reliable answers coupled with high recall. Finally, we compare MULDER's performance to that of Google and AskJeeves on questions drawn from the TREC-8 question track. We find that MULDER's recall is more than a factor of three higher than that of AskJeeves. In addition, we find that Google requires 6.6 times as much user effort to achieve the same level of recall as MULDER.

Neal Richter, Bikramjit Banerjee - August 17, 2001

Review of a few interesting papers from IJCAI-2001

Handout on Question Answering Systems
Tutorial by Dan Molovan & Sandra Harabagiu

Several Other Papers to Peruse Add Links Later
NLP Driver IR
Automatic Sumamry Generation

Steve Durbin, Zuzana Gedeon - August 10, 2001 - New Zealand Digital Library Project overview

Cancelled due to company party. Have a look for future reference!

The goal of our research program is to explore the potential of internet-based digital libraries. Our vision is to develop systems that automatically impose structure on anarchic, uncatalogued, distributed repositories of information, thereby providing information consumers with effective tools to locate what they need and to peruse it conveniently and comfortably.
Project members are actively working on techniques for creating, managing, and and mainatining collections; extracting metadata from legacy documents; analysing library usage and user needs; Maori, Arabic and Chinese language systems; internationalising the library interface; optical music recognition and musical collections; novel interfaces for formulating queries and visualising results; novel interfaces for browsing metadata; text mining for keyphrases, acronyms, and other metadata; keyphrase extraction and phrase-based browsing; and other research topics.

This page has links to a variety of collections in different languages that make use of the NZDL tools. Documentation, research descriptions, software downloads(GPL), and publications are available.
New Zealand Digital Library

A very readable review paper by project members is:
Applications of Machine Learning in Information Retrieval
J. Cunningham, J. Littin, and I. H. Wittin
Dept. of Computer Science, Univ. of Waikato, New Zealand

 Proceedings of the Ninth National Conference on Artificial
 San Jose CA, AAAI Press, pp. 47-52.
 Cunningham, S. J., Litten, J., Witten, I. H. (1997).
 Applications of machine learning in Information retrieval.
 Working paper 97/6. C.S department, University of Waikato, New

Zuzana Gedeon - July 27, 2001 - Novelty Detection, DA and other new methods.

A Hierarchical Probabilistic Model for Novelty Detection in Text (1999)
L. Douglas Baker, Thomas Hofmann Andrew K. McCallum, Yiming Yang

Topic Detection and Tracking (TDT) is a variant of classification in which the classes are not known or fixed in advance. Consider for example an incoming stream of news articles or email messages that are to be classified by topic; new classes must be created as new topics arise. The problem is a challenging one for machine learning. Instances of new topics must be recognized as not belonging to any of the existing classes (detection), and instances of old topics must be correctly classified (tracking)&#151;often with extremely little training data per class. This paper proposes a new approach to TDT based on probabilistic, generative models. Strong statistical techniques are used to address the many chalenges: hierarchical shrinkage for sparse data, statistical "garbage collection" for new event detection, clustering in time to separate the different events of a common topic, and deterministic annealing for creating the hierarchy. Preliminary experimental results show promise.

@misc{ baker99hierarchical,
  author = "D. Baker and T. Hofmann and A. McCallum and Y. Yang",
  title = "A hierarchical probabilistic model for novelty detection in
  text = "D. Baker, T. Hofmann, A. McCallum, and Y. Yang. 1999. A
     hierarchical probabilistic model for novelty detection in
     text. In ICML-99. In submission.",
  year = "1999",
  url = "citeseer.nj.nec.com/article/baker99hierarchical.html" }

Neal Richter - July 13, 2001 - Adaptive Clustering

Cluster-Based Adaptive Information Retrieval
Jay N. Bhuyany, Jitender S. Deogunz, Vijay V. Raghavan

This paper discusses the issues involved in the design of a complete information retrieval system based on useroriented clustering schemes. Clusters are constructed taking into account the users' perception of similarity between documents. The system accumulates feedback from the users and employs it to construct useroriented clusters. An optimization function to improve the effectiveness of the clustering process is developed. A retrieval process based on the clustering scheme is described. The system developed is experimentally validated and compared with existing systems.

Adaptive Clustering Of Hypermedia Documents
Andrew Johnson, Farshad Fotouhi

A hypermedia system connects various types of information into a network where related nodes of information (text, audio, video) are connected by links. Clustering these nodes is an effective way to reduce information-overhead, allowing the user to browse through the clusters as well as the individual nodes. In this paper, we compare the use of two adaptive algorithms (genetic algorithms, and neural networks) in clustering hypermedia documents. These clusters allow the user to index into this overwhelming number of nodes and find needed information quickly. We base the clustering on the user's paths through the hypermedia document and not on the content of the nodes or the structure of the links in the document, thus the clustering reflects the unique relationships each user sees among the nodes. The original hypermedia document remains untouched, however each user will now have a personalized index into this document.

@article{ johnson96adaptive,
    author = "Andrew Johnson and Farshad Fotouhi",
    title = "Adaptive Clustering of Hypermedia Documents",
    journal = "Information Systems",
    volume = "21",
    number = "6",
    pages = "459-473",
    year = "1996",
    url = "citeseer.nj.nec.com/article/johnson94adaptive.html" }

Bikramjit Banerjee - June 22, 2001 - Scalable BIRCH

Clustering Large Datasets in Arbitrary Metric Spaces
V. Ganti, R. Ramakrishnan and J. Gehrke

Clustering partitions a collection of objects into groups called clusters, such that similar objects fall into the same group. Similarity between objects is defined by a distance function satisfying the triangle inequality; this distance function along with the collection of objects describes a distance space. In a distance space, the only operation possible on data objects is the computation of distance between them. All scalable algorithms in the literature assume a special type of distance space, namely a k-dimensional vector space, which allows vector operations on objects.We present two scalable algorithms designed for clustering very large datasets in distance spaces. Our first algorithm BUBBLE is, to our knowledge, the first scalable clustering algorithm for data in a distance space. Our second algorithm BUBBLE-FM improves upon BUBBLE by reducing the number of calls to the distance function, which maybe computationally very expensive. Both algorithms make only a single scan over the database while producing high clustering quality. In a detailed experimental evaluation, we study both algorithms in terms of scalability and quality of clustering. We also show results of applying the algorithms to a real-life dataset.

 V. Ganti, R. Ramakrishnan, J. Gehrke, A. Powell, and J. French.
 Clustering large datasets in arbitrary metric spaces.
 Technical report, University of Wisconsin-Madison, 1998.

Doug Warner - June 15, 2001 - Targeted Marketing

Segmentation-Based Modeling for Advanced Targeted Marketing
C. Apte, E. Bibelnieks, R. Natarajan, E. Pednault, F. Tipu, D. Campbell, B. Nelson
IBM T.J. Watson Research Center and Fingerhut Business Intelligence

Fingerhut Business Intelligence (BI) has a long and successful history of building statistical models to predict consumer behavior. The models constructed are typically segmentation-based models in which the target audience is split into subpopulations (i.e., customer segments) and individually tailored statistical models are then developed for each segment. Such models are commonly employed in the direct-mail industry; however, segmentation is often performed on an ad-hoc basis without directly considering how segmentation affects the accuracy of the resulting segment models. Fingerhut B I approached IBM Research with the problem of how to build segmentation-based models more effectively so as to maximize predictive accuracy. The IBM Advanced Targeted Marketing&#151;Single Events(TM) (IBM ATM -SE(TM)) solution is the result of IBM Research and Fingerhut BI directing their efforts jointly towards solving this problem. This paper presents an evaluation of ATMSE's modeling capabilities using data from Fingerhut's catalog mailings.

Steve Durbin - June 8, 2001 - Ontologies I

Using WordNet in a Knowledge-Based Approach to Information Retrieval
R. Richardson and A. F. Smeaton
Dublin City University

The application of natural language processing tools and techniques to information retrieval tasks has long since been identified as potentially useful for the quality of information retrieval. Traditionally, IR has been based on matching words or terms in a query with words or terms in a document. In this paper we introduce an approach to IR based on computing a semantic distance measurement between concepts or words and using this word distance to compute a similarity between a query and a document. Two such semantic distance measures are presented in this paper and both are benchmarked on queries and documents from the TREC collection. Although our results in terms of precision and recall are disappointing, we rationalise this in terms of our experimental setup and our results show promise for future work in this area.

Steve Durbin - May 11, 2001 - Topic words

Finding Topic Words for Hierarchical Summarization
Dawn Laurie, W. Bruce Croft, and Arnold Rosenberg
University of Massachusetts, Amherst

Hierarchies have long been used for categorization, summarization, and access to information. In this paper we define summarization in terms of a probabilistic language model and use the definition to explore a new technique for automatically generating topic hierarchies by applying a graph-theoretic algorithm, which is an approximation of the Dominating Set Problem. The algorithm efficiently generates terms according to a language model. We compare the new technique to previous methods proposed for constructing topic hierarchies including subsumption and lexical hierarchies, as well as words found using TF-IDF. Our results show that the new technique performs as well as or better than these other techniques.

Neal Richter - May 4, 2001 - Search Engines

The Anatomy of a Large-Scale Hypertextual Web Search Engine
Sergey Brin and Lawrence Page, Founders of Google

In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. The prototype with a full text and hyperlink database of at least 24 million pages is available at http://google.stanford.edu/ To engineer a search engine is a challenging task. Search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms. They answer tens of millions of queries every day. Despite the importance of large-scale search engines on the web, very little academic research has been done on them. Furthermore, due to rapid advance in technology and web proliferation, creating a web search engine today is very different from three years ago. This paper provides an in-depth description of our large-scale web search engine&#151;the first such detailed public description we know of to date. Apart from the problems of scaling traditional search techniques to data of this magnitude, there are new technical challenges involved with using the additional information present in hypertext to produce better search results. This paper addresses this question of how to build a practical large-scale system which can exploit the additional information present in hypertext. Also we look at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want.

Fulcrum SearchServer 4.0
Hummingbird Ltd
SearchServer Technical Whitepaper

Fulcrum SearchServer is a robust, multi-platform indexing and retrieval engine that delivers full-function text retrieval capabilities to the end user either from a custom desktop application or from the Web. SearchServer features a comprehensive set of advanced text retrieval capabilities, a scaleable client/server distributed processing architecture, open, standards-based interfaces and superior performance. These features are all built on a proven search and retrieval technology. Long-term compatibility and interoperability with other system components is assured through Hummingbird's commitment to industry standards&#151;including use of the familiar SQL query and data manipulation model and the Open Database Connectivity (ODBC) standard. SearchServer accesses text in its original format and location, ensuring seamless integration with existing corporate infrastructures such as word processors and databases. The Hummingbird application development products, used in conjunction with other popular development tools, enable rapid prototyping, development and deployment of SearchServer-based custom application solutions.

Bikramjit Banerjee - April 27, 2001 - Automated answering of questions

An Intelligent System for Question Answering
Sanda M. Harabagiu, Dan I. Moldovan

This paper introduces a system intended for question answering based on abductive inference. The system uses a large knowledge base structured around WordNet. Texts and questions are uniformly processed and semantic paths between concepts are established using a marker propagation method. The paths bring forward inferences and implicatures that are otherwise difficult to extract. Implicit inferences contribute to the information provided by explicit inferences, producing answers for a variety of question types.

Zuzana Gedeon - April 20, 2001 - Frequent sets - an approach for meaningfull labeling?

Finding All Maximal Frequent Sequences in Text
Helena Ahonen-Myka

In this paper we present a novel algorithm for discovering maximal frequent sequences in a set of documents, i.e., such sequences of words that are frequent in the document collection and, moreover, that are not contained in any other longer frequent sequence. A sequence is considered to be frequent if it appears in at least oe documents, when oe is the frequency threshold given. Our approach combines bottom-up and greedy methods, and, hence, is able to extract maximal sequences without considering all the frequent subsequences of them. This is a necessity, since maximal frequent sequences in documents may be rather long. When we have found the set of maximal frequent sequences for each...

Helena Ahonen. Finding all maximal frequent sequences in text.
In ICML99 Workshop, Machine Learning in Text Data Analysis, Bled,
   Slovenia, 1999.

CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets
Jian Pei, Jiawei Han, and Runying Mao

Association mining may often derive an undesirably large set of frequent itemsets and association rules. Recent studies have proposed an interesting alternative: mining frequent closed itemsets and their corresponding rules, which has the same power as association mining but substantially reduces the number of rules to be presented. In this paper, we propose an efficient algorithm, CLOSET, for mining closed itemsets, with the development of three techniques: (1) applying a compressed, frequent pattern tree FP-tree structure for mining closed itemsets without candidate generation, (2) developing a single prefix path compression technique to identify frequent closed itemsets quickly, and (3)...

Neal Richter - March 30, 2001 - Summarization & Data Mining

Automatic Labeling of Document Clusters
Alexandrin Popescul, Lyle H. Ungar (U of Penn.)

Automatically labeling document clusters with words which indicate their topics is difficult to do well. The most commonly used method, labeling with the most frequent words in the clusters, ends up using many words that are virtually void of descriptive power even after traditional stop words are removed. Another method, labeling with the most predictive words, often includes rather obscure words. We present two methods of labeling document clusters motivated by the model that words are generated by a hierarchy of mixture components of varying generality. The first method assumes existence of a document hierarchy (manually constructed or resulting from a hierarchical clustering algorithm) and uses a 2 test of significance to detect different word usage across categories in the hierarchy. The second method selects words which both occur frequently in a cluster and effectively discriminate the given cluster from the other clusters. We compare these methods on abstracts of documents selected from a subset of the hierarchy of the Cora search engine for computer science research papers. Labels produced by our methods showed superior results to the commonly employed methods.

@misc{ popescul-automatic,
  author = "Alexandrin Popescul and Lyle H. Ungar",
  title = "Automatic Labeling of Document Clusters",
  url = "citeseer.nj.nec.com/popescul00automatic.html" }

A Mutually Beneficial Integration of Data Mining and Information Extraction
Un Yong Nahm and Raymond J. Mooney (U of Texas)

Text mining concerns applying data mining techniques to unstructured text. Information extraction (IE) is a form of shallow text understanding that locates specific pieces of data in natural language documents, transforming unstructured text into a structured database. This paper describes a system called DiscoTEX, that combines IE and data mining methodologies to perform text mining as well as improve the performance of the underlying extraction system. Rules mined from a database extracted from a corpus of texts are used to predict additional information to extract from future documents, thereby improving the recall of IE. Encouraging results are presented on applying these techniques to a corpus of computer job postings from an Internet newsgroup.

Zuzana Gedeon - March 23, 2001 - Deterministic Annealing

Deterministic Annealing for Clustering, Compression, Classification, Regression, and Related Optimization Problems
K. Rose

The deterministic annealing approach to clustering and its extensions has demonstrated substantial performance improvements over standard supervised and unsupervised learning methods in a variety of important applications including compression, estimation, pattern recognition and classification, and statistical regression. The method offers three important features: 1) the ability to avoid many poor local optima; 2) applicability to many different structures/architectures; and 3) the ability to minimize the right cost function even when its gradient vanish almost everyvhere, as is the case of the empirical classification error...

This paper describes deterministic annealing method and its aplicability for various clustering problems. It has been used in speech recognition and vector quantization. It is long paper (30 pages) so we should skip most of the details there. read first 3 pages (introduction + part A. in chapter II. up to (principled derivation...). On page 9 (pp.2218) is an algorithm description.

Kenneth Rose. Deterministic annealing for clustering, compression,
classification, regression, and related optimization problems.
Proceedings of the IEEE, 86:11:2210-2239, 1998.

Clustering algorithms for cricket neural data.
Zuzana's MS project report

In our paper we describe the deterministic annealing approach to clustering, test it on two and three dimensional data sets with and without the noise, and propose a novel application of this method to clusering of cricket neural data.

Steve Durbin - March 16, 2001 - Machine Translation

Toward a Scoring Function for Quality-Driven Machine Translation
Douglas A. Jones and Gregory M. Rusk

We describe how we constructed an automatic scoring function for machine translation quality; this function makes use of arbitrarily many pieces of natural language processing software that has been designed to process English language text. By machine-learning values of functions available inside the software and by constructing functions that yield values based upon the software output, we are able to achieve preliminary, positive results in machine-learning the difference between human-produced English and machine-translation English. We suggest how the scoring function may be used for MT system development.

Doug Warner - March 2, 2001 - Summarization

Query-Relevant Summarization Using FAQs
Adam Berger, Vibhu O. Mittal

This paper introduces a statistical model for query-relevant summarization: succinctly characterizing the relevance of a document to a query. Learning parameter values for the proposed model requires a large collection of summarized documents, which we do not have, but as a proxy, we use a collection of FAQ (frequently-asked question) documents. Taking a learning approach enables a principled, quantitative evaluation of the proposed system, and the results of some initial experiments&#151;on a collection of Usenet FAQs and on a FAQ-like set of customer-submitted questions to several large retail companies&#151;suggest the plausibility of learning for summarization.

@misc{ vibhu-queryrelevant,
  author = "Adam Berger Vibhu",
  title = "Query-Relevant Summarization using FAQs",
  url = "citeseer.nj.nec.com/341776.html" }

For review, see also July 28, 2000: Summarization.

Steve Durbin - February 23, 2001 - Shallow Parsing

A Rule-Based Approach to Prepositional Phrase Attachment Disambiguation
Eric Brill, Phillip Resnik

In this paper, we describe a new corpus-based approach to prepositional phrase attachment disambiguation, and present results comparing performance of this algorithm with other corpus-based approaches to this problem.

[This illustrates adaptation of the Brill method to problems similar to those of identifying negation and adverbial multiplication as needed for Emotix. See also description of Brill method.]

E. Brill and P. Resnik.
A rule-based approach to prepositional phrase attachment
In 15th International Conference on Computational Linguistics
   (COLING94) , Kyoto, Japan, 1994.

Bikramjit Banerjee - February 16, 2001 - Distributional Clustering

Distributional Clustering of Words for Text Classification
L. Douglas Baker, Andrew McCallum

This paper applies Distributional Clustering (Pereira et al. 1993) to document classification. The approach clusters words into groups based on the distribution of class labels associated with each word. Thus, unlike some other unsupervised dimensionality-reduction techniques, such as Latent Semantic Indexing, we are able to compress the feature space much more aggressively, while still maintaining high document classification accuracy. Experimental results obtained on three real-world data sets show that we can reduce the feature dimensionality by three orders of magnitude and lose only 2% accuracy&#151;significantly better than Latent Semantic Indexing (Deerwester et al. 1990), class-based clustering (Brown et al. 1992), feature selection by mutual information (Yang and Pederson 1997), or Markovblanket-based feature selection (Koller and Sahami 1996). We also show that less aggressive clustering sometimes results in improved classification accuracy over classification without clustering.

@inproceedings{ baker98distributional,
    author = "L. Douglas Baker and Andrew K. McCallum",
    title = "Distributional clustering of words for text
    booktitle = "Proceedings of {SIGIR}-98,
    21st {ACM} International Conference on Research and Development in
       Information Retrieval",
    publisher = "ACM Press, New York, US",
    address = "Melbourne, AU",
    editor = "W. Bruce Croft and Alistair Moffat
              and Cornelis J. van Rijsbergen and Ross Wilkinson
              and Justin Zobel",
    pages = "96-103",
    year = "1998",
    url = "citeseer.nj.nec.com/baker98distributional.html" }

Neal Richter - February 9, 2001 - User Profiling & Adaptive Web Sites

Mining Web Access Logs Using Relational Competitive Fuzzy Clustering
Olfa Nasraoui, Hichem Frigui, Anupam Joshi, Raghu Krishnapuram

The proliferation of information on the World-Wide Web has made the personalization of this information space a necessity. An important part of Web personalization is to mine typical user profiles from the vast amount of historical data stored in access logs. In this paper, we define the notion of a "user session" and a new distance measure between two web sessions that captures the organization of a web site. A competitive agglomeration clustering algorithm which can automatically cluster data into a parsimonious number of components is used to analyze server access logs and obtain typical session profiles of users.

@misc{ nasraoui-mining,
  author = "O. Nasraoui and H. Frigui and A. Joshi and
     R. Krishnapuram",
  title = "Mining Web Access Logs Using Relational Competitive Fuzzy
  text = "O. Nasraoui, H. Frigui, A. Joshi, and R. Krishnapuram,
     Mining Web Access Logs Using Relational Competitive Fuzzy
     Clustering, to be presented at the Eight International Fuzzy
     Systems Association World Congress - IFSA 99, Taipei, August
  url = "citeseer.nj.nec.com/286359.html" }

Improving the Effectiveness of a Web Site with Web Usage Mining
Myra Spiliopoulou, Carsten Pohlek, Lukas C. Faulstich

Effective web presence is for many companies indispensable for their success to the global market. In recent years, several methods have been developed for measuring and improving the effectiveness of commercial sites. However, they mostly concentrate on web page design and on access analysis. In this study, we propose a methodology of assessing the quality of a web site in turning its users into customers. Our methodology is based on the discovery and comparison of navigation patterns of customers and non-customers. This comparison leads to rules on how the site's topology should be improved. We further propose a technique for dynamically adapting the site according to those rules.

Doug Warner - February 2, 2001 - Visualizing Graphs

Multilevel Visualization of Clustered Graphs
Peter Eades, Qingwen Feng

Clustered graphs are graphs with recursive clustering structures over the vertices. This type of structure appears in many systems. Examples include CASE tools, management information systems, VLSI design tools, and reverse engineering systems. Existing layout algorithms represent the clustering structure as recursively nested regions in the plane. However, as the structure becomes more and more complex, two dimensional plane representations tend to be insufficient. In this paper, firstly, we describe some two dimensional plane drawing algorithms for clustered graphs; then we show how to extend two dimensional plane drawings to three dimensional multilevel drawings. We consider two conventions: straight-line convex drawings and orthogonal rectangular drawings; and we show some examples.

@inproceedings{ eades96multilevel,
    author = "Peter Eades and Qing-Wen Feng",
    title = "Multilevel Visualization of Clustered Graphs",
    booktitle = "Proc. Graph Drawing, {GD}",
    number = "1190",
    month = "18-20~",
    publisher = "Springer-Verlag",
    address = "Berlin, Germany",
    pages = "101-112",
    year = "1996",
    url = "citeseer.nj.nec.com/eades97multilevel.html" }

Darin Rambo - January 26, 2001 - User Interface Insights

Darin will be giving handouts

Bikram Banerjee - January 19, 2001 - Optimization for Hierarchical Clustering

Iterative Optimization and Simplification of Hierarchical Clusterings
Doug Fisher

Clustering is often used for discovering structure in data. Clustering systems differ in the objective function used to evaluate clustering quality and the control strategy used to search the space of clusterings. Ideally, the search strategy should consistently construct clusterings of high quality, but be computationally inexpensive as well. In general, we cannot have it both ways, but we can partition the search so that a system inexpensively constructs a `tentative' clustering for initial examination, followed by iterative optimization, which continues to search in background for improved clusterings. Given this motivation, we evaluate an inexpensive strategy for creating initial clusterings, coupled with several control strategies for iterative optimization, each of which repeatedly modifies an initial clustering in search of a better one. One of these methods appears novel as an iterative optimization strategy in clustering contexts. Once a clustering has been constructed it is judged by analysts&#151;often according to task-specific criteria. Several authors have abstracted these criteria and posited a generic performance task akin to pattern completion, where the error rate over completed patterns is used to `externally' judge clustering utility. Given this performance task, we adapt resampling-based pruning strategies used by supervised learning systems to the task of simplifying hierarchical clusterings, thus promising to ease post-clustering analysis. Finally, we propose a number of objective functions, based on attribute-selection measures for decision-tree induction, that might perform well on the error rate and simplicity dimensions.

@article{ fisher96iterative,
    author = "Doug Fisher",
    title = "Iterative Optimization and Simplification of Hierarchical
    journal = "Journal of Artificial Intelligence Research",
    volume = "4",
    pages = "147-180",
    year = "1996",
    url = "citeseer.nj.nec.com/fisher95iterative.html" }

Neal Richter - January 12, 2001 - Interactive Clustering

Relevance and Reinforcement in Interactive Browsing
Anton Leuski

We consider the problem of browsing the top ranked portion of the documents returned by an information retrieval system. We describe an interactive relevance feedback agent that analyzes the inter-document similarities and can help the user to locate the interesting information quickly. We show how such an agent can be designed and improved by using neural networks and reinforcement learning. We demonstrate that its performance significantly exceeds the performance of the traditional relevance feedback approach.

@misc{ leuski-relevance,
  author = "Anton Leuski",
  title = "Relevance and Reinforcement in Interactive Browsing",
  url = "citeseer.nj.nec.com/leuski00relevance.html" }
 A. Leuski. Relevance and reinforcement in interactive browsing.
 In Proceedings of Ninth International Conference
 on Information and Knowledge Management (CIKM'00),
 pages 119-126, November 2000.

Interactive Cluster Visualization for Information Retrieval
James Allan, Anton V. Leouski, and Russell C. Swan

This study investigates the ability of cluster visualization to help a user rapidly identifyrelevant documents. It provides added support for the truth of the Cluster Hypothesis on retrieved documents and shows that clustering of relevant documents is readily visible. The study then shows the visual effect of a technique similar to relevance feedback and shows how to enhance that effect to further help the user locate relevant material.

J. Allan, A. Leouski, and R. Swan.
Interactive Cluster Visualization for Information Retrieval.
Technical Report IR-116, Center for Intelligent Information Retrieval,
University of Massachusetts, Amherst, 1997.

Steve Durbin - January 5, 2001 - Part of Speech Tagging

Some Advances in Transformation-Based Part of Speech Tagging
Eric Brill

Most recent research in trainable part of speech taggers has explored stochastic tagging. While these taggers obtain high accuracy, linguistic information is captured indirectly, typically in tens of thousands of lexical and contextual probabilities. In (Brill 1992), a trainable rule-based tagger was described that obtained performance comparable to that of stochastic taggers, but captured relevant linguistic information in a small number of simple non-stochastic rules. In this paper, we describe a number of extensions to this rulebased tagger. First, we describe a method for expressing lexical relations in tagging that stochastic taggers are currently unable to express. Next, we show a rule-based approach to tagging unknown words. Finally, we show how the tagger can be extended into a k-best tagger, where multiple tags can be assigned to words in some cases of uncertainty.

Brill E. "Some advances in transformation-based part of speech
In Proceedings of the Twelfth National Conference on Artificial
(AAAI-94). 1994.

Doug Warner - December 15, 2000 - LAPART Neural Networks

Acquiring rule sets as a product of learning in a logical neural architecture
Healy, M.J.; Caudell, T.P.
Electronic copy pending, Paper copy available

Envisioning neural networks as systems that learn rules calls forth the verification issues already being studied in knowledge-based systems engineering, and complicates these with neural-network concepts such as nonlinear dynamics and distributed memories. We show that the issues can be clarified and the learned rules visualized symbolically by formalizing the semantics of rule-learning in the mathematical language of two-valued predicate logic. We further show that this can, at least in some cases, be done with a fairly simple logical model. We illustrate this with a combination of two example neural-network architectures, LAPART, designed to learn rules as logical inferences from binary data patterns, and the stack interval network, which converts real-valued data into binary patterns that preserve the semantics of the ordering of real values. We discuss the significance of the formal model in facilitating the analysis of the underlying logic of rule-learning and numerical data representation. We provide examples to illustrate the formal model, with the combined stack interval/LAPART networks extracting rules from numerical data.

Bikram Banerjee - December 8, 2000 - Similarity Assessment

General-Purpose Case-Based Planning: Methods and Systems
Ralph Bergmann, Héctor Muñoz-Avila, and Manuela Veloso

First Paragraph:
Planning consists of the construction of some course of actions to achieve a specified set of goals, starting from an initial situation. For example, determining a plan of actions (often called operators) for transporting a certain set of goods from a source location to some destination provided that there are different means of transportation is a typical planning problem in the transportation domain. Many planning problems of practical interest also appear in engineering domains, e.g., the domain of process planning.

M. Veloso, H. Munoz-Avila, and R. Bergmann.
General-purpose case-based planning: Methods and systems. AI
9(3):128-137, 1996.

(OPTIONAL) Constructive Similarity Assessment: Using Stored Cases to Define New Situations
David B. Leake

A fundamental issue in case-based reasoning is similarity assessment: determining similarities and differences between new and retrieved cases. Many methods have been developed for comparing input case descriptions to the cases already in memory. However, the success of such methods depends on the input case description being sufficiently complete to reflect the important features of the new situation, which is not assured. In case-based explanation of anomalous events during story understanding, the anomaly arises because the current situation is incompletely understood; consequently, similarity assessment based on matches between known current features and old cases is likely to fail because of gaps in the current case's description.

Our solution to the problem of gaps in a new case's description is an approach that we call constructive similarity assessment. Constructive similarity assessment treats similarity assessment not as a simple comparison between fixed new and old cases, but as a process for deciding which types of features should be investigated in the new situation and, if the features are borne out by other knowledge, added to the description of the current case. Constructive similarity assessment does not merely compare new cases to old: using prior cases as its guide, it dynamically carves augmented descriptions of new cases out of memory.

@techreport{ leake92constructive,
    author = "David Leake",
    title = "Constructive Similarity Assessment: Using Stored Cases to
       Define New Situations",
    year = "1992",
    url = "citeseer.nj.nec.com/60240.html" }

Neal Richter - December 1, 2000 - Genetic Algorithms for Learning Bayesian Networks

Learning Bayesian Networks from Incomplete Data using Evolutionary Algorithms
James W. Myers, Kathryn B. Laskey, Kenneth A. DeJong

This paper describes an evolutionary algorithm approach to learning Bayesian networks from incomplete data. This problem is characterized by a huge solution space with a highly multi-modal landscape. State-of-the-art approaches all involve using deterministic approaches such as the expectation-maximization algorithm. These approaches are guaranteed to find local maxima, but do not explore the landscape for other modes. Our approach evolves the structure of the network and the missing data. We use a factorial design to choose a good set of parameters for selection, crossover, and mutation. We show that our algorithm produces accurate results for a classification problem with missing data.

Zuzana Gedeon - November 17, 2000 - Suffix Trees

A practical Suffix-Tree Implementation for String Searches
Bogdan Dorohonceanu and Craig Nevill-Manning
Paper Copy Only

See also:
Direct link to the article with the abstract and source code for their implementation of Suffix trees with Java:

Suffix trees are used for string searches. Our authors describe how to build a generalized suffix tree data structure using as few hardware resources as possible while still approaching the time complexity derived in theory. Additional resources include aa700.txt (listings) and aa700.zip (source code).

Revisit: Grouper: A Dynamic Clustering Interface to Web Search Results
Oren Zamir and Oren Etzioni

Revisit: Fast and Intuitive Clustering of Web Documents
Oren Zamir Oren Etzioni Omid Madani Richard M. Karp

Gary Boone - 9:00am - 10:00am - Thursday, November 9, 2000 - Text Analysis

Extreme Dimensionality Reduction for Text Learning: Cluster-generated Feature Spaces
Gary Boone
An exciting array of information management applications would be possible if computers could effectively understand text-based information. Significant progress has been made by employing statistical and machine learning techniques, but fundamental limitation of these vector space approaches is the large dimensionality of the resulting learning space. With typical domain vocabularies and therefore representational dimensionalities containing 10,000-40,000 words, applicable learning techniques are limited. Feature selection techniques can reduce these values, but performance typically suffers as words are eliminated below a few thousand feature words.

Cluster-generated Feature Spaces is a new method that generates features with fast and accurate performance with fewer than 100 features and in many cases less than 10. The approach emphasizes learning a new feature space into which training and query data are subsequently projected. In contrast to feature selection approaches, no feature words are discarded. As a result, dimensionality can be reduced while avoiding the problem of "feature invisibility" which occurs when no feature words are present in a document. The method has the advantages that it is faster than comparable techniques, results in equal or more accurate performance of classification tasks, and enables the application of a broad range of learning techniques in the new, low dimensional space.

Performance results will be shown for email, newsgroup, and web page classification tasks. Comparisons to other common techniques will be given, including Naive Bayes, Information Retrieval clustering and non-clustering techniques, Latent Semantic Analysis, and Support Vector Machines.

By clustering training data in text learning tasks, we can create more effective features than commonly used term features. These cluster features dramatically reduce the dimensionality of the learning space, allowing the use of inductive algorithms that are intractable in high dimensions. The result is equal or better accuracy in classification tasks and faster classification times. Additionally, because no features are discarded in the feature generation process, the problem of feature invisibility is avoided. The process operates by first creating k cluster features, then projecting the training data into the k-dimensional cluster features space. A small k allows inductive algorithms with compact representations and rapid performance, such as Bayesian neural nets, to be used. Applying the cluster features approach to classification tasks for standardized newsgroup and real-world email data sets, we can reduce the learning spaces to fewer than 10 dimensions.

Steve Durbin - November 3, 2000 - Clicktrack Data Analysis

I will consider ways we might extract more information from user clicktrack data in order&#151;adapt as appropriate to go beyond consensus recommendations based on an average of all previous user activity. I'll also indicate how visualization tools like Mathematica may be helpful in developing effective algorithms. A summary and paper describing one approach is at http://www.ics.uci.edu/~icadez/projects/weblog/.

Visualization of Navigation Patterns on a Web Site Using Model-Based Clustering
Igor Cadez, David Heckerman, Christopher Meek, Padhraic Smyth, and Steven White

We present a new methodology for visualizing navigation patterns on a Web site. In our approach, we rst partition site users into clusters such that only users with similar navigation paths through the site are placed into the same cluster. Then, for each cluster, we display these paths for users within that cluster. The clustering approach we employ is model based (as opposed to distance based) and partitions users according to the order in which they request Web pages. In particular, we cluster users by learning a mixture of first-order Markov models using the Expectation Maximization algorithm. Our algorithm scales linearly with both number of users and number of clusters, and our implementation easily handles millions of users and thousands of clusters in memory.

Lou Glassy - October 20, 2000 - Markov Chains

The Practice of Programming
Brian W. Kernighan and Rob Pike
Chapter 3: Design and Implementation
No online version.

Calculating Markov Chains using text.. can be used to produce readable unique text.

Neal Richter - October 13, 2000 - Learning Belief Networks

Learning Accurate Belief Nets
Wei Zhou and Russell Greiner

Bayesian belief nets (BNs) are typically used to answer a range of queries, where each answer requires computing the probability of a particular hypothesis given some specified evidence. An effective BN-learning algorithm should, therefore, learn an accurate BN, which returns the correct answers to these specific queries. This report first motivates this objective, arguing that it makes effective use of the data that is encountered, and that it can be more appropriate than the typical "maximum likelihood" algorithms for learning BNs. We then describe several different learning situations, which differ based on how the query information is presented. Based on our analysis of the inherent complexity of these tasks, we define three algorithms for learning the best CPtables for a given BN-structure, and then demonstrate empirically that these algorithms work effectively.

@misc{ zhou99learning,
  author = "W. Zhou and R. Greiner",
  title = "Learning accurate belief nets",
  text = "W. Zhou and R. Greiner. Learning accurate belief
     nets. Technical report, UofAlberta CS, 1999.",
  year = "1999",
  url = "citeseer.nj.nec.com/article/zhou99learning.html" }

Darin Rambo - October 6, 2000 - Interface/Usability

The Trials and Tribulations of Building an Adaptive User Interface
Benhamin Korvemaker & Russel Greiner

As every user has his own ideosyncracies and preferences, an interface that is honed for one user may be problematic for another. To accommodate a diverse range of users, many computer applications therefore include an interface that can be customized&#151;e.g., by adjusting parameters, or defining macros. This allows each user to have his "own" version of the interface, honed to his specific preferences. However, most such interfaces require the user to perform this customization by hand&#151;a tedious process that requires the user to be aware of his personal preferences. We are therefore exploring adaptive interfaces, that can autonomously determine the user's preference, and adjust the interface appropriately. This paper reports a series of experiments towards building such an adaptive interface&#151;here a Unix shell that can predict the user's next command based on his previous interactions, and use this to simplify the user's future interactions. After summarizing the Davison/Hirsh (1998) work (for learning "command stubs"), we then explore several ways of extending and improving this system; e.g., to predict entire command lines, to use various other types of information, etc.

@misc{ korvemaker99trials,
  author = "B. Korvemaker and R. Greiner",
  title = "The trials and tribulations of building an adaptive user
  text = "Benjamin Korvemaker and Russell Greiner. The trials and
     tribulations of building an adaptive user
     interface. http://www.cs.ualberta.ca/greiner/PAPERS/AdaptUI.ps,
  year = "1999",
  url = "citeseer.nj.nec.com/113705.html" }

Edited Adaptive Hypermedia: Combining Human and Machine Intelligence to Achieve Filtered Information
Kristina Hook, SICS, Asa Rudstrom and Annika Waern.

We discuss a novel approach to filtering of hypermedia information based on an information broker and user environment coupled together. The advantage of the proposed approach, edited adaptive hypermedia , is that it combines human expertise with machine intelligence in order to achieve high quality of the filtered information provided to the end users.

@misc{ hook97edited,
  author = "K. Hook and A. Rudstrom and A. Waern",
  title = "Edited adaptive hypermedia: Combining human and machine
     intelligence to achieve filtered information",
  text = "K. Hook, Asa Rudstrom, and A. Waern. Edited adaptive
     hypermedia: Combining human and machine intelligence to achieve
     filtered information. In Workshop on Flexible Hypertext,
     Apr. 1997.",
  year = "1997",
  url = "citeseer.nj.nec.com/hook97edited.html" }

More stuff just for giggles

Lumiere Project: Bayesian Reasoning for Automated Assistance a.k.a. "the annoying little microsoft helper thing"

Doug Warner - September 29, 2000 - Knowledge Representation

Similarity, Structure, and Knowledge: A Representational Approach to Assessment
Peder J. Johnson, Timothy E. Goldsmith, Kathleen W. Teague
No Abstract, Paper copy only

Bikramjit Banerjee - September 22, 2000 - Master's Thesis

Fast Convergence of Concurrent Reinforcement Learners
Bikramjit Banerjee

When several agents learn concurrently, the payoff received by an agent is dependent on the behaviors of the other agents. As the other agents learn and adapt their strategies, the reward of one agent becomes non-stationary. Such a non-stationary environment violates the assumptions underlying the convergence-proofs of single-agent Reinforcement t-Learning techniques. As a result, the standard RL techniques like Q-learning are not guaranteed to converge in a multi-agent environment. Furthermore, the focus of convergence for multiple concurrent learners is on an equilibrium strategy-profile rather than the optimal strategies for individual agents. M. Littman introduced the minimax-Q algorithm that guarantees convergence to equilibrium Q-values in the limit, in zero-sum games. This was extended to general-sum domains by Hu&Wellman. These two techniques are shown to be identical in zero-sum games. The notion of minimax-Q is extended to the general-sum domains and studied experimentally compared to the Hu-Wellman algorithm. Popular speed-up techniques like Qlambda and SARSA are used to augment minimax-Q algorithm and experimentally verified to expedite minimax-Q. The convergence of the new SARSA-minimax-Q algorithm, to minimax-Q values, is proved under appropriate assumptions. Finally, we look beyond Nash-Equilibria as the only acceptable point of convergence in general-sum games. We experimentally show that even naive learners can get higher pay offs, depending on the game-structure and argue in favor of coordination mechanisms to attempt convergence to non-Nash-Equilibrium but higher-payoff (also called non-myopic) equilibrium points.

Neal Richter - September 1, 2000 - Grouper Web Clustering/Searching

Grouper II Meta Search engine

Grouper: A Dynamic Clustering Interface to Web Search Results
Oren Zamir and Oren Etzioni

Users of Web search engines are often forced to sift through the long ordered list of document "snippets" returned by the engines. The IR community has explored document clustering as an alternative method of organizing retrieval results, but clustering has yet to be deployed on most major search engines. The NorthernLight search engine organizes its output into "custom folders" based on pre-computed document labels, but does not reveal how the folders are generated or how well they correspond to users' interests.

In this paper, we introduce Grouper&#151;an interface to the results of the HuskySearch meta-search engine, which dynamically groups the search results into clusters labeled by phrases extracted from the snippets. In addition, we report on the first empirical comparison of user Web search behavior on a standard ranked-list presentation versus a clustered presentation. By analyzing HuskySearch logs, we are able to demonstrate substantial differences in the number of documents followed, and in the amount of time and effort expended by users accessing search results through these two interfaces.

@article{ zamir99grouper,
    author = "Oren Zamir and Oren Etzioni",
    title = "Grouper: A Dynamic Clustering Interface to Web Search
    journal = "WWW8 / Computer Networks",
    volume = "31",
    number = "11-16",
    pages = "1361-1374",
    year = "1999",
    url = "citeseer.nj.nec.com/zamir99grouper.html" }

Fast and Intuitive Clustering of Web Documents
Oren Zamir Oren Etzioni Omid Madani Richard M. Karp

Conventional document retrieval systems (e.g., Alta Vista) return long lists of ranked documents in response to user queries. Recently, document clustering has been put forth as an alternative method of organizing the results of a retrieval [4]. A person browsing the clusters can discover patterns that would be overlooked in the traditional ranked-list presentation.

In this context, a document clustering algorithm has two key requirements. First, the algorithm ought to produce clusters that are easy-to-browse&#151;a user needs to determine at a glance whether the contents of a cluster are of interest. Second, the algorithm has to be fast even when applied to thousands of documents with no preprocessing.

This paper describes several novel clustering methods, which intersect the documents in a cluster to determine the set of words (or phrases) shared by all the documents in the cluster. We report on experiments that evaluate these intersection-based clustering methods on collections of snippets returned from Web search engines. First, we show that word-intersection clustering produces superior clusters and does so faster than standard techniques. Second, we show that our O(n log n) expected time phrase-intersection clustering method produces comparable clusters and does so more than two orders of magnitude faster than word-intersection.

@inproceedings{ zamir97fast,
    author = "Oren Zamir and Oren Etzioni and Omid Madani and Richard
       M. Karp",
    title = "Fast and Intuitive Clustering of Web Documents",
    booktitle = "Knowledge Discovery and Data Mining",
    pages = "287-290",
    year = "1997",
    url = "citeseer.nj.nec.com/zamir97fast.html" }

Ganesh Prabu - August 25, 2000 - Classifying for Hierarchically Clustering - Finding Closest Documents

Hierarchically classifying documents using very few words
Daphne Koller, Mehran Sahami

The proliferation of topic hierarchies for text documents has resulted in a need for tools that automatically classify new documents within such hierarchies. Existing classification schemes which ignore the hierarchical structure and treat the topics as separate classes are often inadequate in text classification where the there is a large number of classes and a huge number of relevant features needed to distinguish between them. We propose an approach that utilizes the hierarchical topic structure to decompose the classification task into a set of simpler problems, one at each node in the classification tree. As we show, each of these smaller problems can be solved accurately by focusing only on a very small set of features, those relevant to the task at hand. This set of relevant features varies widely throughout the hierarchy, so that, while the overall relevant feature set may be large, each classifier only examines a small subset. The use of reduced feature sets allows us to utilize more complex (probabilistic) models, without encountering many of the standard computational and robustness difficulties.

@inproceedings{ koller97hierarchically,
    author = "D. Koller and M. Sahami",
    title = "Hierarchically Classifying Documents Using Very Few
    booktitle = "Proceedings of the Fourteenth International
       Conference on Machine Learning ({ICML}-97)",
    year = "1997",
    url = "citeseer.nj.nec.com/article/koller97hierarchically.html" }

An Optimal Algorithm for closest pair Maintenance
Sergei N. Bespamyatnikh

Given a set S of n points in k-dimensional space , and an Lt metric the dynamic closest pair problem is defined as follows: find a closest pair of S after each update of S (the insertion or the deletion of a point). For fixed dimension k and fixed metric Lt, we give a data structure of size O(n) that maintains a closest pair of S in O(log n)time per insertion and deletion

Sergei N. Bespamyatnikh. An optimal algorithm for closest pair
In Proc. 11th Annu. ACM Sympos. Comput. Geom., pages 152-161, 1995.

Lou Glassy - August 11, 2000 - Information Space Topology

Chapter 3, Lou's Master's Thesis
by Lou Glassy {Montana State University}

Neal Richter - August 4, 2000 - E-Mail Management/Routing

Concept Features in Re:Agent, an Intelligent Email Agent
by Gary Boone {Georgia Institute of Technology}

An important issue in the application of machine learning techniques to information management tasks is the nature of features extracted from textual information. We have created an intelligent email agent that can learn actions such as filtering, prioritizing, downloading to palmtops, and forwarding email to voice-mail using automatic feature extraction. Our agent's new feature extraction approach is based on first learning concepts present within the mail, then using these concepts as features for learning actions to perform on the messages. What features should be chosen? This paper describes the concept features approach and considers two sources for learning conceptual features: groups defined by the user and groups defined by the agent's task. Additionally, features may be defined by vectorized examples or keywords. Experimental results are provided for an email sorting task.

@misc{ boone-concept,
  author = "Gary Boone",
  title = "Concept Features in Re:Agent, an Intelligent Email Agent",
  url = "citeseer.nj.nec.com/123300.html" }
  Boone, Gary. (1998).
Concept features in Re:Agent, an intelligent email agent.
Proceedings of the Second International Conference on Autonomous
New York: ACM Press, 141-148.


A Comparison of Classifiers and Document Representations for the Routing Problem
by Heinrich Schutze, David A. Hully, Jan O. Pedersen {XEROX Research}

In this paper, we compare learning techniques based on statistical classification to traditional methods of relevance feedback for the document routing problem. We consider three classification tech-niques which have decision rules that are derived via explicit error minimization: linear discriminant analysis, logistic regression, and neural networks. We demonstrate that the classifiers perform 10-15% better than relevance feedback via Rocchio expansion for the TREC-2 and TREC-3 routing tasks.

Error minimization is difficult in high-dimensional feature spaces because the convergence process is slow and the models are prone to over-fitting. We use two different strategies, latent semantic in-dexing and optimal term selection, to reduce the number of features. Our results indicate that features based on latent semantic indexing are more effective for techniques such as linear discriminant analysis and logistic regression, which have no way to protect against over fitting. Neural networks perform equally well with either set of features and can take advantage of the additional information available when both feature sets are used as input.

@inproceedings{ schutze95comparison,
    author = "Hinrich Schutze and David A. Hull and Jan O. Pedersen",
    title = "A Comparison of Classifiers and Document Representations
       for the Routing Problem",
    booktitle = "Research and Development in Information Retrieval",
    pages = "229-237",
    year = "1995",
    url = "citeseer.nj.nec.com/schutze95comparison.html" }

Doug Warner - July 28, 2000 - Summarization

Ultra-Summarization: A Statistical Approach to Generating Highly Condensed Non-Extractive Summaries
by Michael J. Witbrock et. al.


Learning and Representing Topic
by Thomas Hofmann

This paper presents a novel statistical mixture model for natural language learning in information retrieval. The described learning architecture is based on word occurrence statistics and extracts hierarchical relations between groups of documents as well as an abstractive organization of keywords. To train the model we derive a generalized, annealed version of the Expectation-Maximization (EM) algorithm for maximum likelihood estimation. The benefits of the model for interactive information retrieval and automated cluster summarization are experimentally investigated.

@misc{ mixture-learning,
  author = "Hierarchical Mixture",
  title = "Learning and Representing Topic",
  url = "citeseer.nj.nec.com/119738.html" }

Ganesh Prabu - July 21, 2000 - Efficient Clustering

BIRCH: An Efficient Data Clustering Method for Very Large Databases
by Tian Zhang et. al.

Finding useful patterns in large datasets has attracted considerable in terest recently, and one of the most widely studied problems in this area is the identification of clusters, or densely populated regions, in a multi-dimensiona l dataset. Prior work does not adequately address the problem of large datasets and minimization of I/O costs.

This paper presents a data clustering method named BIRCH (Balanced Iterative Red ucing and Clustering using Hierarchies), and demonstrates that it is especially suitable for very large databases. BIRCH incrementally and dynamically clusters incoming multi-dimensional metric data points to try to produce the best quality clustering with the available resources (i.e., available memory and time constr aints). BIRCH can typically find a good clustering with a single scan of the dat a, and improve the quality further with a few additional scans. BIRCH is also th e first clustering algorithm proposed in the database area to handle "noise" (da ta points that are not part of the underlying pattern) effectively.

We evaluate BIRCH's time/space efficiency, data input order sensitivity, and clustering quality through several experiments. We also present a performance compa risons of BIRCH versus CLARANS, a clustering method proposed recently for large datasets, and show that BIRCH is consistently superior.

Tian Zhang, Raghu Ramakrishnan, and Miron Livny.
BIRCH: An Efficient Data Clustering Method for Very Large Databases.
In Proceedings of the 1996 ACM SIGMOD International Conference on
   Management of Data,
pages 103-114, Montreal, Canada, 1996

CURE: An Efficient Clustering Algorithm for Large Databases
by Sudipto Guha et. al.

Clustering, in data mining, is useful for discovering groups and identifying interesting distributions in the underlying data. Traditional clustering a lgorithms either favor clusters with spherical shapes and similar sizes, or are very fragile in the presence of outliers. We propose a new clustering algorithm called CURE that is more robust to outliers, and identifies clusters having non-spherical shapes and wide variances in size. CURE achieves this by representing each cluster by a certain fixed number of points that are generated by selecting well scattered points from the cluster and then shrinking them toward the center of the cluster by a specified fraction. Having more than one representative point per cluster allows CURE to adjust well to the geometry of non-spherical shap es and the shrinking helps to dampen the effects of outliers. To handle large da tabases, CURE employs a combination of random sampling and partitioning. A rando m sample drawn from the data set is first partitioned and each partition is part ially clustered. The partial clusters are then clustered in a second pass to yie ld the desired clusters. Our experimental results confirm that the quality of clusters produced by CURE is much better than those found by existing algorithms. Furthermore, they demonstrate that random sampling and partitioning enable CURE to not only outperform existing algorithms but also to scale well for large data bases without sacrificing clustering quality.
(Note: This is a large file and should be printed on a fast printer or during a time when printer demand is low.)

@inproceedings{ guha98cure,
    author = "Sudipto Guha and Rajeev Rastogi and Kyuseok Shim",
    title = "{CURE}: an efficient clustering algorithm for large
    pages = "73-84",
    year = "1998",
    url = "citeseer.nj.nec.com/article/guha98cure.html" }

Neal Richter - July 14, 2000 - Text Categorization & Data Mining

Combining Classifiers in Text Categorization
by Leah S. Larkey and W. Bruce Croft.

Three different types of classifiers were investigated in the context of a text categorization problem in the medical domain: the automatic assignment of ICD9 codes to dictated inpatient discharge summaries. K-nearest-neighbor, relevance feedback, and Bayesian independence classifiers were applied individually and in combination. A combination of different classifiers produced better results than any single type of classifier. For this specific medical categorization problem, new query formulation and weighting methods used in the k-nearest-neighbor classifier improved performance.

@inproceedings{ larkey96combining,
    author = "Leah S. Larkey and W. Bruce Croft",
    title = "Combining classifiers in text categorization",
    booktitle = "Proceedings of {SIGIR}-96,
       19th {ACM} International Conference on Research and Development
       in Information Retrieval",
    publisher = "ACM Press, New York, US",
    address = "Z{\"{u}}rich, CH",
    editor = "Hans-Peter Frei and Donna Harman and Peter
       Sch{\"{a}}uble and Ross Wilkinson",
    pages = "289-297",
    year = "1996",
    url = "citeseer.nj.nec.com/larkey96combining.html" }

A Framework for Comparing Text Categorization Approaches
by Isabelle Moulinier.

For the past few years, text categorization has emerged as an application domain to machine learning techniques. Several approaches have already been proposed. This paper does not present yet another technique. It is rather an attempt to unify the approaches encountered so far. Moreover this state-of-the-art enables us to stress a shortcoming in earlier research: the lack of evaluation of inductive learners in the categorization process. We present a first attempt to remedy this lack. We expose an experimental framework, that fits in with our unified view of text categorization methods. This framework allows us to conduct a set of tentative experiments in order to assess which characteristics allow a learner to perform well on the text categorization task.

@misc{ moulinier96framework,
  author = "I. Moulinier",
  title = "A Framework for Comparing Text Categorization Approaches",
  text = "Isabelle Moulinier. A Framework for Comparing Text
    Categorization Approaches. In AAAI Spring Symposium on Machine
    Learning and Information Access, Stanford University, March
  year = "1996",
  url = "citeseer.nj.nec.com/moulinier96framework.html" }
Connected from IP: