Oingo Meaning-Based Search Technology
The rapid advance of computer technology, and in particular, the growth of the World Wide Web, has resulted in the availability of ever-increasing amounts of information. It is said that "information is power", but the stark reality is that information is useless if one has no way to find what one needs to know. It is more accurate to state that "the ability to find information is power".
The meteoric rise of search-related companies and "portal" sites on the Web is clear evidence of this. And yet users are increasingly dissatisfied with the results they get from search engines. As the Web grows, the task of retrieving and presenting relevant information becomes increasingly difficult. Current search techniques are simply not sufficient; new methods of information retrieval are clearly needed.
Oingo’s patent-pending search technology centers on an innovative approach to information retrieval: the idea of a "meaning-based" search. Instead of simply indexing words that appear in target documents and allowing searchers to find desired word instances, searches are conducted within the realm of "semantic space", allowing searchers to locate information that is "close in meaning" to the concepts they are interested in. Greater relevancy is achieved because results that are not relevant to the subject searched upon can be filtered out. Results that would have been missed in a traditional plain text search can also be retrieved: the fact that a certain word does not appear on a page does not preclude that page’s relevance to the subject of interest.
The initial objective of our search technology is to enable Web users to easily locate subject categories within Mozilla’s large and rapidly growing Open Directory, through the most convenient and simplest means possible. A "target document" in our system therefore corresponds to a single subject page within the Open Directory. Arranged in a roughly hierarchical fashion, the directory consists today of approximately 160,000 unique topics (but is growing very rapidly).
By allowing users to refine their searches to specific meanings of words, we enable them to filter out irrelevant material and therefore achieve more precise results (higher "precision"). For example, a search on the word "bulls" would normally bring up a large amount of information relating to the Chicago Bulls basketball team. But if the user is actually interested only in another meaning of the word "bulls", such as bulls as a kind of cattle, they can easily refine their search to this specific meaning and block out all of the irrelevant basketball-related results.
Because searches pull in the conceptual areas "near" a particular meaning, we also present the user with categories that are likely to be of interest, yet might have been missed by a traditional search approach (higher "recall"). For example, a search on "bulls" would also return "cattle", because the two concepts are "near" each other in semantic space.
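The two quoted terms, "precision" and "recall", are the standard information retrieval measures. As a minimal sketch, the "bulls" example can be expressed in numbers. The category names and result lists below are hypothetical and chosen only for illustration; they are not actual Oingo output.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a collection of retrieved results."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical categories a user interested in cattle would call relevant.
relevant = {"Cattle", "Bull Breeding", "Livestock"}

# Assumed plain text match on "bulls": mostly basketball categories.
text_results = ["Chicago Bulls", "Bulls Tickets", "Bulls Fan Clubs", "Bull Breeding"]

# Assumed meaning-refined search: the cattle sense plus nearby concepts.
meaning_results = ["Bull Breeding", "Cattle", "Livestock"]

text_pr = precision_recall(text_results, relevant)
meaning_pr = precision_recall(meaning_results, relevant)
print("text search:   ", text_pr)
print("meaning search:", meaning_pr)
```

In this toy data the text search finds only one of the three relevant categories and buries it among basketball results, while the refined meaning search returns all three and nothing else.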
The value of Oingo's meaning-based method goes well beyond the mere filtering of irrelevant results. The true power of our technology is demonstrated with a query such as "shopping for fishing gear". After exhausting a search for Web sites containing all three words, a traditional text-based search engine resorts to looking for two-of-three word matches. This search could yield results that are about shopping and gear but have nothing at all to do with fishing, or conversely results that are about fishing gear but have nothing to do with shopping! An Oingo meaning-based search does not give up so easily; it essentially tries hundreds of possible combinations of related terms before giving up on finding information related to all three concepts. Consider the following examples of highly relevant results for this query: "Buying Fishing Equipment Online", "Fishing Equipment Retail Shops", and "Gifts for Fishing Enthusiasts". A traditional text-based search cannot possibly see the high relevancy of these results unless all three of the specific search words just happen to appear together on the page. Because of this, traditional text-based search results can seem arbitrary, even random, at times. Replace the word "employment" with "jobs", for example, and you will typically get entirely different results. By searching on meanings, instead of just words, our search eliminates this "randomness" of results.
Before a meaning-based search can be implemented, it is necessary to architect a "semantic space" within which searches will be performed. Our semantic space is defined by a large network of interconnected meanings that we refer to as the Oingo Lexicon. The Lexicon was created by a combination of automated procedures and manual editing, and we believe it to be among the most comprehensive datasets of its type in existence today.
Of key importance to the definition of a semantic space is the accurate determination of "semantic distance" within this space; in other words, deciding how far apart two concepts are. Clearly "cat" and "dog" are more similar concepts than, say, "cat" and "umbrella". The question is, exactly how similar are they? We achieve good values for these estimations by a complex analysis of relationships within the Lexicon, coupled with algorithms that reflect the way the human mind assesses connections between different ideas.
Once a rich semantic space has been defined, the next stage in the process involves identifying the "meaning" of each target document that exists within the space to be searched. This could be thought of as determining where each target document should be placed within semantic space. In simple terms, it’s really just "deciding what it’s about". This is achieved by an automated "sensing" process that analyzes the text that appears in the document; its output is verified for accuracy by human editors.
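The idea of trying combinations of related terms can be sketched as follows. This is only a word-level approximation under stated assumptions: the RELATED table is hypothetical (real entries would come from the Lexicon), and the functions expand and matches_all_concepts are illustrative names, not part of any Oingo interface.

```python
from itertools import product

# Hypothetical related-term lists, one per query concept.
RELATED = {
    "shopping": ["shopping", "buying", "retail", "gifts"],
    "fishing": ["fishing", "angling"],
    "gear": ["gear", "equipment", "tackle"],
}

def expand(query_terms):
    """Return every combination that picks one related term per concept."""
    options = [RELATED.get(term, [term]) for term in query_terms]
    return list(product(*options))

def matches_all_concepts(title, query_terms):
    """True if the title contains some related term for every concept.

    This is equivalent to checking whether any expanded combination
    appears in full in the title.
    """
    words = set(title.lower().split())
    return all(words & set(RELATED.get(term, [term])) for term in query_terms)

query = ["shopping", "fishing", "gear"]
combos = expand(query)
print(len(combos), "combinations, e.g.", combos[:3])

for title in ["Buying Fishing Equipment Online",
              "Fishing Equipment Retail Shops",
              "Shopping for Camping Gear"]:
    print(title, "->", matches_all_concepts(title, query))
```

A title such as "Gifts for Fishing Enthusiasts" would not pass this simple word-level check, because nothing in it names gear directly; recognizing it as relevant requires a genuine measure of closeness in meaning rather than a fixed list of substitutes.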
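One simple way to make "semantic distance" concrete is to treat the network of meanings as a weighted graph and take the length of the shortest path between two concepts. This is a swapped-in textbook technique (Dijkstra's algorithm), not the analysis actually used on the Lexicon, and the concepts and link weights below are invented for illustration.

```python
import heapq

# Hypothetical fragment of a lexicon: weighted, undirected links.
EDGES = [
    ("cat", "pet", 1.0),
    ("dog", "pet", 1.0),
    ("pet", "animal", 1.0),
    ("animal", "living thing", 2.0),
    ("living thing", "entity", 3.0),
    ("umbrella", "rain gear", 1.0),
    ("rain gear", "artifact", 2.0),
    ("artifact", "entity", 3.0),
]

def build_graph(edges):
    """Build an adjacency list from (concept, concept, weight) triples."""
    graph = {}
    for a, b, weight in edges:
        graph.setdefault(a, []).append((b, weight))
        graph.setdefault(b, []).append((a, weight))
    return graph

def semantic_distance(graph, source, target):
    """Shortest weighted path length between two concepts (Dijkstra)."""
    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == target:
            return dist
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbor, weight in graph.get(node, []):
            candidate = dist + weight
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                heapq.heappush(queue, (candidate, neighbor))
    return float("inf")  # no connection found

GRAPH = build_graph(EDGES)
cat_dog = semantic_distance(GRAPH, "cat", "dog")
cat_umbrella = semantic_distance(GRAPH, "cat", "umbrella")
print("cat-dog:", cat_dog, " cat-umbrella:", cat_umbrella)
```

In this toy graph "cat" and "dog" meet at a nearby shared concept, while "cat" and "umbrella" connect only through a very general one, so the second distance comes out much larger.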
When it comes time to perform a search on this meaning-indexed data, the first crucial step is to identify what meanings the user is looking for. Individual words that the user enters are analyzed, and a prediction is made as to the likelihood that certain concepts are relevant to the search. In cases where the entered search words can have more than one meaning, disambiguation between meanings is performed by comparing words against each other. For example, if the user enters "turkey", the default assumption may be that they are likely to be interested in "turkey - the food". However, if they also entered the word "Istanbul", the meaning "Turkey - the country" will be favored instead, because of the discovered connection between the concepts.
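The "turkey" example can be sketched as a small scoring routine. Everything specific here is an assumption made for illustration: the prior probabilities, the related-concept sets, the CONTEXT_BOOST constant, and the function name rank_senses.

```python
# Hypothetical senses with assumed prior probabilities and related concepts.
SENSES = {
    "turkey": [
        {"label": "turkey - the food", "prior": 0.6,
         "related": {"thanksgiving", "poultry", "recipe"}},
        {"label": "Turkey - the country", "prior": 0.4,
         "related": {"istanbul", "ankara", "europe"}},
    ],
}

CONTEXT_BOOST = 1.0  # assumed bonus for each related word found in the query

def rank_senses(word, other_words):
    """Return (label, probability) pairs for a word, most likely first."""
    context = {w.lower() for w in other_words}
    scored = []
    for sense in SENSES.get(word.lower(), []):
        overlap = len(sense["related"] & context)
        scored.append((sense["prior"] + CONTEXT_BOOST * overlap, sense["label"]))
    total = sum(score for score, _ in scored)
    return [(label, score / total) for score, label in sorted(scored, reverse=True)]

print(rank_senses("turkey", []))
print(rank_senses("turkey", ["Istanbul"]))
```

With no other words the food sense leads on its prior alone; once "Istanbul" is present, the discovered connection lifts the country sense above it.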
A first-pass search is run against all possible meanings of the entered words, with results weighted by probabilities of relevancy. However, the true power of the meaning-based search becomes apparent on the second-pass search, in which the user specifies exactly which meanings she is interested in. This is done by simply selecting the desired meanings from a list of the possible meanings that were implied by the first-pass search. This "refined search" yields results that have been filtered to reflect only the specific concepts the user is interested in.
Advanced features of the Oingo search engine include the ability to require that all results be relevant to a specific entered meaning. This is analogous to the Boolean AND operation that many modern search engines feature. By entering a plus sign (+) before a term, the user can indicate that a result must "hit" either the plain text of that term, or a meaning that is semantically similar to the meaning implied by that term. The addition of more complex Boolean operations to searches is currently under development.
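The two passes can be sketched with a toy meaning index. The topic paths, meaning labels, and probabilities below are hypothetical, as are the function names first_pass and refined_search.

```python
# Hypothetical index: directory topic -> the meaning it was tagged with.
TOPIC_MEANINGS = {
    "Sports/Basketball/Chicago Bulls": "Bulls - basketball team",
    "Agriculture/Livestock/Cattle": "bull - cattle",
    "Business/Investing/Bull Markets": "bull - market optimist",
}

# Assumed probabilities that each meaning is what a "bulls" query intends.
SENSE_PROBS = {
    "Bulls - basketball team": 0.7,
    "bull - cattle": 0.2,
    "bull - market optimist": 0.1,
}

def first_pass(sense_probs):
    """Search all candidate meanings; order topics by meaning probability."""
    weighted = [(sense_probs[meaning], topic)
                for topic, meaning in TOPIC_MEANINGS.items()
                if meaning in sense_probs]
    return [topic for _, topic in sorted(weighted, reverse=True)]

def refined_search(selected_meanings):
    """Second pass: keep only topics tagged with a meaning the user chose."""
    return [topic for topic, meaning in TOPIC_MEANINGS.items()
            if meaning in selected_meanings]

print(first_pass(SENSE_PROBS))
print(refined_search({"bull - cattle"}))
```

The first pass returns every candidate with the most probable meaning on top; the second pass drops everything outside the meanings the user selected.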
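A minimal sketch of the plus-sign behavior described above, assuming a hypothetical SIMILAR table that stands in for "semantically similar" meanings. The parsing rules and scoring are illustrative guesses, not the engine's actual query syntax handling.

```python
# Hypothetical table of terms treated as close in meaning.
SIMILAR = {
    "gear": {"gear", "equipment", "tackle"},
    "fishing": {"fishing", "angling"},
}

def parse_query(query):
    """Split a query into required (+term) and optional terms."""
    required, optional = [], []
    for token in query.lower().split():
        if token.startswith("+") and len(token) > 1:
            required.append(token[1:])
        else:
            optional.append(token)
    return required, optional

def hits(term, title_words):
    """True if the title contains the term or a term close in meaning."""
    return bool(SIMILAR.get(term, {term}) & title_words)

def search(query, titles):
    """Return titles that hit every required term, best matches first."""
    required, optional = parse_query(query)
    results = []
    for title in titles:
        words = set(title.lower().split())
        if not all(hits(term, words) for term in required):
            continue  # a +term was not hit, so the result is rejected
        score = sum(hits(term, words) for term in required + optional)
        if score:
            results.append((score, title))
    results.sort(key=lambda item: -item[0])
    return [title for _, title in results]

TITLES = ["Fishing Equipment Retail Shops", "Camping Gear", "Angling Clubs"]
print(search("fishing gear", TITLES))
print(search("+fishing gear", TITLES))
```

Without the plus sign "Camping Gear" is returned as a partial match; with "+fishing" it is rejected, while "Angling Clubs" still qualifies through a similar meaning rather than the literal word.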
To improve relevancy of results, and improve ordering of those results, a plain text word search is performed in parallel to the Oingo meaning search. This is similar to a standard search engine text search; however, when performed in concert with a meaning search, the relevancy of returned results is greatly enhanced.
Another differentiating factor the user will find when comparing the Oingo directory to other Web directories is that a search can be performed from a specific directory subject page, within the area of semantic space that is close to that topic. Typically, other directories allow users to search "within the category" that matches the page they are currently viewing. This simply means that search results are restricted to topics that exist below that point in the subject hierarchy. Oingo’s model is fundamentally different: because we index directory topics by meaning, we are not constrained by the standard hierarchical directory model. This means that we can see connections between topics, even when they appear in different branches of the directory hierarchy. Users can take advantage of this by doing a "Near Topic" search from any topic page, which pulls up relevant results wherever they are in the system, whether they are "below" in the hierarchy or not.
One more valuable feature we are able to provide is the ability for users to filter out search results that contain adult-oriented material. Because we know what indexed pages are "about", we know whether those pages concern subjects that users may wish to keep children from accessing. In the future, we will provide users with the ability to filter out results that are close in meaning to other chosen topics. For example, all results that are close in concept to "guns" might be automatically eliminated, should the user so desire.
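The difference between the two kinds of search can be sketched as follows. As a simplifying assumption, each topic is given a position in a toy two-dimensional space so that closeness can be measured with ordinary Euclidean distance; the topic paths, coordinates, and radius are all invented for illustration.

```python
import math

# Hypothetical topics: directory path -> assumed position in a toy 2-D space.
TOPICS = {
    "Recreation/Fishing": (0.0, 0.0),
    "Recreation/Fishing/Fly Fishing": (0.5, 0.0),
    "Shopping/Sports/Fishing Tackle": (0.0, 1.0),
    "Science/Biology/Fish": (1.0, 1.0),
    "Arts/Music/Opera": (9.0, 9.0),
}

def within_category(current):
    """Conventional search scope: only topics below the current one."""
    prefix = current + "/"
    return sorted(topic for topic in TOPICS if topic.startswith(prefix))

def near_topic(current, radius):
    """Topics close in meaning to the current one, nearest first."""
    cx, cy = TOPICS[current]
    scored = []
    for topic, (x, y) in TOPICS.items():
        if topic == current:
            continue
        distance = math.hypot(x - cx, y - cy)
        if distance <= radius:
            scored.append((distance, topic))
    return [topic for _, topic in sorted(scored)]

print(within_category("Recreation/Fishing"))
print(near_topic("Recreation/Fishing", radius=1.5))
```

The hierarchical scope sees only the single subtopic, whereas the near-topic scope also reaches related topics filed under the Shopping and Science branches.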
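Filtering by closeness in meaning might look like the sketch below. The distance values, the threshold, and the function name filter_results are hypothetical; in practice the distances would come from the semantic space rather than a hand-written table.

```python
# Hypothetical semantic distances from a result's topic to a blocked concept
# (0.0 means identical in meaning, 1.0 means unrelated).
DISTANCE = {
    ("Firearms Dealers", "guns"): 0.1,
    ("Hunting Rifles", "guns"): 0.2,
    ("Fishing Equipment", "guns"): 0.8,
    ("Gardening", "guns"): 0.95,
}

def filter_results(results, blocked_concepts, threshold=0.3):
    """Drop results whose topic lies within the threshold of a blocked concept."""
    kept = []
    for result in results:
        too_close = any(
            DISTANCE.get((result, concept), 1.0) <= threshold  # unknown pairs count as unrelated
            for concept in blocked_concepts
        )
        if not too_close:
            kept.append(result)
    return kept

RESULTS = ["Firearms Dealers", "Hunting Rifles", "Fishing Equipment", "Gardening"]
print(filter_results(RESULTS, ["guns"]))
```

The threshold controls how aggressively nearby topics are removed; a larger value would begin to exclude loosely related subjects as well.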
Oingo will continue to develop meaning-based search technology: we believe that the importance of this work will become increasingly apparent in the very near future. The value of meaning-based search techniques in improving information retrieval methods on the rapidly growing World Wide Web is clear, but the possible applications of this kind of research are wide-ranging and are sure to have an impact on a number of different fields. We are interested in exploring the many new opportunities that arise with the growth of information technology as a whole.