Counting Clicks and Looking at Links
At their core, the major search engines use what I call the location/frequency method of determining relevancy. For example, search for "bill clinton," and most will return pages primarily ranked by where and how often those words appear in each document.
To be more specific, a page titled "Bill Clinton's Life" is likely to be considered more relevant than others where the title tag doesn't mention the US president's name. That's an example of how the location of a term can be important. Similarly, a page that repeatedly mentions Bill Clinton probably will get more of a boost than a page with only one reference.
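For readers who like to see the idea in code, here is a minimal Python sketch of location/frequency scoring. It is an illustration only, not any search engine's actual formula: the function names and the title_weight value are assumptions made for the example.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def location_frequency_score(query, title, body, title_weight=5.0):
    """Score a page by where (title vs. body) and how often query terms appear.

    title_weight is an arbitrary illustrative value, not a known engine setting.
    """
    title_words = tokenize(title)
    body_words = tokenize(body)
    score = 0.0
    for term in tokenize(query):
        score += title_weight * title_words.count(term)  # location: title hits count more
        score += body_words.count(term)                  # frequency: every body hit counts once
    return score


# Two made-up pages: (title, body text).
pages = {
    "a": ("Bill Clinton's Life", "Bill Clinton was born in Arkansas."),
    "b": ("US Presidents", "A list that mentions Bill Clinton once."),
}
ranked = sorted(
    pages,
    key=lambda name: location_frequency_score("bill clinton", *pages[name]),
    reverse=True,
)
```

Page "a" ranks first because the query words appear in its title as well as its body, while page "b" only mentions them in passing.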
I'm grossly oversimplifying the process, of course. Location and frequency are not the only factors used. Each search engine blends its own mix of techniques into its algorithm. But location and frequency have tended to be the dominant factors.
Some new techniques may be about to change that. The idea of leveraging links as a means of improving results is making a comeback. And later this month, one search engine is going to enhance its service with Direct Hit, which taps into user feedback to improve relevancy.
Direct Hit works in the background, quietly watching what users search for, then recording which pages they visit from the "normal" search results. Over time, it develops enough data to know which pages are popular and which aren't.
To use this information, a user selects the Direct Hit option, which will likely appear above the regular search results. This will bring up Direct Hit's own list of what's relevant, where pages are ranked by user popularity.
The system is ideally suited to general queries of one or two words, which are common on the major services. A suggestion to try Direct Hit will probably only appear in response to short queries like these, similar to how the RealNames option appears only for queries of three words or less at AltaVista. The Direct Hit option will also only appear if enough data on a term has been gathered.
Some problems immediately come to mind. Spamming is foremost. Can site owners simply click their pages to the top? The chief defense against this is the sheer amount of data that Direct Hit samples, which makes it hard to skew things. Those attempting to do so are likely to be spotted. Direct Hit also has some other tricks to help control spamming.
Another problem is the fact that many users don't search deep. "Only about 7% of users really go beyond the first three pages of results," says Gary Culliss, Chairman and Founder of Direct Hit. How can the system bring the good stuff to the top if users never dig initially to find it?
"All it takes is one person to find something buried deep in the results list to start its movement upward where it can be viewed by other searchers and boosted further in its ranking," said Culliss.
In fact, top listed sites that are not visited can move down in the Direct Hit ratings, while sites buried in the results enjoy a significant boost if someone drills down and selects them.
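The up-and-down movement just described can be sketched roughly as follows. Direct Hit has not published its scoring rules, so everything here is hypothetical: the ClickPopularity class, the click_boost and skip_penalty values, and the choice to penalize only results listed above the one that was clicked.

```python
from collections import defaultdict


class ClickPopularity:
    """Toy click-feedback ranker; the scoring rule is an assumption."""

    def __init__(self, click_boost=1.0, skip_penalty=0.25):
        self.click_boost = click_boost
        self.skip_penalty = skip_penalty
        # scores[query][url] -> accumulated popularity score
        self.scores = defaultdict(lambda: defaultdict(float))

    def record(self, query, shown, clicked):
        """Log one search: `shown` is the ordered result list, `clicked` the chosen URL."""
        if clicked not in shown:
            raise ValueError("clicked URL was not in the result list")
        position = shown.index(clicked)
        for url in shown[:position]:
            # Results the user passed over drift downward.
            self.scores[query][url] -= self.skip_penalty
        # The result the user picked moves up.
        self.scores[query][clicked] += self.click_boost

    def rank(self, query, candidates):
        """Order candidates by accumulated score, highest first; ties keep input order."""
        per_query = self.scores[query]
        return sorted(candidates, key=lambda url: -per_query.get(url, 0.0))


tracker = ClickPopularity()
results = ["top.example", "middle.example", "deep.example"]
for _ in range(3):
    # Three searchers dig down and choose the third listing.
    tracker.record("microsoft", results, "deep.example")
reranked = tracker.rank("microsoft", results)
```

After three searchers skip the first two listings and pick the third, the buried page rises to the top of the reranked list.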
"You can view it in the negative sense of whatever people pass over gets moved down, or in the positive sense of whatever they click on moves up," Culliss said.
I ran some quick comparisons of Direct Hit against results from the search engine it will soon appear on. All results were at least a little better, and with some queries, they were dramatically improved. A search for "microsoft," for example, put the company home page, the Internet Explorer page and a software download page at the top.
I've no doubt Direct Hit will be popular as a supplement, and not necessarily just on one search engine. The company is talking with other players and positioning its technology as a non-exclusive addition that any of them can use.
While Direct Hit leverages humans directly, Clever leverages them indirectly, via the links they create.
You may have heard of Clever through some scattered press coverage recently given to its core technology, HITS. HITS stands for Hypertext-Induced Topic Search, and it was developed by Cornell University researcher Jon Kleinberg, while he was a visiting scientist at IBM's Almaden Research Center.
IBM has expanded and enhanced HITS into Clever, a system that ranks pages primarily by measuring links between them.
The process starts by collecting a set of pages relevant for a particular term. For example, Clever might send a query to AltaVista for "bill clinton" and then retrieve the top 200 pages listed. Next, Clever gathers all the pages that the initial 200 link to, plus any pages on the web that link to them.
The result is a set of a few hundred to a few thousand pages, which Clever ranks by counting links. Pages in the set with the most links pointing at them get the best scores, but only initially.
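Here is a small sketch of those two steps, expanding the starting pages and then counting links. It assumes the link graph is already in memory as a dictionary, which is a simplification; the function names and sample page names are invented for illustration.

```python
def expand_root_set(root, links):
    """Return the root pages plus every page they link to or are linked from.

    `links` maps each page to the set of pages it links to.
    """
    root = set(root)
    base = set(root)
    for page in root:
        base.update(links.get(page, ()))  # pages the root set links to
    for source, targets in links.items():
        if any(target in root for target in targets):
            base.add(source)              # pages that link into the root set
    return base


def count_inlinks(pages, links):
    """Count, for each page in the set, links arriving from other pages in the set."""
    counts = {page: 0 for page in pages}
    for source in pages:
        for target in links.get(source, ()):
            if target in pages and target != source:
                counts[target] += 1
    return counts


links = {
    "r1": {"x", "r2"},
    "r2": {"x"},
    "fan": {"r1", "x"},
    "unrelated": {"elsewhere"},
}
base = expand_root_set({"r1", "r2"}, links)
counts = count_inlinks(base, links)
```

In this toy graph, "x" collects three links and gets the best initial score, while "unrelated" never enters the set because it neither links to nor is linked from the starting pages.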
"Links have noise, and it's not always clear cut which pages are best," said Kleinberg. "We wondered, was there a way to get some sort of consensus out of the links?"
The solution is to recalculate the scores, this time letting links from important pages carry more weight. To paraphrase Animal Farm, all links are created equal, but some are more equal than others.
Picture it in real life. A link from a page within Yahoo should mean more than a link from someone's personal page, since the bar for being listed is much higher. Likewise, links from other "important" sites should carry more weight.
The challenge is helping the algorithm understand what pages are "important." That's where the initial ranking comes in. Pages with the most links are established as most important, and during the recalculation, their links transmit more weight.
This produces a completely new set of scores and even allows for situations where a page with only one link to it could do better than another with two links, if that single link is from a very important page.
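To make the one-link-beats-two case concrete, here is a rough sketch of the recalculation. It follows the simplified description in this article rather than the published HITS algorithm, which keeps separate hub and authority scores. The rule that each link carries a base weight of 1 plus the linking page's current score is an assumption made for this example, as are the function name and the page names.

```python
def weighted_link_scores(links, rounds=3):
    """Re-score pages so links from well-linked pages carry more weight.

    `links` maps each page to the set of pages it links to. Each link carries
    a base weight of 1 plus the current score of the linking page; that exact
    rule is an assumption made for this sketch.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    # Initial ranking: plain in-link counts.
    scores = {page: 0.0 for page in pages}
    for source, targets in links.items():
        for target in targets:
            if target != source:
                scores[target] += 1.0

    # Recalculate, letting links from higher-scoring pages transmit more weight.
    for _ in range(rounds):
        updated = {page: 0.0 for page in pages}
        for source, targets in links.items():
            for target in targets:
                if target != source:
                    updated[target] += 1.0 + scores[source]
        scores = updated
    return scores


links = {
    "a": {"hub"}, "b": {"hub"}, "c": {"hub"}, "d": {"hub"},
    "hub": {"single"},            # one link, but from a well-linked page
    "e": {"double"}, "f": {"double"},  # two links from pages nobody links to
}
scores = weighted_link_scores(links)
```

Here "single" has only one incoming link, but it comes from the well-linked "hub" page, so it outscores "double" with its two links from obscure pages. The scores are not normalized, so in a graph with link cycles they grow with each round; a real system would need to rescale them.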
Repeating this recalculation a number of times further refines scores. Nor does Clever stop there. A series of other tweaks are also made to help improve relevancy.
A key component is to consider text within and near the link. If the actual search term appears, then that link transmits more weight to the page it points at. Clever also discounts the weight of links between pages at the same web site.
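These two tweaks might look something like this in code. The multipliers are illustrative guesses, not IBM's values, and the sketch checks only the link text itself, not the text near it. It also assumes absolute URLs and treats pages as being on the same site when their hostnames match.

```python
from urllib.parse import urlparse


def link_weight(source_url, target_url, anchor_text, query,
                term_boost=2.0, same_site_factor=0.1):
    """Weight one link; the boost and discount values are illustrative guesses."""
    weight = 1.0
    if query.lower() in anchor_text.lower():
        weight *= term_boost        # the search term appears in the link text
    if urlparse(source_url).hostname == urlparse(target_url).hostname:
        weight *= same_site_factor  # links within one site count for less
    return weight


outside = link_weight(
    "http://fans.example/links.html",
    "http://www.whitehouse.gov/",
    "Bill Clinton at the White House",
    "bill clinton",
)
internal = link_weight(
    "http://www.whitehouse.gov/index.html",
    "http://www.whitehouse.gov/tour.html",
    "Take the tour",
    "bill clinton",
)
```

The outside link that names the search term gets double weight, while the site's internal navigation link is cut to a tenth.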
The end result of this is a list of top ranked pages. However, IBM doesn't see Clever being used to provide real-time search results in response to queries. Instead, they feel the value will be to create constantly refreshed lists of relevant pages for categories.
Specifically, imagine the situation at Yahoo, where there are thousands of different subjects. Clever researchers believe their technology could be used to populate these categories with minimal human assistance. Give Clever a few terms relevant to the subject, and it will leverage links to fill the category with the best pages on the web.
"You don't have to have an army of ontologists to stay current," said Prabhakar Raghavan, a researcher on the Clever project.
So how good are the results? The system isn't available for testing outside of IBM's firewall, but in a recent study IBM conducted, Clever's results were as good as or better than Yahoo's results 81% of the time.
That's IBM's research, of course, but I think it's pretty trustworthy. One need only look at Google to see how effective links can be in improving relevancy.
The last time some students at Stanford University got involved with categorizing the web, it turned into a little site you may have heard of called Yahoo. That alone makes me surprised someone hasn't yet swooped in to carry off Google developers Larry Page, Sergey Brin and Craig Silverstein into portal heaven. It's even more surprising given that the engine they've put together is really good and even has a catchy name.
Google is an experimental search engine that, like Clever, uses weighted link popularity as a primary part of its ranking mechanism. Each page has a rank, based on the number of other pages linking to it and the importance of those pages. Importance, as with Clever, is derived from an overall link count.
Google also makes extensive use of the text within hyperlinks. This text is associated with the pages the link points at, and it makes it possible for Google to find matching pages even when these pages cannot themselves be indexed.
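A simple way to picture this is an index built from link text alone. The sketch below is my own illustration, not Google's implementation; the function names and sample URLs are made up. Note that the image file in the example could never be matched on its own words, since it contains none.

```python
import re
from collections import defaultdict


def build_anchor_index(links):
    """Map each word to the set of target URLs whose incoming link text contains it.

    `links` is an iterable of (source_url, target_url, anchor_text) tuples.
    """
    index = defaultdict(set)
    for _source, target, anchor_text in links:
        for word in re.findall(r"[a-z0-9]+", anchor_text.lower()):
            index[word].add(target)
    return index


def search_anchor_index(index, query):
    """Return targets whose incoming anchor text covers every query word."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    matches = [index.get(word, set()) for word in words]
    return set.intersection(*matches)


links = [
    ("http://a.example/", "http://images.example/clinton.gif", "Bill Clinton photo"),
    ("http://b.example/", "http://www.whitehouse.gov/", "White House"),
]
index = build_anchor_index(links)
hits = search_anchor_index(index, "bill clinton")
```

The search finds the image purely from the words other pages used when linking to it.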
An important difference from Clever is that Google actually crawls the web itself, rather than analyzing a core set of pages from another search engine. Thus, its results should be more comprehensive. Over 25 million pages have been indexed, and the goal is to gear up toward 100 million or more.
Google also provides some ranking boosts based on page characteristics. The appearance of terms in bold text, in header text, or in a large font size are all taken into account. None of these are dominant factors, but they do figure into the overall equation.
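As a rough picture of how such boosts could work, the sketch below gives each occurrence of a term a small multiplier depending on how it is styled. The multipliers and names are hypothetical; I don't know the values Google actually uses, only that these factors are not dominant.

```python
# Hypothetical multipliers, kept close to 1.0 because these are minor factors.
STYLE_BOOSTS = {"plain": 1.0, "bold": 1.2, "header": 1.5, "large_font": 1.3}


def styled_term_score(term, occurrences, boosts=STYLE_BOOSTS):
    """Sum a per-occurrence weight for `term`, boosted by how each hit is styled.

    `occurrences` is a list of (word, style) pairs taken from one page.
    """
    total = 0.0
    for word, style in occurrences:
        if word.lower() == term.lower():
            total += boosts.get(style, 1.0)  # unknown styles get no boost
    return total


occurrences = [
    ("Disney", "header"),
    ("disney", "plain"),
    ("disney", "bold"),
    ("world", "plain"),
]
score = styled_term_score("disney", occurrences)
```

Three mentions of "disney" score a little more than three plain mentions would, because one sits in a header and another in bold.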
So how about the results? I think many people will be pleased, especially for the ever-popular single and two-word queries. A search for "bill clinton" brought the White House site up at number one. A search for "disney" top-ranked disney.com, and sections within it like Disney World, the Disney Channel, and Walt Disney Pictures. Yet interesting alternative sites, such as Werner's Unofficial Disney Park Links, also made it on the list.
Will Google be going commercial? Page has no opposition to it, but said there's no particular hurry.
"We're Ph.D. students, we can do whatever we want," he said. And what they want is to find the right partners to let them focus on improving relevancy. "I'd like to build a service where the priority is on giving users great results," Page said.
If you pay a visit, don't be frightened by the interface. One thing Google needs is a good facelift. Relevancy scores and other extraneous information can obscure the actual listings -- but I did say this was an experimental service, right?
The value of links hasn't been lost on the major search engines, of course. Infoseek chairman Steve Kirsch recently said that link data was a core component of his service's new retrieval algorithm. And Excite has long made use of link data as part of its ranking mechanism.
"Most in the industry would recognize that the links pointing into a site give you a fair idea of the visibility of the site and its prominence," said Excite search product manager Kris Carpenter.
It's likely that there will continue to be a growing emphasis on non-traditional data such as links and user feedback to make sense of the web, as opposed to just the words on the page. It's essential in a web universe where the text on some pages cannot be trusted, and where other pages cannot be indexed at all.
Google WebBase Research Pages
Those interested in how to build a search engine will enjoy this goldmine of information. Be warned -- it's all highly technical.