> John Scott's May 2004 Sandbox explanation revisited
Michael Martinez
post Feb 7 2006, 06:06 PM
Post #1


Group: Members
Posts: 865
Joined: 25-January 04
From: Houston
Member No.: 2

On May 6, 2004, John Scott wrote the following at Cre8asite's forum:

My source for that minus out thing, and the idea of links being on probation, is a fellow who used to work with Krishna Bharat at another company. (Not Google.)

The probation does not apply to new sites. It applies to links. When the algorithm was deployed certain older links were grandfathered in. After that, links will be (are being) given partial credit, and be essentially on "probation".

It applies to links, not sites. And the age of the link is not the only factor. The IP range of the links and other considerations are made, and the person who I discussed this with said that Krishna Bharat is at Google primarily to develop and implement this new algorithm. It is supposed to radically change the way links are evaluated.

QUOTE(Forum Member)

Hi, so how long would new links be on "probation"?

I have no idea. The fellow I talked to said that the link probation was set in phases, and it is entirely up to Google to determine at what point a link should be given full credit.


Also, I want to reiterate that I am totally ignorant in this matter. It's just something I heard from a "guy who knows a guy".

He has been quoted many times since then, but as often happens with these kinds of posts, there has been a severe lack of corroborating information. How can we confirm what he said?

First of all, let's set aside the issue of IP range for a moment. Almost every reference to IP range draws in the Hilltop algorithm (which Krishna Bharat co-developed), and Google has made it clear they are not tossing out sites on the basis of IP range. We can look at the possible impacts of IP range upon links later.

There is very little authoritative information out there (on the Web) about how Google actually looks at links. Virtually every SEO-guru or guru-wannabe has written a (usually very wrong) tutorial on PageRank and the Sandbox. None of them will serve any useful purpose in this discussion as they all lack objective credibility (many of them are certainly touted as authorities by members of the SEO community, but such accolades provide no objective credibility).

Our objectively credible sources of information are, as usual, Google representatives and the Google Web site itself.

First off, let me quote Matt Cutts on topics pertaining to link reputation:

What is an update? Google updates its index data, including backlinks and PageRank, continually and continuously. We only export new backlinks, PageRank, or directory data every three months or so though. (We started doing that last year when too many SEOs were suffering from "B.O.", short for backlink obsession.) When new backlinks/PageRank appear, we've already factored that into our rankings quite a while ago. So new backlinks/PageRank are fun to see, but it's not an update; it's just already-factored-in data being exported visibly for the first time in a while.
Source: http://www.mattcutts.com/blog/whats-an-update/

As others have noted, if you're going to sell text links that pass reputation/PageRank, the way to do it is to add rel=nofollow to those links.

Tim points out that these links have been sold for over two years. That's true. I've known about these O'Reilly links since at least 9/3/2003, and parts of perl.com, xml.com, etc. have not been trusted in terms of linkage for months and months. Remember that just because a site shows up for a "link:" command on Google does not mean that it passes PageRank, reputation, or anchortext.

Google's view on this is quite close to Phil Ringnalda's. Selling links muddies the quality of the web and makes it harder for many search engines (not just Google) to return relevant results. The rel=nofollow attribute is the correct answer: any site can sell links, but a search engine will be able to tell that the source site is not vouching for the destination page.

Posted by: Matt Cutts at August 24, 2005 09:31 AM
Source: http://radar.oreilly.com/archives/2005/08/...engine_s_2.html

The best links are not paid, or exchanged after out-of-the-blue emails; the best links are earned and given by choice. When I recap SES from my viewpoint, I'll give some examples of great ways to earn links.
Source: http://www.mattcutts.com/blog/seo-mistakes...xchange-emails/

But for everyone else, let me talk about why we consider it outside our guidelines to get PageRank via buying links. Google (and pretty much every other major search engine) uses hyperlinks to help determine reputation. Links are usually editorial votes given by choice, and link-based analysis has greatly improved the quality of web search. Selling links muddies the quality of link-based reputation and makes it harder for many search engines (not just Google) to return relevant results. When the Berkeley college newspaper has six online gambling links (three casinos, two for poker, and one bingo) on its front page, it's harder for search engines to know which links can be trusted.

At this point, someone usually asks me: "But can't you just not count the bad links? On the dailycal.org, I see the words 'Sponsored Resources'. Can't search engines detect paid links?" Yes, Google has a variety of algorithmic methods of detecting such links, and they work pretty well. But these links make it harder for Google (and other search engines) to determine how much to trust each link. A lot of effort is expended that could otherwise be spent on improving core quality (relevance, coverage, freshness, etc.). And you can imagine how the people trying to get link popularity have responded. Someone forwarded me an email from a "text link broker" that included this suggestion:
Source: http://www.mattcutts.com/blog/text-links-and-pagerank/

The article also implies that avatarfinancial.com is ranking higher because Rand Fishkin bought some backlinks. We've already covered this territory. Rand, those paid links from the Harvard Crimson and elsewhere aren't helping the site. In fact, it looks like you bought links from the same network that the other two sites at the site clinic were buying from. And I doubt Rand was expecting any direct PageRank impact from Avatar's prweb.com press release. But what is helping is good content like the articles about non-conforming loans and the new blog on that site. That's why when I see strong links from Yahoo's directory, Dmoz, and Wikipedia to Avatar, I'm not very surprised.
Source: http://www.mattcutts.com/blog/seo-article-in-newsweek/

Weary, Yahoo links are helpful because they're high PageRank, but that's the only reason; there's no special "Yahoo boost" or edu-boost or gov-boost. Those links just tend to be higher quality.
Source: Ibid.

Next, let me quote the ubiquitous GoogleGuy:

GoogleGuy Says:
IITian, I appreciate your suggestion. People just need to know that new pages start out at zero and work their way up over time as more PageRank/links point to their site. It's just like anything else--when you're new, nobody knows your reputation. Over time, people build up trust as they get to know you. A low PageRank for a new page is nothing to worry about.
Source: http://www.markcarey.com/googleguy-says/ar...reputation.html

Finally, let me quote a few things from Google's even more ubiquitous Web site:


1. How often will Google crawl my site?

Google's spiders regularly crawl the web to rebuild our index. Crawls are based on many factors such as PageRank, links to a page, and crawling constraints such as the number of parameters in a URL. Any number of factors can affect the crawl frequency of individual sites.

Our crawl process is algorithmic; computer programs determine which sites to crawl, how often, and how many pages to fetch from each site. For tips on maintaining a crawler-friendly website, please visit our Webmaster Guidelines.


3. Why is my site labeled "Supplemental"?

Supplemental sites are part of Google's auxiliary index. We're able to place fewer restraints on sites that we crawl for this supplemental index than we do on sites that are crawled for our main index. For example, the number of parameters in a URL might exclude a site from being crawled for inclusion in our main index; however, it could still be crawled and added to our supplemental index.

The index in which a site is included is completely automated; there's no way for you to select or change the index in which your site appears. Please be assured that the index in which a site is included does not affect its PageRank.
Source: http://www.google.com/webmasters/faq.html

2. My page's location in the search results keeps changing.

Each time we update our database of webpages, our index invariably shifts: we find new sites, we lose some sites, and sites' ranking may change. Your rank naturally will be affected by changes in the ranking of other sites. No one at Google hand adjusts the results to boost the ranking of a site. The order of Google's search results is automatically determined by many factors, including our PageRank algorithm, and is described in more detail here.

You might check to see if the number of other sites that link to your URL has decreased. This is the single biggest factor in determining which sites are indexed by Google, as we find most pages when our robots crawl the web, jumping from page to page via hyperlinks. To find a sampling of sites that link to yours, try a Google link search.

3. My pages don't return for certain keywords.

Google does not manually assign keywords to sites, nor do we manually "boost" the rankings of any site. The ranking process is completely automated and takes into account more than 100 factors to determine the relevance of each result.

If you'd like your site to return for particular keywords, include these words on your pages. Our crawler analyzes the content of webpages in our index to determine the search queries for which they're most relevant. If your site clearly and accurately describes your topic and many other websites link to yours, it'll likely return as a search result for your desired keywords.
Source: http://www.google.com/webmasters/4.html

The software behind Google's search technology conducts a series of simultaneous calculations requiring only a fraction of a second. Traditional search engines rely heavily on how often a word appears on a web page. Google uses PageRank� to examine the entire link structure of the web and determine which pages are most important. It then conducts hypertext-matching analysis to determine which pages are relevant to the specific search being conducted. By combining overall importance and query-specific relevance, Google is able to put the most relevant and reliable results first.
Source: http://www.google.com/corporate/tech.html

Well, there is a lot of high-level conceptual information there, but not much in the way of details. Nonetheless, I think we can reliably summarize a few key facts (including some not mentioned above):
  • Links confer PageRank, Reputation, and Anchor Text
  • Google uses links to determine how important individual documents are
  • Google values freely given, editorially selected links over paid links
  • Google arbitrarily devalues paid links that it detects (isolating "portions" of sites to do so)
  • Google updates its backlinks and PageRank far more often than it reports ("exports")
  • Google crawls pages more frequently based (at least in part) on PageRank
  • New pages start out (in the index?) with no PageRank
  • Documents listed in the auxiliary ("Supplemental") index acquire PageRank normally
  • Google combines importance with relevance to determine rankings
  • In July 2005, Google filtered a huge number of automated directory-like sites from their listings
  • Since acquiring Blogger.com, Google has removed a huge number of automated blogs (spam blogs)
NOTE: When I use the term "PageRank", I am not referring to the meaningless fluff number displayed by the Google Toolbar.
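As a baseline for the first two bullets, the published PageRank formula itself is easy to sketch. The toy power iteration below also shows why a brand-new page with no inbound links sits at the minimum score until real links accrue; this is the public formula only, and Google's production system obviously differs in scale and detail:

```python
def pagerank(links, damping=0.85, iters=50):
    """Minimal power-iteration PageRank over a dict mapping each page
    to the list of pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        # Every page gets the base (1 - d) / n share...
        new = {p: (1 - damping) / n for p in pages}
        for src, targets in links.items():
            if targets:
                # ...plus an equal cut of each linking page's rank.
                share = damping * pr[src] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # Dangling page: spread its rank uniformly.
                for p in pages:
                    new[p] += damping * pr[src] / n
        pr = new
    return pr

# "new" links out but has no inbound links, so it stays at the floor.
graph = {"home": ["about", "blog"], "about": ["home"],
         "blog": ["home"], "new": ["home"]}
scores = pagerank(graph)
```

In this graph "home" accumulates rank from every other page, while "new" never rises above (1 - d) / n, which is the formula's version of "new pages start out at zero."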

Now, most people who speak about the Sandbox feel that new sites are being automatically held back by an aging delay or penalty until they prove themselves. And yet, a few sites (including the very high profile walken2000.com) have broken out of the Sandbox very rapidly when they've accumulated a large number of links that their Webmasters did not generate through purchases, link-dropping, or networking.

When the July 2005 update occurred, many Webmasters reported sudden drops in rankings. Earlier in the year, I and others had openly complained about these faux Web sites (which came in 3 varieties: scraper sites, XML/RSS-driven pseudo-directory sites, and DMOZ clones) cluttering search results. The faux pages had high link interconnectivity and were artificially boosting their own PageRank. But they were also boosting the PageRank of many innocent sites. Those innocent sites suffered when they suddenly lost a large number of backlinks.

The implications of Google's changes in behavior, the comments from Matt Cutts and Googleguy, and the official Google explanations are pretty clear: new documents (with respect to the index) have no PageRank, so they are neither going to rank well nor help other documents' rankings.

I infer that a first crawl for any document confers no PageRank. It simply gets the document into the index. So, what happens to the PageRank associated with whatever links bring Google to new documents? Is it discarded or redivided among the remaining outbound links on the source document?

I infer that before Google assigns PageRank, it has to crawl a document through a trusted link -- that is, a link embedded on a document that has accrued PageRank. This inference is congruent with the oft-voiced speculation that generating thousands of new links is a red flag. It's not necessarily a red flag so much as a waste of time because if the links are placed on documents with little to no PageRank, they won't confer any PageRank to the new documents they point to.

I infer that a link may not confer any PageRank the first time it appears on a document regardless of how old the document is. Instead, it may be that Google waits for the link to be recrawled to confirm or verify that it points to the same document. If that is the case, then that means Google maintains an extensive database about links. Hey, that sounds familiar....
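The recrawl-confirmation inference above can be modeled in a few lines. To be clear, everything here is my speculation expressed as code: the class names, the per-link record, and the confirmation threshold are all invented for illustration, not anything Google has described:

```python
from dataclasses import dataclass

@dataclass
class LinkRecord:
    target: str
    times_confirmed: int = 0

class LinkHistory:
    """Speculative model of 'link probation': a link confers no credit
    until it has been recrawled and seen pointing at the same target
    more than once."""

    CONFIRMATIONS_NEEDED = 2  # hypothetical probation threshold

    def __init__(self):
        self.links = {}  # (source_url, anchor_text) -> LinkRecord

    def observe(self, source, anchor, target):
        """Record one crawl's sighting of a link."""
        key = (source, anchor)
        rec = self.links.get(key)
        if rec is None or rec.target != target:
            # First sighting, or the link now points elsewhere:
            # probation starts (or restarts) at one confirmation.
            self.links[key] = LinkRecord(target, times_confirmed=1)
        else:
            rec.times_confirmed += 1

    def confers_credit(self, source, anchor):
        rec = self.links.get((source, anchor))
        return (rec is not None
                and rec.times_confirmed >= self.CONFIRMATIONS_NEEDED)
```

Under this model, a link seen once confers nothing; seen twice pointing at the same place, it starts to count; and redirecting it to a new target restarts the probation clock.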

Of course, patent watchers will be quick to point out that Google patent 20050071741 describes methods for capturing history about documents and links. Okay, we may have some indirect evidence to support the suggestion that Google has incorporated some or all of this patented technology into its primary search service (much of the patent seems more directly related to Google's Personal Search service).

Still, the average age of documents that the patent addresses is an interesting factor. It appears to confirm the possibility for Google to use an aging delay when ranking documents. A document may have to surpass an age threshold.

And just as the patent looks at document ages and aging, so it looks at link ages and aging. Links may in fact come and go on a document. They often do come and go on many dynamic documents such as blogs, forum pages, directory pages, news pages, etc. Link volatility could be used to determine how trustworthy a document is with respect to the links it provides. How many non-spam documents change links on a frequent basis? There must be some interesting metrics regarding link transitions on static and dynamic documents.

It seems to me that there is no way to accurately test what John Scott reported at Cre8asite. Google has had plenty of time to refine whichever part of its algorithm is responsible for what we tend to call the Sandbox Effect. IP address ranges may or may not have been used at first to identify suspect links, for example. Would such criteria be necessary now? Should they have been used to begin with? After all, many Web sites reside on the same IP addresses in shared hosting environments and they are never aware of (do not link to) each other.

However, I have noticed over the past year or so that links from older sites tend to help boost new pages out of the Sandbox Zone pretty quickly. As Mike Grehan observed in his Filthy Linking Rich article from October 2004, "The richer you are with links pointing back to your site, the richer you are likely to become in search marketing terms."

An older site is more likely to have established PageRank, and it's more likely to be trusted, and it's more likely to confer PageRank (and Reputation and Anchor Text). But an older site is also more likely to be crawled more frequently than a newer site. Hence, if Google gets to a new document through any link, it follows that the first crawl confers nothing to the new document but still brings it into the index. Then, as Google recrawls older documents, when it finds a link to the new document on one of those older documents, that link may confer PageRank the first time through or it may only confer PageRank after the first crawl.

That is, you may need for Google to reach your new site several times through links from documents that normally confer PageRank as a way of validating your new document. It may be that new documents have to accumulate the equivalent of validation points in terms of being crawled through trusted inbound links, but those links have to be crawled more than once in order to be trusted.

No one has yet shown any correlation between Toolbar PageRank and the obscure Trust factor that many of us have inferred Google is using. Some people assume this Trust factor is derived from the Trustrank process that Google trademarked, but TrustRank is not a document valuation tool. It is a document filtering process that separates the wheat from the chaff.

That is, TrustRank is only used to determine "the likelihood that pages are reputable" (good). However, TrustRank is very untrustworthy. While there is some evidence that Google uses TrustRank to track news stories, the inherent deficiencies of the algorithm make it unsuitable for the main index.

What I call Trusted Content is more sharply defined than the results TrustRank will return. Even so, the Good Core/Bad Core concepts may only help Google figure out which high-PageRank sites can be trusted to confer PageRank, Reputation, and Anchor Text. These methodologies are both inadequate for helping evaluate the majority of Web sites.

So we're thrown back to looking at document age and link age, neither of which is very helpful in itself. But link age may at least be used to judge the trustworthiness of a link. Call this a factor used to determine LinkTrust, a value that determines whether a link will be allowed to confer PageRank, Reputation, and Anchor Text. Google can turn off the LinkTrust, and so can Webmasters (through rel=nofollow).

With LinkTrust, you can develop Content Trust or Validation by setting a threshold for Trusted Links in the document's backlink set. The sooner that threshold is reached, the sooner the document is allowed to confer PageRank, Reputation, and Anchor Text through its own outbound links.
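The LinkTrust threshold idea reduces to a very small sketch. Again, this is conjecture rendered as code: the threshold value, the per-link "trusted" flag, and the function name are all hypothetical:

```python
def document_validated(inbound_links, trust_threshold=3):
    """Speculative 'validation points' model: a document is allowed to
    confer PageRank/Reputation/Anchor Text through its own outbound
    links only after enough trusted inbound links have been crawled."""
    trusted = sum(1 for link in inbound_links if link.get("trusted"))
    return trusted >= trust_threshold

inbound = [
    {"from": "old-site.example", "trusted": True},
    {"from": "blog.example", "trusted": True},
    {"from": "brand-new-site.example", "trusted": False},
]
# Two trusted links against a threshold of three: still sandboxed.
```

The interesting consequence is the one stated above: a thousand links from untrusted documents move the counter not at all, while a handful from trusted ones clear the threshold quickly.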

It may be that we fail to see these pages rise through the search results only because they lack the right kinds of inbound links -- links that confer sufficient quality value to separate them from the spam documents that Google wants to permanently exclude. A spam filter that qualifies on links actually addresses the problem more directly, as it singles out the one factor that spammers are consistently dependent upon to achieve rankings: inbound linkage. While some spammers pay attention to on-page content, others do not. But virtually all of them build up huge numbers of backlinks (as the v7ndotcom elursrebmem spam contest John Scott set into motion shows).

My conclusion is that we still don't have a smoking gun. We don't know what actually produces the Sandbox Effect. Google engineers have reportedly looked at their algorithm and identified the cause, conceding that the effect, while unintentional, provides some value to them. Nonetheless, I think it's time people go back and reconsider what John Scott had to say. Maybe, after all, he was on the money or very close to it. If so, a lot of people need to adjust their search optimization strategies.

The Sandbox Effect won't be going away any time soon as long as Google likes what it sees.
Michael Martinez
post Feb 8 2006, 12:28 AM
Post #2



NOTE: "walken2000.com" should have been "walken2008.com".


post Feb 8 2006, 09:04 AM
Post #3

Designer & Programmer

Group: Admin
Posts: 1125
Joined: 26-January 04
From: Dallas, Texas
Member No.: 8

Very well written Michael. Once again you add great value to spider-food's forum archive.

I think the most important thing is the editorially given links. For example, a link from an About.com article would be a very good link which may boost your trust way up and pull you out of the sandbox. Why? Well, you know that before an article gets published there, it's going to go through fact checkers and editors before being posted. It's not just a spur-of-the-moment, shoot-from-the-hip page.

I'd also note that Matt Cutts implied press-release links are detectable, so anyone assuming a link from PRWeb or another news outlet with a good reputation will be deemed a trusted link should think twice.

Just my 2 cents. Great read!
Michael Martinez
post Feb 8 2006, 09:23 AM
Post #4



Thanks, Mike. I'd intended to say something about the press releases but had so many interruptions while writing that yesterday I just plum forgot.

Michael Martinez
post Feb 9 2006, 07:36 AM
Post #5



For people who don't like reading patents, here is a quick run-down of factors that may be used to create independent scores. Each of these scores may be used to influence a ranking, so it appears that Google has developed a Score Convergence methodology along with a Decay Rate Scoring methodology.
  • Document Inception Date (see sections 0034-0044)
  • Content Updates/Changes (see sections 0046-0056)
  • Query analysis (see sections 0058-0065)
  • Link-based Criteria (see sections 0067-0080)
  • Anchor text (see sections 0082-0086)
  • Traffic (see sections 0088-0091)
  • User behavior (see sections 0093-0095)
  • Domain-related information (see sections 0097-0102)
  • Ranking history (see sections 0104-0112)
  • User maintained/generated data (see sections 0114-0117)
  • Unique words, bigrams, phrases in anchor text (see sections 0119-0121)
  • Linkage of independent peers (see sections 0123-0125)
  • Document topics (see sections 0126-0129)

Some of these scoring techniques would most easily be applied in the Personal Search service. They would be difficult to implement in the broader main index search. However, for all we know, Big Daddy's new architecture makes it possible for Google to implement everything here across all their search services. For all we know, one of the major updates they've rolled out over the past couple of years implemented all the stuff in this patent.

The Link-based criteria section most directly relates to this discussion. The main points made in this section are:
  • "the link-based factors may relate to the dates that new links appear to a document and that existing links disappear"
  • "Using this date as a reference, search engine 125 may then monitor the time-varying behavior of links to the document, such as when links appear or disappear, the rate at which links appear or disappear over time, how many links appear or disappear during a given time period, whether there is trend toward appearance of new links versus disappearance of existing links to the document, etc."
  • "By analyzing the change in the number or rate of increase/decrease of back links to a document (or page) over time, search engine 125 may derive a valuable signal of how fresh the document is."
  • "search engine 125 may monitor the number of new links to a document in the last n days compared to the number of new links since the document was first found. Alternatively, search engine 125 may determine the oldest age of the most recent y % of links compared to the age of the first link found."
  • "In another exemplary implementation, the metric may be modified by performing a relatively more detailed analysis of the distribution of link dates."
  • "According to another implementation, the analysis may depend on weights assigned to the links. In this case, each link may be weighted by a function that increases with the freshness of the link....In order to not update every link's freshness from a minor edit of a tiny unrelated part of a document, each updated document may be tested for significant changes ... and a link's freshness may be updated (or not updated) accordingly."
  • "links may be weighted based on how much the documents containing the links are trusted (e.g., government documents can be given high trust). Links may also, or alternatively, be weighted based on how authoritative the documents containing the links are (e.g.," non-Toolbar PageRank). Freshness may also be used.
  • "Search engine 125 may raise or lower the score of a document to which there are links as a function of the sum of the weights of the links pointing to it." A lot of fresh links can help an old document look fresh again.
  • "the dates that the links to a document were created may be determined and input to a function that determines the age distribution."
  • "The dates that links appear can also be used to detect 'spam,' where owners of documents or their colleagues create links to their own document for the purpose of boosting the score assigned by a search engine. A typical, 'legitimate' document attracts back links slowly."
  • "the analysis may depend on the date that links disappear."
  • "the analysis may depend, not only on the age of the links to a document, but also on the dynamic-ness of the links. As such, search engine 125 may weight documents that have a different featured link each day, despite having a very fresh link, differently (e.g., lower) than documents that are consistently updated and consistently link to a given target document."
