HOW MANY WEBSITES WENT DARK?: An Educated Guess (v 1.0)
By Michael A. Norwick
As you probably know, during the 48-hour (or longer) period following President Clinton's signing of the Telecommunications Bill, hundreds of webmasters turned their web pages black (following the lead of the Voters Telecomm Watch and the Center For Democracy and Technology) to protest the bill's restrictions on free speech on the Internet. I found this relatively spontaneous activism quite profound, and was amazed by the response. Because the protest was planned to last only 48 hours, I thought it imperative to attempt to quantify just how pervasive the "thousand points of darkness" movement was during the fleeting hours following the enactment of the Communications Decency Act. This semi-formal study, which I conducted myself, was an attempt to scientifically estimate how many websites participated in this protest.
Between 5:00 PM Friday, February 9th and 3:00 AM Saturday, February 10th, I accessed 610 URLs by clicking on Yahoo!'s (www.yahoo.com) "Random Yahoo! Link" selector. Of these 610 URLs, I systematically excluded three types of sites when calculating the results of this survey: 1) sites that were no longer up and running or could not be reached because of network problems, 2) sites with a non-United States top-level domain, and 3) FTP and gopher sites. If a page had moved to a new URL but left a link behind, I followed the link and used that page in the results (assuming the page was not excluded by any of the above screeners). Sites that were black but did not contain a message about the black background were rechecked on Friday, February 16th in order to determine whether the page normally has a black background. Sites that were still black at that time and did not contain any message about the protest were recorded as non-participants in the protest. Other sites that were still dark on February 16th but contained protest symbols, e.g. blue ribbons, were contacted via email to determine whether they had turned their site black during the week of February 8th. The final sample consisted of 420 websites. A list of the URLs that were used in the sample is available HERE.
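The three screeners described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the report does not say which top-level domains were counted as U.S., so the set ASSUMED_US_TLDS is a guess, and the function name screen and the example URLs are hypothetical.

```python
from urllib.parse import urlparse

# Assumption: the report does not list which top-level domains counted as
# U.S.; this set is a guess used purely for illustration.
ASSUMED_US_TLDS = {"com", "edu", "gov", "mil", "net", "org", "us"}

def screen(url, reachable=True):
    """Return None if the URL stays in the sample, else the reason it is dropped."""
    if not reachable:
        return "unreachable"            # screener 1: dead site or network problem
    parts = urlparse(url)
    if parts.scheme in ("ftp", "gopher"):
        return "ftp/gopher"             # screener 3: not a web site
    host = parts.hostname or ""
    tld = host.rsplit(".", 1)[-1]
    if tld not in ASSUMED_US_TLDS:
        return "non-U.S. domain"        # screener 2: non-U.S. top-level domain
    return None

# Hypothetical URLs, not taken from the actual sample.
for url in ("http://www.example.com/", "ftp://ftp.example.edu/pub",
            "http://www.example.co.uk/"):
    print(url, "->", screen(url) or "kept")
```

Whether a site was reachable has to be supplied by whoever visits it, which is why it is a plain argument here rather than something the sketch tries to detect.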
III. EXPLANATION OF METHODOLOGY AND LIMITATIONS
A. Why Yahoo!?
A true scientific survey of the percentage of U.S. websites that went dark would require a complete list of U.S.-based sites from which to draw a random sample. This list, of course, does not exist. In my haste to find the next best thing, I decided that the Yahoo! database was probably the largest list of websites that could already produce a simple random sample from its own database (a Yahoo! staff member verified that the site selection method is a simple random sample, and that the database was up-to-date). Yahoo! was also a good choice because it covers a fairly diverse range of topics, and it usually leads you to the top-level home page of a website rather than to multiple pages within the same site. For example, if I were to take a random sample from a larger search engine (one that "crawls" the web), there might be 50 pages listed all from www.hotwired.com/????. This would skew the sample towards larger sites. In addition, many sites may have darkened only their top-level home page, so a sample of interior pages could miss them.
On the downside, I think it is necessary to be cautious about treating U.S. Yahoo! sites as representative of all U.S. websites. My understanding is that Yahoo! gets most of its URLs from people who submit them to Yahoo!'s webmasters. Thus, the population of webmasters who want to be listed in Yahoo! may differ from the rest of the population of webmasters. For example, Yahoo! might over-represent commercial websites that want maximum publicity. Yahoo! also probably under-represents newer websites whose owners have not yet submitted them.
B. Why exclude Non-U.S. Domains?
I now believe this was a mistake. My initial assumption was that Americans should not expect the rest of the world to protest against the actions of the U.S. Government. But while the CDA mostly exposes Americans to criminal penalties, it affects the whole world by restricting content that originates within the U.S. Halfway through the study, I realized that I should have recorded international domains as well and then published the results for both groups. Unfortunately, I did not keep the data on non-U.S. based systems. Anecdotally, I saw a few darkened non-U.S. sites in the course of selecting the sample, although clearly not at as high a rate as among the U.S. sites (and I am well aware that many international sites participated in the protest, based on the list of participants I saw at the VTW website). Unfortunately, I initiated this study fairly spontaneously and did not think everything through thoroughly.
C. Is it fair to include sites that normally have black backgrounds as non-participants in the protest?
This survey is an analysis of how many sites "went black", not of how many sites "would have gone black, but for the fact that the site's background was already black", or of how many sites "were black" on Friday February 9th. While I suppose all three questions have merit, I thought that the calculation of how many sites "went black" was the most intellectually honest in measuring how successful the protest was.
I make no secret of the fact that I ardently oppose the CDA. I have been very active in opposing censorship on the Internet, and I too turned my web pages black. That said, I do not believe my views affected the outcome of this study in any way. This survey was not sponsored or initiated by any organization or person other than myself, and my main goal was to get an accurate reading of how many websites participated in the protest. The methodology I used was quite mechanical, and my personal opinion did not influence the results. Thus, there should be little or no "response error" or error caused by improper data interpretation, biased research techniques, etc. Ultimately, however, if you wish to accept the findings of the study, you will have to trust that I did not fraudulently select the sample.
Of the 420 U.S.-based websites included in the sample, 29 sites (7%) turned their pages black in order to protest the Communications Decency Act. The statistical margin of error is plus or minus 2 1/2 percentage points at the 95% confidence level. Thus we can be 95% confident that between 4 1/2 and 9 1/2 percent of all currently operating U.S.-based sites that are listed on Yahoo! participated in the protest.
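The margin quoted above can be checked with the usual normal-approximation formula for a sample proportion. The sketch below is only an illustration: the report does not say how the margin was computed, so the choice of that formula, the z value of 1.96, and the function name proportion_interval are all assumptions.

```python
import math

def proportion_interval(successes, n, z=1.96):
    """Normal-approximation interval for a sample proportion.

    Assumption: this textbook formula is used for illustration; the report
    does not state which method produced its margin of error.
    """
    p = successes / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return p, margin, p - margin, p + margin

# 29 darkened sites out of the 420 in the final sample.
p, margin, low, high = proportion_interval(29, 420)
print(f"share = {p:.1%}, margin = +/- {margin:.1%}")
print(f"95% interval: {low:.1%} to {high:.1%}")
```

With 29 of 420 this gives roughly 6.9% plus or minus 2.4 points, which is consistent with the rounded 7% and 2 1/2 points quoted above.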
A. How Many Websites on Yahoo! Went Dark?
Unfortunately, because Yahoo! management refused to provide me with some specific data that I requested, it is not possible to estimate precisely how many sites on Yahoo! went dark. As of February 9th, however, there were 202,178 URLs listed on Yahoo! (based on the sum of the numbers Yahoo! puts next to each top-level subcategory). Because Yahoo! also lists Usenet newsgroups in its catalog (which are not available through the Yahoo! Random Link), I am going to guess that there are about 3,000 Usenet groups listed in the Yahoo! database, and subtract that from the total. Based on my experience that only 420 out of 610 (69%) sites accessed met the criteria for the population I sought to sample, I estimate that there were about 137,000 U.S.-based websites accessible through Yahoo! on February 9th. This is a conservative estimate, considering that many of the sites that I could not reach because of network problems may have been accessible at other times during the 48-hour protest, or accessible from networks other than the one that I am on. Based on these assumptions, somewhere in the neighborhood of 9,500 U.S.-based websites on Yahoo! participated in the protest. However, even if all my assumptions are dead-on (which I'm sure they're not), the statistical margin of error dictates that this number could be as low as 6,000 or as high as 13,000. Count websites worldwide and I'd guess that these numbers would be at least 5% higher (assuming a conservative guess that 2% of non-U.S. sites went black is true).
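For readers who want to retrace the arithmetic in this section, here is a minimal Python sketch. All the figures come from the text above; the constant and function names are hypothetical, and the sketch carries the unrounded 420/610 fraction through, so its results differ slightly from the rounded numbers in the text.

```python
TOTAL_LISTED = 202_178    # URLs listed on Yahoo! as of February 9th
USENET_GUESS = 3_000      # the author's guess at Usenet groups in the catalog
ACCESSED = 610            # URLs drawn with the Random Yahoo! Link
IN_SAMPLE = 420           # URLs that passed the three screeners

def us_sites_on_yahoo():
    """Estimated number of reachable U.S.-based websites listed on Yahoo!."""
    return (TOTAL_LISTED - USENET_GUESS) * IN_SAMPLE / ACCESSED

def protesting_sites(share):
    """Estimated number of those sites that went dark, for a given share."""
    return us_sites_on_yahoo() * share

print(f"U.S. sites on Yahoo!: about {us_sites_on_yahoo():,.0f}")
for label, share in (("low", 0.045), ("point", 0.07), ("high", 0.095)):
    print(f"{label:>5}: about {protesting_sites(share):,.0f} sites")
```

This yields about 137,000 sites, and about 6,200, 9,600 and 13,000 protesting sites at 4 1/2, 7 and 9 1/2 percent, in line with the rounded figures given above.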
B. How Many Websites on the Entire Web Went Dark?
Two problems plague any attempt at estimating the total number of sites that went dark based on this survey. First, you have to assume that the population of websites listed on Yahoo! is representative of all websites. As explained in Section III.A, this is unlikely. Second, and far more devastating, is the fact that there is absolutely no accurate estimate of how many "websites" there are on the web. While there are some fairly reliable statistics available on how many web servers there are, and how many http URLs there are, no one really knows how many "sites" there are (a site being a group of related and connected pages run by one person, business or organization). Therefore, any approximation of the total number of sites that went dark based on this study is statistical voodoo. But given the fact that these figures will not be used in the cure for cancer, or, God forbid, aid us in the election of a new president, I don't see any harm in taking a stab at it.
Lycos estimates that there are currently between 22 and 26 million URLs on the web, based on their estimate that Lycos's computers have collected somewhere between 75 and 90% of the URLs on the web. Seventeen percent of the servers on Lycos are either FTP or gopher sites, so I will drop the web estimate to 20 million URLs. In order to reduce this number to the nationalistic limitations of my study, I will estimate that 82% of the URLs are U.S.-based (based on some older Yahoo! statistics I obtained and on my own experience in obtaining the sample, which unfortunately counted sites, not URLs). This leaves us with a little over 16 million URLs (this includes non-HTML documents and some CGI scripts, both of which are collected by Lycos).
One way of estimating how many websites there are from the number of URLs is to estimate the average number of URLs per website. While this number could actually be scientifically estimated, doing so would be very time-consuming, and is beyond my means. I will make some guesses based on observation. While some websites may have just a single page with no internal links, many of the larger sites have over a hundred URLs within the same site. I'm going to unscientifically guess that the average number of pages per site is somewhere between 10 and 30. Given this guess, and rounding the 16 million URLs down, there would be somewhere between roughly 500,000 and 1.5 million websites in the United States. Assuming 7% of these sites turned their pages black to protest the CDA, the following estimates are applicable:
Low: (500,000 sites)(7%) = 35,000 U.S. sites protesting.
Middle: (1,000,000 sites)(7%) = 70,000 U.S. sites protesting.
High: (1,500,000 sites)(7%) = 105,000 U.S. sites protesting.
Again, I'd add about 5% to the totals to include all sites internationally. Unfortunately, not only are the numbers based on shaky facts, but the estimates vary by a very wide margin.
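The chain of guesses in this section can be written out as a short Python sketch so that anyone can substitute their own numbers. It starts from the 20 million web URLs arrived at above; the constant and function names are hypothetical, and the sketch does not round intermediate values the way the text does.

```python
WEB_URLS = 20_000_000     # Lycos-based estimate after dropping FTP and gopher
US_SHARE = 0.82           # the author's guess at the U.S.-based share of URLs
PROTEST_SHARE = 0.07      # share of sampled sites that went dark

def us_sites(pages_per_site):
    """Estimated number of U.S. websites for a guessed average site size."""
    return WEB_URLS * US_SHARE / pages_per_site

def protesting_sites(pages_per_site):
    """Estimated number of U.S. websites that went dark."""
    return us_sites(pages_per_site) * PROTEST_SHARE

# 30 pages per site gives the low estimate, 10 pages per site the high one.
for pages in (30, 20, 10):
    print(f"{pages:>2} pages/site: {us_sites(pages):>11,.0f} sites, "
          f"{protesting_sites(pages):>9,.0f} protesting")
```

Without the rounding used in the text, this lands at about 547,000 to 1.64 million U.S. sites, or roughly 38,000 to 115,000 protesting sites, a little above the rounded figures listed above. The spread comes almost entirely from the pages-per-site guess.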
While the estimates made in this study are not nearly as accurate as I had hoped when I began this project, I hope that they represent a good starting point in calculating the strength of what could safely be called a protest involving tens of thousands of "points of darkness". I welcome others, especially those people involved in calculating Internet statistics, to suggest more accurate estimates, and to make any comments or criticism about my methodology.
Special thanks to Sarah Garnsey at Lycos, Bryan O'Connor at Yahoo!, Ed Kubaitis, Gwenn Gauthier, and everyone else who took my phone calls and answered my e-mail.
ABOUT THE AUTHOR
Michael A. Norwick [email@example.com] is currently a second-year law student at the George Washington University Law School. He holds a Bachelor of Science Degree in Business Administration with a concentration in marketing from Northeastern University. He has worked in the Marketing Research Department of the Stop & Shop Supermarket Company and has also worked as a legal intern for the Electronic Frontier Foundation.
This report has been released into the public domain. Distribute freely.