After reading several of the hits from a quick Google search, it seems there is not a whole lot of consistency when it comes to determining average URL length.

I know IE has a maximum URL length of 2083 characters (from here) - so I have a good maximum to work with.

My concern is that I am writing a URL shortener in PHP (similar to some other questions on SO), and I want to make sure I am not likely to exceed the storage capacity of the server hosting it.

If all URLs were the IE maximum, then 2^32 of them won't fit comfortably anywhere - it would take roughly 2 KB × 4 billion ≈ 8 TB of storage: an unrealistic expectation.

Without adding a trimming function (i.e., purging "old" shortened URLs), what is the safest way to calculate the storage usage of the app?

Is ~34 characters a safe guess? If so, then a fully populated database (using an int type for the primary key) would chew through roughly 146 GB for the URLs alone - call it 292 GB after doubling for any metadata that may need to be stored.
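The arithmetic above can be sanity-checked with a short script (2083 is IE's maximum, 34 is the guessed average, and 2^32 rows assumes a 4-byte unsigned int primary key):

```python
# Back-of-envelope storage math for a URL shortener keyed by a 32-bit
# unsigned int primary key (2^32 possible rows).
MAX_KEYS = 2 ** 32  # 4,294,967,296 possible short codes

def storage_gb(avg_url_len: int) -> float:
    """Raw bytes needed to store one avg_url_len-character URL per key, in decimal GB."""
    return MAX_KEYS * avg_url_len / 1e9

# Worst case (~8.9 TB; rounding 2083 down to 2 KB gives the "~8 TB" figure)
print(f"worst case (2083 chars): {storage_gb(2083) / 1000:.1f} TB")
# 34-char average (~146 GB; ~292 GB doubled for metadata)
print(f"34-char average:         {storage_gb(34):.0f} GB")
```

Note this counts only the raw URL bytes; indexes, row headers, and key columns add more on top.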

What is the best-guess for an application such as this?

    
You want to store 2 billion urls? – Ted Hopp May 29 '11 at 17:18
    
@Ted Hopp - I'm looking at worst-case maxima, not what I truly anticipate – warren May 29 '11 at 20:14

Well, you don't need to know the average URL length. It's a guess, but I'd figure that a URL shortener is mainly used to shorten long URLs. Why bother shortening one that is already short? :)

That said, there's another issue. A database has some overhead too, so you can't just calculate an average URL length and call that the average byte size per row.

I've written a URL shortener myself and it contains about 45 items so far. So I'd suggest you write yours, and by the time it actually contains 2^32 URLs, buying an 8 TB hard disk will probably not pose a problem anymore. ;-)


This is probably unknowable without indexing the entire Internet, but according to an analysis by Kelvin Tan of a dataset of 6,627,999 unique URLs from 78,764 unique domains, the answer is 76.97 characters:

Mean: 76.97

Standard deviation: 37.41

95th percentile: 157

99.5th percentile: 218


I'm not sure what is typical, but of the 11,000 URLs in our request database, the average length is 62 characters. We may be an exception, because every month we receive hundreds of requests from our customers for items from Japan. Our database includes hundreds of URLs of several hundred characters each. The longest is a Google Translate link at 1689 characters.

top 10 len(producturl): 1689 792 707 693 647 606 574 569 562 560

Sample URL (647 characters):

http://www.amazon.co.jp/%E9%AD%94%E7%95%8C%E6%88%A6%E8%A8%98%E3%83%87%E3%82%A3%E3%82%B9%E3%82%AC%E3%82%A4%E3%82%A24-%E5%88%9D%E5%9B%9E%E9%99%90%E5%AE%9A%E7%89%88-%E5%A0%95%E5%A4%A9%E4%BD%BF%E3%83%95%E3%83%AD%E3%83%B3-%E3%83%97%E3%83%AD%E3%83%80%E3%82%AF%E3%83%88%E3%82%B3%E3%83%BC%E3%83%89%E4%BB%98%E3%81%8D%E7%89%B9%E8%A3%BD%E3%82%AB%E3%83%BC%E3%83%89-%E3%83%88%E3%83%AC%E3%83%BC%E3%83%87%E3%82%A3%E3%83%B3%E3%82%B0%E3%82%AB%E3%83%BC%E3%83%89%E3%80%8C%E3%83%B4%E3%82%A1%E3%82%A4%E3%82%B9%E3%82%B7%E3%83%A5%E3%83%B4%E3%82%A1%E3%83%AB%E3%83%84%E3%80%8D%E9%99%90%E5%AE%9APR%E3%82%AB%E3%83%BC%E3%83%89%E4%BB%98%E3%81%8D/dp/B0043RT8UO/ref=pd_rhf_p_t_1

For estimating purposes, you should extrapolate from a sample dataset after using the standard deviation to throw out the outliers that could distort your mean.
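As a sketch of that approach (the two-standard-deviation cutoff is an arbitrary choice, and the lengths are illustrative):

```python
import statistics

# Illustrative URL lengths; the 1689-character link is the kind of
# outlier that would distort a naive mean.
lengths = [34, 45, 55, 62, 76, 77, 88, 100, 120, 1689]

mean = statistics.mean(lengths)
stdev = statistics.stdev(lengths)

# Keep only values within two standard deviations of the raw mean.
trimmed = [n for n in lengths if abs(n - mean) <= 2 * stdev]

print(f"raw mean:     {mean:.1f}")
print(f"trimmed mean: {statistics.mean(trimmed):.1f}")
```

With one extreme value in a small sample, the raw mean lands far above every typical URL; trimming it gives a figure much closer to what most rows will actually cost.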


From RFC 2068 section 3.2.1:

The HTTP protocol does not place any a priori limit on the length of a URI. Servers MUST be able to handle the URI of any resource they serve, and SHOULD be able to handle URIs of unbounded length if they provide GET-based forms that could generate such URIs. A server SHOULD return 414 (Request-URI Too Long) status if a URI is longer than the server can handle (see section 10.4.15).

Note: Servers should be cautious about depending on URI lengths above 255 bytes, because some older client or proxy implementations may not properly support these lengths.

Although IE (and probably most other browsers) supports much longer URIs, I don't believe most forms or client-side apps rely on anything above 255 bytes working. Your server logs should provide some statistics about the kinds of URLs you are seeing.
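One way to pull those statistics out of an access log, assuming the common/combined log format where the request line is the quoted field (the sample lines and regex are illustrative):

```python
import re
import statistics

# Matches the quoted request line of a common-log-format entry,
# e.g. "GET /some/path?x=1 HTTP/1.1" (assumes the standard format).
REQUEST_RE = re.compile(r'"[A-Z]+ (\S+) HTTP/[\d.]+"')

def url_lengths(log_lines):
    """Yield the length of the request URI from each matching log line."""
    for line in log_lines:
        m = REQUEST_RE.search(line)
        if m:
            yield len(m.group(1))

sample = [
    '1.2.3.4 - - [29/May/2011:17:09:00 +0000] "GET /short HTTP/1.1" 200 512',
    '1.2.3.4 - - [29/May/2011:17:09:01 +0000] "GET /a/much/longer/path?with=query&params=1 HTTP/1.1" 200 512',
]
lengths = list(url_lengths(sample))
print(f"mean URI length: {statistics.mean(lengths)}")
```

As Gumbo points out below, this measures the URIs your own server serves, which may not resemble the target URLs people will ask the shortener to shorten.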

I think it’s rather the target URL length he is worrying about. – Gumbo May 29 '11 at 17:09
    
Well, yes, but who generates the target URL? Typically, long URLs come from links generated by web apps, or by url-encoded form submissions, javascript, or other client-side code. – Ted Hopp May 29 '11 at 17:21
But how should observing the server’s logs solve this? I don’t think he wants to shorten his own server’s URLs. – Gumbo May 29 '11 at 17:33
    
I have no idea what he wants to shorten. I just suggested that as a way to collect raw data about url/uri lengths. But you're right, it might be data that is irrelevant to what he's trying to do. – Ted Hopp May 29 '11 at 18:04
