All right, this is driving me nucking futz. It appears to me that the Adobe specification for ASCII-85 encoding in the PostScript Language Reference, third edition, is either wrong or incomplete. I have been struggling with this thing, presuming that this is another one of my brain farts and that as soon as I open my mouth (or blog, as the case may be), my stupidity will become plainly apparent. But I cannot see how I am misinterpreting or misunderstanding the spec.
For the uninitiated, ASCII-85 is an encoding algorithm that takes binary data and represents it as plain ASCII text, like Base64 and friends. ASCII-85 is superior in that it has a 4:5 input-to-output ratio, while the others expand more (Base64 is 3:4).
The basic operation is to take your 8-bit data 4 bytes at a time. Call them bytes a, b, c, d. Turn these into an unsigned int as follows: value = a*256^3 + b*256^2 + c*256 + d. This value is then encoded into 5 base-85 digits, call them v, w, x, y, z, as follows:
v= (value / 85^4) % 85
w = (value / 85^3) % 85
x = (value / 85^2) % 85
y = (value / 85) % 85
z = value % 85
In theory, the modulus is superfluous on the v value, but I've included it anyway. Before output, the value 33 ('!') is added to each digit to ensure that it falls within the printable ASCII range.
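In C, the whole encode step looks something like this (a quick sketch; the function name is mine, and I'm ignoring the spec's 'z' shortcut for all-zero groups):

#include <stdint.h>

/* Encode one 4-byte group into 5 base-85 digit characters. */
void a85_encode_group(const unsigned char in[4], char out[5])
{
    uint32_t value = ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
                     ((uint32_t)in[2] << 8)  |  (uint32_t)in[3];

    for (int i = 4; i >= 0; i--) {          /* fill z first, v last */
        out[i] = (char)(value % 85 + 33);   /* add the '!' offset */
        value /= 85;
    }
}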
Decoding is then the inverse. Given inputs v, w, x, y, z, where 33 has already been subtracted, you get: value = v*85^4 + w*85^3 + x*85^2 + y*85 + z. To return to the original inputs a, b, c, d, the following applies:
a = (value / 256^3)%256
b = (value / 256^2) % 256
c = (value / 256) % 256
d = value % 256
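The same in C, mirroring the encode sketch (again just the core arithmetic; a real decoder would also have to reject values over 2^32 - 1):

/* Decode 5 digit characters (with the '!' offset still on them)
   back into 4 bytes. No validation or overflow checking. */
void a85_decode_group(const char in[5], unsigned char out[4])
{
    uint32_t value = 0;

    for (int i = 0; i < 5; i++)
        value = value * 85 + (uint32_t)(in[i] - 33);   /* strip '!' offset */

    out[0] = (unsigned char)(value >> 24);   /* a */
    out[1] = (unsigned char)(value >> 16);   /* b */
    out[2] = (unsigned char)(value >>  8);   /* c */
    out[3] = (unsigned char) value;          /* d */
}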
Whitespace is ignored on input, and ~ followed by > indicates end of data. So far, so good. This works just hunky-dory. The issue is when the remaining input is fewer than 4 bytes. According to the spec, you pad the input data with zeros to get a 4-tuple, encode as normal, but only output n + 1 characters, where n is the number of remaining input bytes. The spec says "This information is sufficient to correctly encode the number of final bytes and the values of those bytes". While technically this is true, it is not true if you only follow the terms of the specification.
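Read literally, the spec's final-group rule comes out like this (a sketch reusing a85_encode_group from above):

/* Final partial group, exactly as the spec describes: pad the input
   with zero bytes, encode as usual, emit only n + 1 digits. */
void a85_encode_final(const unsigned char *in, int n,  /* n = 1..3 */
                      char *out)                       /* n + 1 chars */
{
    unsigned char padded[4] = {0, 0, 0, 0};
    char digits[5];

    for (int i = 0; i < n; i++)
        padded[i] = in[i];

    a85_encode_group(padded, digits);

    for (int i = 0; i <= n; i++)
        out[i] = digits[i];
}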
I will demonstrate. Say you have an input file of a single character: ".". Just a period, nothing else (I know, hardly needs encoding but that is beside the point). "." has an ASCII value of 46. So our input value (with zero padding per the spec) becomes: value = 46*256^3 + 0*256^2 + 0*256 + 0 = 771751936. Using base 85 to encode, we get:
v = (value / 85^4) % 85 = (771751936 / 85^4) % 85 = 14
w = (value / 85^3) % 85 = (771751936 / 85^3) % 85 = 66
x = (value / 85^2) % 85 = (771751936 / 85^2) % 85 = 56
y = (value / 85) % 85 = (771751936 / 85) % 85 = 74
z = value % 85 = 771751936 % 85 = 46
Now we only have 1 character of input, so according to the spec, we need to output two characters, 14 and 66, or '/' and 'c' after adding 33. However, if you try to decode using just 14 and 66, you get the following:
value = 14*85^4 + 66*85^3 = 771341000
a = (value / 256^3) % 256 = (771341000 / 256^3) % 256 = 45
Ummmm, 45 != 46. WTF?
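Wiring the sketches above together shows the failure end to end (assuming a85_encode_final and a85_decode_group are in scope):

#include <stdio.h>

int main(void)
{
    unsigned char in[1] = { '.' };   /* ASCII 46 */
    char enc[2];
    char digits[5] = { '!', '!', '!', '!', '!' };  /* '!' == digit 0: the naive padding */
    unsigned char out[4];

    a85_encode_final(in, 1, enc);    /* enc = "/c", per the spec's n + 1 rule */
    digits[0] = enc[0];
    digits[1] = enc[1];
    a85_decode_group(digits, out);

    printf("%d\n", out[0]);          /* prints 45, not 46 */
    return 0;
}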
Now, you can tweak this to make it work, but since the spec makes no reference to this being necessary, where the hell should it happen? On the encoder? Or on the decoder? Which one is supposed to account for this boundary condition, and why the hell isn't it documented in the spec?

I have done some searching on this and I can't find anyone saying that this is a deficiency in the specification. I have found several projects with tons of CVS entries, patch after patch, fixing broken ASCII-85 encoding/decoding algorithms, which strikes me as odd given how simple it is. I have found several implementations on the net that fail on this very condition. One I found violates the spec by including n + 2 bytes of output. It encodes and decodes its own data just fine, but mine would bork decoding its output, since 3 chars of encoded input should mean 2 bytes of output according to the spec, not one. How is it that this hasn't been noticed by anyone? I have to believe it's me, not the spec, but I just can't see where I'm f'ing it up. Anyone with a clue, please give me a whack.
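For the record, the tweak that makes my "." example round-trip is on the decode side: treat the missing digits of a final group as 'u' (84, the highest digit) rather than zero, decode, and keep one byte fewer than the digit count. With "/c" padded to "/cuuu", value = 771955124 and a = 46, as it should be. A sketch, reusing a85_decode_group from above; whether this belongs in the decoder, or whether the encoder should compensate instead, is exactly what the spec never says:

/* Decode a final group of k digits by padding with 'u' (84), then
   keeping the first k - 1 bytes. Padding high instead of low means
   truncating the discarded low bytes can't pull the kept bytes
   down by one, which is what happened with zero padding above. */
void a85_decode_final(const char *in, int k,   /* k = 2..4 digits */
                      unsigned char *out)      /* k - 1 bytes */
{
    char digits[5] = { 'u', 'u', 'u', 'u', 'u' };
    unsigned char bytes[4];

    for (int i = 0; i < k; i++)
        digits[i] = in[i];

    a85_decode_group(digits, bytes);

    for (int i = 0; i < k - 1; i++)
        out[i] = bytes[i];
}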