Adobe’s ASCII-85 Encoding, is it me or is it wrong?

October 4th, 2006 by Outtascope

All right, this is driving me nucking futz. It appears to me that the Adobe specification for ASCII-85 Encoding in the Postscript Reference Version 3 is either wrong, or incomplete. I have been struggling with this thing, presuming that this is another one of my brain farts, that as soon as I open my mouth (or blog as the case may be), my stupidity will become plainly apparent. But I cannot see how I am misinterpretting or misunderstanding the specs.

For the unitiated, ASCII-85 Encoding is an encoding algorithm to take binary data and represent it in plain ascii text, like Base64 etc., ASCII-85 is superior in that it is a 4:5 input to output ratio compared to the others than expand more.

The basic functioning is to take your 8 bit data, 4 bytes at a time. Call them bytes a,b,c,d. Basically turn these into an unsigned int by the following: value = a*256^3+b*256^2+c*256+d. This value is then encoded into 5 base85 digits, call them v,w,x,y,z as follows:

v= (value / 85^4) % 85

w = (value / 85^3) % 85

x = (value / 85^2) % 85

y = (value / 85) % 85

z = value % 85

In theory, the modulus is superflous on the v value, but i’ve included it anyway. Before output, the value 33 (’!') is added to each digit to ensure that it falls within the printable ascii range.
Decoding is then the inverse, given inputs of v,w,x,y,z where 33 has already been subtracted, you get: value = v*85^4 + w*85^3 + x*85^2 + y*85 + z. To return to the original inputs of a,b,c,d the following applies:

a = (value / 256^3)%256

b = (value / 256^2) % 256

c = (value / 256) % 256

d = value % 256

Whitespace is ignored on input, and ~ followed by > indicates end of data. So far, so good. This works just hunky dorey. The issue is when you have remaining data input of size < 4 digits. According to the spec, you pad the input data with zeros to get a 4-tupple input, encode as normal but only output n + 1 digits where n is the number of input digits. The spec says "This information is sufficient to correctly encode the number of final bytes and the values of those bytes". While technically this is true, it is not true if you are only to follow the terms of the specification.

I will demonstrate. Say you have an input file of a single character: ".". Just a period, nothing else (I know, hardly needs encoding but that is beside the point). "." has an ascii value of 46. So our input value (with zero padding per the spec) becomes: value = 46*256^3 + 0 *256^2 + 0*256 + 0 = 771751936. Using base 85 to encode we get:

v = (value / 85^4) % 85 = (7717551936 / 85^4) % 85 = 14

w = (value / 85^3) % 85 = (7717551936 / 85^3) % 85 = 66

x = (value / 85^2) % 85 = (7717551936 / 85^2) % 85 = 56

y = (value / 85) % 85 = (7717551936 / 85) % 85 = 74

z = value % 85 = 7717551936 % 85 = 46

Now we only have 1 character of input, so according to the spec, we need to output two characters, 14 and 66, or '/' and 'c' after adding 33. However, if you try to decode using just 14 and 66, you get the following:

value = 14*85^4 + 66*85^3 = 771341000

a = (value / 256^3) % 256 = (771341000 / 256^3) % 256 = 45

Ummmm, 45 != 46. WTF?

Now, you can tweek this to make it work, but since the spec makes no reference to this being necessary, where the hell should it happen? On the Encoder? Or on the Decoder? Which one is supposed to account for this boundary condition and why the hell isn't it documented in the spec? I have done some searching on this and I can't find anyone saying that this is a deficiency in the specification. I have found several projects with tons of cvs entries with patch after patch to fix borken ascii85 encoding/decoding algorithms, which strikes me as odd given how simple it is. I have found several implementations on the net that will fail to on this very condition. One I found violates the spec by including n+2 bytes of output. It encodes (this should be simply the word “and” without a link here, but seeing as wordpress sucks shit directly from the ass, I cannot get it to unlink an and between two links, because it f’ing knows better, so here you get a subblog entry ranting about the ass suckage that is wordpress…. ) and decodes its own data just fine, but mine would bork decoding its output since 3 chars of input should mean 2 chars of output according to the spec, not one. How is it that this hasn’t been noticed by anyone? I have to believe its me, not the spec, but I just can’t see where I am f’ing it up. Anyone with a clue, please give me a whack.

Posted in Comp, Java |

6 Responses

  1. Outtascope Says:

    Well, it seems that always adding one to the least significant included base-85 byte during encoding (and carrying where necessary) uniformly solves the problem. I am still unenthused, however, as no matter how I parse that specification I cannot read that act into it. I think the rule works though, hopefully there are no boundary exceptions but I am running a check on it as I write this. 117,440,512 down, 4,177,526,784 to go…..
    I’m sure there is a simple mathematical formula to prove this but by the time I figure that out, the iterative test would be done anyway.

  2. Outtascope Says:

    Heh, that was dumb. I was testing the four byte case unnecessarily. At least I figured it out before it finished :). Results are that it passes a round trip encode and decode, with decoding being the algorithm posted in the spec and encoding being the lsB+1 method above for all 1,2 and 3 ending byte cases.

  3. Outtascope Says:

    Alex Cherepanov over on the Adobe forums points out that Ghostscript handles this during decoding by adding a value of 84 in the next n-tupple spot before decoding, and indeed this does work. The plus one on encoding technically violates the spec as that would not be “in the usual way”. It wouldn’t have killed them to make a statement about the condition in the spec though.

  4. Andy Robb Says:

    I’ve seen this before in my own base85 encoding (using the encoding table from RFC1924 rather than Adobe’s ASCII sub-range).

    Pad the data with 0xff rather than 0×00 and you avoid the integer truncation when dividing by 85 to encode.

  5. rkippen Says:

    I found the same thing. The spec is wrong.

    I tested the suggestion by Andy, padding with FF instead of 0 works.

    The test cases were:
    1. array length = 1, test 256 combinations
    2. array length = 2, test 256^2 combinations
    3. array length = 3, test 256^3 combinations

    for each array {
    String s = encode(array)
    array2 = decode(s)
    compare(array, array2)

  6. rkippen Says:

    I take that back about padding with FF.

    The specification says to pad with 0, and padding with FF will change the output. The only option left is to try to make the decoding account for the truncation.

    > Ghostscript handles this during decoding
    > by adding a value of 84 in the next n-tupple spot

    I believe this is wrong. The value should be 85. During my testing, if the input array is [ 116 ], then the
    string F8 is produced. So let’s decode the F8 string using 84 trick versus 85 trick:

    { (’F’ - 33), (’8′ - 33) } = { (70 - 33), (56 - 33) } = { 37, 23 }

    If we use the 84 trick, then the array becomes: { 37, 23, 84, 0, 0 }

    37*85^4 + 23*85^3 + 84*85^2 + 0 + 0 = 1946154900
    converting to binary: 01110011 11111111 11110111 10010100

    The first 8 bits = 115, which is wrong.

    However, if 85 is used:

    37*85^4 + 23*85^3 + 85*85^2 + 0 + 0 = 1946162125
    converting to binary: 01110100 00000000 00010011 11001101

    The first 8 bits = 116, which is correct.

    Using 85 passed all the tests I did.

Leave a Comment

Please note: Comment moderation is enabled and may delay your comment. There is no need to resubmit your comment.