Frequently Asked Questions

 

“I could not have said it better with a 10-foot pole.”

--Unknown

 

 

1. There is already an existing framework for approving characters considered appropriate for worldwide information interchange. How can Bytext possibly compete with this?

 

2. What about the new Bytext characters that have no UCS equivalent?

 

3. Isn’t it silly to encode emoticons?

 

4. How can Bytext ever gain wide acceptance without ASCII transparency?

 

5. How can Bytext be a serious competitor to Unicode when it is at an embryonic stage of development?

 

6. Why did you create Bytext?

 

7. I’m a developer and Unicode already suits my needs. Why should I care about Bytext?

 

8. Are there any security advantages in Bytext?

 

9. How can you use 8 bit regular expressions with Bytext? If it is a multi-byte encoding then shouldn’t it have the same problems that UTF-8 or EUC have when faced with byte-only utilities?

 

10. I can’t seem to understand Bytext. Isn’t Unicode simpler?

 

11. Could you give an example of how a word would be encoded in Bytext?

 

12. Is Bytext bidirectionality compatible with Unicode bidirectionality?

 

13. How is using bidirectionality in Bytext easier than in Unicode?

 

 

1. There is already an existing framework for approving characters considered appropriate for worldwide information interchange. How can Bytext possibly compete with this?

 

Bytext is not a framework for approving characters; it is a framework for encoding characters. The various international organizations tasked with approving characters for the UCS (Universal Character Set) can go about their business as normal. Bytext merely reassigns the scalar values and other encoding details to simplify how the characters are processed. The intended nature of each UCS character is preserved. Bytext is essentially a serialization of Unicode Normalization Form C.

 


 

 

2. What about the new Bytext characters that have no UCS equivalent?

 

These will in fact have a UCS equivalent: they will map to unique sequences of private characters, because that is exactly what they are in a UCS context. Private characters are useful for establishing that a unit of language is used as a character in plain text. Once such units are widely used as characters in plain text, they must be encoded in the UCS so long as they do not violate UCS principles (as corporate symbols would, for example), and they do not.

 

The vast majority of the new characters in Bytext can be considered to be pattern-defined transformations of existing characters, including ignorables and CO subtypes. Most of these transformations are only useful in an integrated format like Bytext, but can be expressed in the UCS by private “transforming” characters. Even these private characters would not inherently violate UCS principles; similar functionality is well illustrated by characters such as variation selectors, combining characters, KHMER SIGN COENG, etc. The few remaining new characters (such as graphical emoticons), once available in a font, can be considered to be on a path for acceptance into the UCS.

 


 

 

3. Isn’t it silly to encode emoticons?

 

Graphical emoticons are already in the UCS, so in principle this is a silly question. The UCS is for use by ordinary people, including children, not just by uptight academics and politically correct bureaucrats. Emoticons are useful to encode so one can paste a conversation from a text based messaging system into a plain text editor (and vice versa) without losing information.

 


 

 

4. How can Bytext ever gain wide acceptance without ASCII transparency?

 

First, to define ASCII transparency: a charset with ASCII transparency encodes ASCII characters with the same binary pattern they have in ASCII. ASCII is a 7 bit code, so the 8th bit can be used to extend the charset beyond ASCII. UTF-8 is the transformation format of Unicode that exhibits ASCII transparency.
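As a small illustration of this definition, the following Python sketch compares the bytes produced by three standard codecs for a pure ASCII string. The sample string and the helper name `is_ascii_transparent` are arbitrary choices made for this example; nothing here is part of Bytext.

```python
# Compare how three encodings represent a pure-ASCII string.
sample = "Bytext FAQ"

ascii_bytes = sample.encode("ascii")
utf8_bytes = sample.encode("utf-8")
utf16_bytes = sample.encode("utf-16-be")  # big-endian, no byte order mark


def is_ascii_transparent(encoded: bytes, reference: bytes) -> bool:
    """True when the encoded form is byte-for-byte identical to the ASCII form."""
    return encoded == reference


# UTF-8 keeps the ASCII byte patterns; UTF-16 spends two bytes per character.
print("UTF-8 :", is_ascii_transparent(utf8_bytes, ascii_bytes))
print("UTF-16:", is_ascii_transparent(utf16_bytes, ascii_bytes))
```

A byte-oriented tool that assumes ASCII can read the UTF-8 form unchanged, which is the practical meaning of transparency in this answer.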

 

ASCII transparency is different from ASCII compatibility. Most Unicode formats do not have ASCII transparency. In fact, technically speaking, it is Bytext, and no form of Unicode, that has ASCII compatibility. Whereas Bytext has rearranged ASCII but retained the original meaning of each character, Unicode has changed the meanings of 8 very common ASCII characters as a result of mirroring (see question 13). In particular, the Unicode meanings of these 8 characters (and others) are not true to their names.

 

ASCII transparency is useful for internal purposes when the name of a charset is not explicitly identified with each instance of text, such as in file systems and on the command line. In the vast majority of such cases the default charset is ASCII; notable exceptions are EBCDIC-based systems.

 

Bytext is designed for information interchange, in cases where the charset is explicitly defined. It is easy to imagine a rearranged form of Bytext with ASCII transparency; call it Rearranged For ASCII Transparency, RFAT. The mnemonic: Americans...  ;-)   Such a format is a potentially useful compromise but was not chosen for the design of Bytext because it would permanently break the consistent design and make things more difficult for users, whereas the benefits only temporarily make things easier for the developer. In a true object oriented environment where text strings are objects with a defined charset, converting ASCII to a charset that is a superset of ASCII (like Bytext) will be trivial and transparent to the user. Keeping track of data along with its properties is what’s known as type safety and is widely regarded as good programming practice. More importantly, it is necessary for network communication to eliminate ambiguity.

 

By focusing on the linguistic needs of the user rather than the temporary needs of the developer, and by focusing on the goal of information interchange rather than on what is appropriate for internal formats, Bytext is a singular design framework that is built to last.

 


 

 

5. How can Bytext be a serious competitor to Unicode when it is at an embryonic stage of development?

 

Because most of the benefits of Bytext can be proven theoretically. Compare this question to the question: At what point did object oriented programming become a serious competitor to procedural programming? To those concerned with the theoretical aspects of computer programming and those forward thinking enough to predict the needs of computer users, it was a serious competitor the moment it was conceived, before there was even an implementation ready to exploit it. To less forward thinking individuals, it was only a serious competitor once there were enough people that could be quoted as saying that it is a serious competitor. Reactionaries always need to have things forced down their throat before they become palatable.

 

Likewise, to those concerned with the theoretical aspects of processing text, Bytext as a technical idea and a format should most definitely be regarded as a serious competitor to the technical ideas that Unicode is based on and the formats that Unicode provides.

 

Also note the answer to question 1: Bytext makes use of what would be private characters in a UCS context, but does not fundamentally compete with the groups tasked with building a consensus on what characters are appropriate for plain text.

 


 

 

6. Why did you create Bytext?

 

While learning Unicode, I (Bernard Miller) began to think of character encoding as a science, and the various possible ways that text can be encoded became interesting to me. I realized that different encoding methods had many non trivial consequences for how text is processed and used as language. Figuring out all the idiosyncrasies of different writing systems and the various ways they can be encoded became an intellectual challenge, a puzzle that I found enjoyable. 

 

The whole notion of getting it all right the first time seemed incredibly unlikely, and imposing this design on the whole world for all time seemed inherently distasteful. The popularity of fully object oriented design seemed to indicate the inevitability of recording a charset for each text object. This, combined with consideration for the way languages and scripts evolve seemed to indicate that a fundamentally new charset for worldwide information interchange was not only possible but inevitable. Even when text is not a locally defined object in internal systems, it is an appropriate requirement for text meant for interchange between disparate systems.

 


 

 

7. I’m a developer and Unicode already suits my needs. Why should I care about Bytext?

 

Because your users may want the features of Bytext in their text editors. Text processing software capable of processing Bytext will be worth more than software that cannot. It is easier and faster to process text in the Bytext format, especially for Brahmic scripts like Devanagari. Current string search algorithms for Unicode (nobody has disputed this yet) still cannot do simple things (that is, simple in Bytext) like automatically search for all the variants of a Brahmic syllable.

 

The numerous benefits of Bytext can be appreciated by a broad spectrum of users, from the beginner who likes the faster, more intuitive searching capabilities to the advanced user who may come to depend on the various other features in Bytext such as ignorables and OBS characters. Fast regular expression matching is a feature that is unlikely to diminish in practicality over time. Many users will also appreciate how entire directories of documents with markup can be searched for useful information without parsing and without forcing the directory into yet another format that needs to be maintained.

 


 

 

8. Are there any security advantages in Bytext?

 

If you can accept the notion that you are more secure using simple things that are easier to understand, then yes, Bytext has many security advantages. Bidirectionality in Bytext is not only more consistent and a better match for user expectations, but is also much simpler to use. This is discussed further, with examples, in question 13. Simple things tend to be more reliable and easier to maintain.

 

Also, Bytext character properties make it easy for a display component to have a unique display for each text string, which makes it easy for users to verify that the screen display is faithful to the encoding. This helps reduce spoofing, which is a non trivial security concern. Unicode adopts more of a “roll your own” approach toward preventing spoofing. In particular, Unicode does not provide a full set of what Unicode calls “control pictures”. Unicode also does not have an equivalent to the “fallback glyphs” property of Bytext which can be used in several practical ways by implementations, including for the purpose of preventing spoofing. Other security concerns with Unicode are discussed here: http://www.counterpane.com/crypto-gram-0007.html#9

 


 

 

9. How can you use 8 bit regular expressions with Bytext? If it is a multi-byte encoding then shouldn’t it have the same problems that UTF-8 or EUC have when faced with byte-only utilities?

 

Bytext is a variable length encoding. Basically the sign bit is used to determine character boundaries and the other 7 bits of each byte are used to determine the scalar character value. 8 bit regexes (regular expressions) can be used because of the way the scalar values are organized. The whole scalar value is not always needed, because the characters that would be represented if only some of the bytes of a character were read are all semantically related. For example, the first byte of a character may represent “lowercase A”. The second byte, inclusive, may represent “uppercase A”. The last byte, inclusive (what the character actually is), may represent “uppercase A with ring above”. If you're looking for any old variation on the theme of “letter A” (including Greek and Cyrillic versions), you just compose a regex that only matches the first byte of each character for a scalar value that matches “lowercase A”. If you want to match only a specific multi byte character, you search for that character the same way you would search for a multi byte word. Despite some complications, it also works with whole syllables from scripts such as Devanagari.
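The matching idea described here can be sketched with Python's byte-oriented `re` module. This is an illustration under stated assumptions, not a Bytext implementation: it assumes a byte below 0x80 starts a character and bytes from 0x80 up continue it (the direction suggested by the “Gehört” example in question 11), and it reuses the byte values from that example. The helper names `family_pattern` and `exact_pattern` are hypothetical.

```python
import re

# "Gehört" as given in question 11: B016-129 B014 B017 B024-131 B027 B029
gehoert = bytes([16, 129, 14, 17, 24, 131, 27, 29])

CONTINUATION = rb"[\x80-\xff]"  # assumption: high bit set means "continues a character"


def family_pattern(first_byte: int) -> re.Pattern:
    """Match every character that starts with first_byte, whatever bytes follow."""
    return re.compile(re.escape(bytes([first_byte])) + CONTINUATION + rb"*")


def exact_pattern(char_bytes: bytes) -> re.Pattern:
    """Match one specific character, rejecting longer characters that share its prefix."""
    return re.compile(re.escape(char_bytes) + rb"(?!" + CONTINUATION + rb")")


# Any variation on the character family whose first byte is 016.
print(family_pattern(16).findall(gehoert))

# Only the exact two-byte character 016-129.
print(exact_pattern(bytes([16, 129])).search(gehoert))

# The one-byte character 016 on its own does not occur in this word.
print(exact_pattern(bytes([16])).search(gehoert))
```

Because a start byte can never appear in the middle of a character under this assumption, the family pattern cannot produce a false match that straddles two characters.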

 


 

 

10. I can’t seem to understand Bytext. Isn’t Unicode simpler?

 

Unicode appears simpler on the surface (there are fewer characters, for example), but once all the text processing details are accounted for, Bytext is MUCH simpler. The way characters are organized in Bytext is well defined and sometimes eliminates the need for entire character properties. Some properties can be defined solely in terms of how characters with the properties are organized. All Bytext character properties are modeled by a single table. The bidi algorithm and the line breaking algorithm are vastly simplified. The titlecase property is eliminated while retaining its functionality. East Asian width properties are described in Unicode with an entire technical report containing 6 properties. In Bytext, there are no properties required in the database for East Asian width, and a functionally equivalent description only requires a single paragraph.

 

Bytext can be thought of as an exercise in massive precomposition, an attempt to eliminate the need for combining characters, spelling conventions, and grapheme clusters. Precomposition is the spirit of the W3C Character Model; Bytext simply takes this to its logical conclusion. It simplifies many text processes, especially for syllable oriented scripts like Devanagari. It may seem to involve too many characters, but it is a finite number and thus considerably less than the infinite number of abstract characters in Unicode. Also, there is a logic to the way the characters are formed with bytes that makes it easy to process algorithmically; it’s not just a huge list of characters.

 


 

 

11. Could you give an example of how a word would be encoded in Bytext?

 

Sure, take the word “Gehört”. The character code values for this simple example are listed in the “Ordered lists” section of the documentation. Each byte is represented by a 3 digit decimal sequence, separated from the other bytes in the same character by hyphens. The starting byte of each character is preceded by a “B”. This notation is called Bytext decimal notation; characters are separated by a space for readability:

 

B016-129 B014 B017 B024-131 B027 B029
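To make the notation concrete, here is a minimal Python sketch that parses a string in Bytext decimal notation into per-character byte strings and then recovers the character boundaries from the raw byte stream. The function names are hypothetical, and the boundary rule (a byte below 0x80 starts a character) is an assumption read off the example above, where only the bytes after a hyphen are 128 or greater.

```python
def parse_bytext_decimal(notation: str) -> list[bytes]:
    """Turn 'B016-129 B014 ...' into one bytes object per character."""
    characters = []
    for token in notation.split():
        if not token.startswith("B"):
            raise ValueError(f"character token must start with 'B': {token!r}")
        values = [int(part) for part in token[1:].split("-")]
        characters.append(bytes(values))
    return characters


def split_characters(stream: bytes) -> list[bytes]:
    """Split a raw byte stream into characters.

    Assumption: bytes below 0x80 start a character, bytes from 0x80 up continue it.
    """
    characters = []
    for byte in stream:
        if byte < 0x80 or not characters:
            characters.append(bytes([byte]))
        else:
            characters[-1] += bytes([byte])
    return characters


word = parse_bytext_decimal("B016-129 B014 B017 B024-131 B027 B029")
stream = b"".join(word)

print(len(word), "characters in", len(stream), "bytes")
print(split_characters(stream) == word)
```

The six letters of “Gehört” occupy eight bytes here because the “G” and the “ö” each take a second byte.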

 


 

 

12. Is Bytext bidirectionality compatible with Unicode bidirectionality?

 

Unicode text transcoded into Bytext will have the same bidirectional display in a Bytext application using the Bytext bidirectionality algorithm. Likewise, text created in Bytext using the Bytext formatting characters and bidirectionality algorithm, then transcoded into Unicode, will have the same bidirectional display in a Unicode application. Bidirectional Unicode text will be exactly preserved during a round trip conversion from Unicode to Bytext and back to Unicode; likewise, bidirectional Bytext text will be exactly preserved during a round trip conversion from Bytext to Unicode and back to Bytext. Further, an input method in a Bytext application can optionally mimic the effect of using Unicode bidirectional formatting codes while actually inserting only Bytext bidirectional formatting codes.

 

For applications that do not implement their respective bidirectional algorithms, the display is different. The Bytext bidirectional formatting characters, and ordinary right to left characters such as Arabic and Hebrew characters, have properties that default to being displayed visibly as left to right characters by Bytext applications that do not implement the Bytext bidirectionality algorithm. This way, bidirectional text in a Bytext plain text editor that does not recognize stateful codes can still be read and edited unambiguously. Instead of right to left text appearing invisible and thus subject to being mangled by simple editing, a user is unambiguously visually aware of how to properly edit the text (even if the right to left text cannot be easily deciphered). Bytext is thus a more robust solution for disparate text processing systems.

 

 

13. How is using bidirectionality in Bytext easier than in Unicode?

 

Only 2 codes are required to format bidirectional text in Bytext, compared with 7 codes using Unicode. This makes it easier to learn and it frees up keyboard real estate (or its analogue in whatever input method is used).

 

To use ANY right-to-left character with confidence in Unicode requires a thorough understanding of the Unicode bidirectional algorithm. Also, in order for an application to avoid using the Unicode bidirectional algorithm, it would need to avoid displaying ANY right-to-left character. It's not enough to avoid using the Unicode bidirectional formatting characters; the application would need to avoid displaying most Arabic and Hebrew characters altogether.

 

The Unicode bidirectional algorithm allows multiple ways to encode text that results in the same embedding levels (and thus appears the same in terms of directionality). There are 4 different stateful directionality types in the Unicode algorithm. There are also 2 more characters that do not rely on state but can affect directionality based on the character properties of surrounding characters --something you must be thoroughly aware of when using these characters. 
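For reference, the seven Unicode codes counted in this answer can be listed with Python's standard `unicodedata` module. This sketch only prints their names and bidirectional classes; the grouping comments follow the description above, and the list name is an arbitrary choice for the example.

```python
import unicodedata

BIDI_FORMATTING_CODES = [
    "\u202A",  # stateful: embedding
    "\u202B",  # stateful: embedding
    "\u202D",  # stateful: override
    "\u202E",  # stateful: override
    "\u202C",  # terminates the most recent embedding or override
    "\u200E",  # stateless mark
    "\u200F",  # stateless mark
]

for ch in BIDI_FORMATTING_CODES:
    bidi_class = unicodedata.bidirectional(ch)
    print(f"U+{ord(ch):04X}  {bidi_class:<4} {unicodedata.name(ch)}")
```

The two marks report the ordinary strong classes `L` and `R`, which is why they influence layout only through the properties of the characters around them.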

 

An important Unicode bidirectionality formatting character, “PDF”, is overloaded in Unicode such that, without context analysis, Unicode bidirectionality codes cannot reliably be replaced by markup codes that implement only a subset of the Unicode bidirectional functionality. Special characters are available in Bytext that can be reliably used concurrently with Unicode controls (which are preserved), and as stand-ins for simplified markup codes. This means the Bytext codes can be replaced by markup codes without context analysis.

 

In the Unicode bidi algorithm, an “embedding layer” is a complex beast that will change directionality based on character properties and language-specific conventions (which violates the principle that Unicode encodes characters not languages). In UAX-9, it even says “A list of numbers separated by neutrals and embedded in a directional run will come out in the run's order.” Is it a mistake that the author uses the term “number” despite the fact that what constitutes a number in a sequence depends on language? No. The algorithm uses language specific conventions for determining how a number vs a list of numbers should be formatted. UAX-9 gives an example of this feature at work, letters with right to left directionality are capitalized: 

 

storage: he said "<RLE>THE VALUES ARE 123, 456, 789, OK<PDF>".

display: he said "KO ,789, 456, 123 ERA SEULAV EHT".

 

The above is an example of how the algorithm tries to be slick and read your mind as to the embedding level you want despite not explicitly asking for it. It is the kind of thing you might expect from an automatic spell checker but not from plain text formatting. When typing a comma after a decimal number in such an embedding layer, it appears to the right of the number; if you press space next, then both the comma and the space jump to the left of the number. Intuitive? Consider the following examples of where someone not thoroughly familiar with the algorithm might run into serious problems:

 

Problems when using alternative conventions for spelling a list of numbers:

 

storage: he said "<RLE>THE HEX VALUES ARE A23, B56, C89, OK<PDF>".

display: he said "KO ,A23, B56, C89 ERA SEULAV XEH EHT".

 

storage: he said "<RLE>THE COMMA SEPARATED VALUES ARE 123,456,789, OK<PDF>".

display: he said "KO ,123,456,789 ERA SEULAV DETARAPES AMMOC EHT".

 

storage: he said "<RLE>THE COMMA SEPARATED VALUES ARE 123,456, 789, OK<PDF>".

display: he said "KO ,789, 123,456 ERA SEULAV DETARAPES AMMOC EHT".
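The different outcomes in these examples come from the bidirectional classes of the individual characters, which can be inspected with Python's standard `unicodedata` module. The sketch below is only an inspection aid: it does not run the bidirectional algorithm, and it uses HEBREW LETTER ALEF (U+05D0) as a stand-in for the capitalized right-to-left letters of the examples. In UAX-9, a single common separator between two European numbers is treated as part of the number, which is why “123,456” stays together while “123, 456” does not.

```python
import unicodedata


def bidi_classes(text: str) -> list[str]:
    """Return the Unicode bidirectional class of each character in text."""
    return [unicodedata.bidirectional(ch) for ch in text]


# Digits are EN (European number), the comma is CS (common separator).
print(bidi_classes("123,456"))

# Adding a space (WS) after the comma breaks up the number sequence.
print(bidi_classes("123, 456"))

# A right-to-left letter followed by digits is not a number at all,
# which is the situation in the "hex values" example.
print(bidi_classes("\u05D023"))
```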

 

Bytext allows language specific conventions for formatting bidirectional text to be implemented by an input method. This is a more flexible, long term solution than the Unicode approach of “hard coding” spelling conventions (such as how to format a list of numbers) into the display method. These spelling conventions also violate the principle that Unicode encodes characters, not languages.

 

The following example shows how Unicode mirroring will change the display and interpretation of text even though no formatting codes are used. Mirroring is positively counterintuitive to the way most people would expect a keyboard or typewriter to work. If no mirroring is desired, users have no choice but to remember which of 66 characters are subject to mirroring and which are not, then specially encode them. Some character glyphs are mirrored in two ways: vertically and horizontally. Which ones? You'll just have to know, because 32 characters subject to mirroring “have no appropriate mirroring character”, meaning they have no authoritative sample glyph. Because of mirroring, there is in fact no “left parenthesis” character in Unicode, despite there being a character with that name in both Unicode and ASCII, because Unicode has redefined the classic ASCII character to mean “opening parenthesis”. Messing with the meaning of ASCII characters is extremely problematic, especially when the name is not changed: 

 

storage: THIS IS A LEFT PARENTHESIS: "("

display: ")" :SISEHTNERAP TFEL A SI SIHT

 

Mirroring in the Unicode bidi algorithm is a misguided attempt to equate logical order with logical meaning, as if a script transformation can accomplish a language transformation. The following example demonstrates how this assumption is fundamentally flawed. In this example, 3 characters |/¯ are used to represent the SQUARE ROOT character (U+221A); and 3 characters ¯\| are used to represent the mirrored form of SQUARE ROOT:

 

storage: <RLE>THIS IS AN IMPORTANT NUMBER 4|/¯(5)<PDF>

display: (5)¯\|4 REBMUN TNATROPMI NA SI SIHT

 

storage: <RLE>THIS IS AN IMPORTANT NUMBER 4|/¯5<PDF>

display: 4¯\|5 REBMUN TNATROPMI NA SI SIHT
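The mirroring property behind these examples can be checked with Python's standard `unicodedata` module. This sketch lists the ASCII characters that carry the property and shows that their names still say “left” and “right”. It reports whatever Unicode version ships with the Python interpreter, so totals for the full character set may differ from the counts quoted above.

```python
import unicodedata

# ASCII characters that Unicode marks as mirrored in right-to-left text.
mirrored_ascii = [chr(cp) for cp in range(128) if unicodedata.mirrored(chr(cp))]

for ch in mirrored_ascii:
    print(ch, unicodedata.name(ch))

# SQUARE ROOT (U+221A) is mirrored as well, as in the examples above.
print(unicodedata.name("\u221A"), bool(unicodedata.mirrored("\u221A")))
```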

 

When spelling embedded quotations in Unicode, the proper mirroring of each quotation mark is vulnerable to being mangled by the text preceding or following the quotation:

 

storage: HE SAID -in english- <LRE>hello, the weather is nice! 15 hours of sun<PDF>.

display: .”in english- “hello, the weather is nice! 15 hours of sun- DIAS EH

 

The Bytext way of encoding bidirectionality does not have these problems. It is much simpler yet it can encode the same embedding levels and thus has the same display capabilities. It effectively eliminates multiple encodings that achieve the same embedding levels, so like everything else in Bytext it is more regular expression friendly. 

 


 

 

 


 

Copyright © 2002 Bernard Rafael Miller. All rights reserved.

info@bytext.org