CESU-8
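CESU-8 (Compatibility Encoding Scheme for UTF-16: 8-Bit) is a variant of UTF-8 described in Unicode Technical Report #26. Code points in the Basic Multilingual Plane are encoded exactly as in UTF-8. A supplementary code point (above U+FFFF) is first represented as a UTF-16 surrogate pair, and each surrogate is then encoded as a three-byte UTF-8-style sequence, giving six bytes where UTF-8 would use four. CESU-8 is similar to Java's Modified UTF-8 but lacks the special two-byte encoding of the NUL character (U+0000). Like Modified UTF-8, it can be decoded one UTF-16 code unit at a time. Because NUL receives no special treatment, it is encoded as a single zero byte, so the encoded string is not safe for NUL-terminated string handling if the original string contained NUL characters.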
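The two-step encoding described above can be sketched in Python. This is a minimal illustration, not a reference implementation: the function name `cesu8_encode` is hypothetical, and it relies on Python's `surrogatepass` error handler to emit the three-byte form of each surrogate.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode text as CESU-8 (illustrative sketch, hypothetical helper)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            # Step 1: split the supplementary code point into a UTF-16 surrogate pair.
            cp -= 0x10000
            high = 0xD800 | (cp >> 10)
            low = 0xDC00 | (cp & 0x3FF)
            # Step 2: encode each surrogate as a three-byte UTF-8-style sequence.
            for unit in (high, low):
                out += chr(unit).encode("utf-8", "surrogatepass")
        else:
            # BMP code points are encoded exactly as in UTF-8.
            out += ch.encode("utf-8", "surrogatepass")
    return bytes(out)


# U+10400 is the surrogate pair D801 DC00 in UTF-16.
print(cesu8_encode("\U00010400").hex(" "))    # ed a0 81 ed b0 80
print("\U00010400".encode("utf-8").hex(" "))  # f0 90 90 80
# NUL stays a single zero byte, unlike in Modified UTF-8.
print(cesu8_encode("a\x00b"))                 # b'a\x00b'
```

For text confined to the Basic Multilingual Plane the output is byte-for-byte identical to UTF-8; the two encodings differ only for supplementary characters.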
In practice, CESU-8 is often used to communicate with the Oracle database software, which in modern configurations reportedly uses UTF-16 as its internal character representation. Oracle's "UTF-8" codec (actually CESU-8) rejects well-formed four-byte UTF-8 sequences for characters outside the Basic Multilingual Plane, but accepts and generates sequences that are invalid in UTF-8: the three-byte encodings of code points in the surrogate range (U+D800 to U+DFFF), as CESU-8 specifies.
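The accept/reject behaviour described above can be imitated with a small validity check. The sketch below is an assumption-laden illustration, not Oracle's actual codec: the name `is_strict_cesu8` is hypothetical, and the check rejects any four-byte UTF-8 sequence while accepting properly paired three-byte surrogate encodings.

```python
def is_strict_cesu8(data: bytes) -> bool:
    """Return True if data is well-formed CESU-8 (hypothetical helper)."""
    # Lead bytes 0xF0 and above start four-byte UTF-8 sequences (or are
    # invalid outright); CESU-8 never uses them.
    if any(b >= 0xF0 for b in data):
        return False
    try:
        # "surrogatepass" lets the three-byte surrogate encodings through.
        units = data.decode("utf-8", "surrogatepass")
    except UnicodeDecodeError:
        return False
    try:
        # Round-trip through UTF-16 to verify that surrogates are paired.
        units.encode("utf-16-le", "surrogatepass").decode("utf-16-le")
    except UnicodeDecodeError:
        return False
    return True


print(is_strict_cesu8(bytes.fromhex("eda081edb080")))  # True: CESU-8 form of U+10400
print(is_strict_cesu8(bytes.fromhex("f0909080")))      # False: proper UTF-8 form of U+10400
```

A strict UTF-8 decoder makes the opposite decision on both inputs, which is why data exchanged between the two encodings must be converted explicitly rather than passed through.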