A rectification to the clarification: what I say below about UTF-16 being
always 16-bit and limited is also nonsense. UTF-16 is variable-length,
and it can cover the entire Unicode character set. It just uses a variable
number of 16-bit words per character, as compared to UTF-8, which uses a
variable number of 8-bit bytes.
I should have checked my sources. Shame on me.
About Java's internal char type being 16 bits wide, though: I have heard
that too, and I'm also curious.
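
For what it's worth, Java's char is indeed 16 bits wide (it holds one
UTF-16 word, not one codepoint), so a codepoint beyond \xFFFF takes two
chars, a "surrogate pair". A minimal sketch that shows it, using
U+1D11E (the musical G clef) as an example:

    public class CharWidth {
        public static void main(String[] args) {
            // U+1D11E does not fit in 16 bits, so Java stores it as a
            // surrogate pair of two 16-bit chars.
            String clef = "\uD834\uDD1E";
            System.out.println(clef.length());                             // 2 (chars)
            System.out.println(clef.codePointCount(0, clef.length()));     // 1 (codepoint)
            System.out.println(Integer.toHexString(clef.codePointAt(0)));  // 1d11e
        }
    }
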
André Warnier wrote:
> Caldarale, Charles R wrote:
>>> From: Christopher Schultz [mailto:***@christopherschultz.net]
>>> Subject: Re: Migrating to tomcat 6 gives formatted currency
>>> amounts problem
>>>
>>> (My understanding is that Unicode (16-bit) is actually not
>>> big enough for everything, but hey, they tried).
>>
>> Point of clarification: Unicode is NOT limited to 16 bits (not even in
>> Java, these days). There are defined code points that use 32 bits,
>> and I don't think there's a limit, if you use the defined extension
>> mechanisms. Again, browsing the Unicode web site is extremely
>> enlightening.
>>
> Further clarification:
> Unicode is not limited to anything. Unicode is (or aims to be) a list
> which assigns to every distinct character known to man a number, from
> 0 to infinity. The particular position number given to a particular
> character in this Unicode list is known as its "Unicode codepoint".
> The Unicode Consortium also tries to do this with some order,
> such as trying to keep together (with consecutive codepoints) various
> groups of characters that are logically related in some way.
> For example (but probably because they had to start somewhere), the
> first 128 codepoints match the original 7-bit US-ASCII alphabet;
> so for instance the "capital letter A", which has code \x41 in US-ASCII,
> happens to have Unicode codepoint \x0041 (both 65 in decimal terms).
> For example also, the same first 128 codepoints, plus the next 128
> codepoints, match the iso-8859-1 alphabet (also known as iso-latin-1);
> thus the character known as "capital letter A with umlaut" (an A with a
> double-dot on top) has the codepoint \x00C4 in Unicode, and the code
> \xC4 in iso-8859-1 (both 196 in decimal).
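>
> A quick way to see this correspondence, sketched in Java (whose char
> values in this range are exactly the Unicode codepoints; the class
> name is just for illustration):
>
>     public class Codepoints {
>         public static void main(String[] args) {
>             System.out.println((int) 'A');      // 65, i.e. \x41, as in US-ASCII
>             System.out.println((int) '\u00C4'); // 196, i.e. \xC4, as in iso-8859-1
>         }
>     }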
>
> New Unicode characters (and codepoints) are being added all the time
> (Klingon was famously proposed, though in the end rejected), but there
> are also holes in the list (presumably left for whenever some forgotten
> related character shows up).
>
> A quite different issue is encoding.
>
> Because it would be quite impractical to specify a series of characters
> just by writing their codepoints one after the other (using whatever
> number of bits each codepoint needs), a series of clever schemes have
> been devised in order to pass Unicode strings around, while being able
> to separate them into characters, and keep each one with its proper
> codepoint.
> Such schemes are known as "Unicode encodings", with names such as
> UTF-7, UTF-8, UTF-16, UTF-32, etc.
> Each one of them specifies an algorithm whereby one can take any Unicode
> character (or rather, its codepoint), and "encode" it into a series of
> bits, in such a way that at the receiving end, an opposite algorithm can
> be used to "decode" that series of bits and retrieve once again the same
> series of Unicode codepoints (or characters).
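>
> In Java, for example, such an encode/decode round trip looks like this
> (a minimal sketch; UTF-16 would do equally well in place of UTF-8):
>
>     import java.io.UnsupportedEncodingException;
>
>     public class RoundTrip {
>         public static void main(String[] args) throws UnsupportedEncodingException {
>             String original = "A\u00C4";                  // two Unicode codepoints
>             byte[] bits = original.getBytes("UTF-8");     // encode: codepoints -> bytes
>             String decoded = new String(bits, "UTF-8");   // decode: bytes -> codepoints
>             System.out.println(original.equals(decoded)); // true
>         }
>     }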
>
> UTF-16, for example, is an encoding of Unicode which always uses 16
> bits for each Unicode codepoint; but it is to my knowledge incomplete,
> because since it uses a fixed number of 16 bits per character, it can
> thus only ever represent no more than the first 65,536 Unicode
> characters. (But we're not there yet, and there is still some leeway).
>
> UTF-8 on the other hand is a variable-length scheme, using 1, 2, 3, or
> more 8-bit bytes to represent each Unicode codepoint. And it is in
> principle not limited: the original design already foresees sequences
> of up to 6 bytes, for whenever the need arises (imagine that some
> aliens suddenly show up, and that they happen to write in 167 different
> languages and alphabets).
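>
> The variable lengths are easy to see in Java (a minimal sketch; the
> last example, U+1D11E, is the musical G clef):
>
>     public class Utf8Lengths {
>         public static void main(String[] args) throws Exception {
>             System.out.println("A".getBytes("UTF-8").length);            // 1 byte
>             System.out.println("\u00C4".getBytes("UTF-8").length);       // 2 bytes
>             System.out.println("\u20AC".getBytes("UTF-8").length);       // 3 bytes (euro sign)
>             System.out.println("\uD834\uDD1E".getBytes("UTF-8").length); // 4 bytes
>         }
>     }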
>
> One frequent misconception is that in UTF-8, the first 256 "character
> encoding bit sequences" match the iso-8859-1 codepoints.
> Only the first 128 characters of iso-8859-1 (which happen to match the
> 128 characters of US-ASCII and the first 128 Unicode codepoints), have a
> single-byte representation in UTF-8 which happens to match their Unicode
> codepoint. The next 128 iso-8859-1 characters (which contain the
> capital A with umlaut) require 2 bytes each in the UTF-8 encoding.
> Thus for instance, the "capital letter A with umlaut" has the Unicode
> codepoint \x00C4 (196 decimal), because it is the 197th character in the
> Unicode list (the first one being \x0000). It also happens to have the
> code \xC4 (196 decimal) in the iso-8859-1 table.
> But in UTF-8, it is encoded as the two bytes \xC3\x84, which is not the
> decimal number 196 in any way.
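>
> Easy to verify, e.g. in Java (a minimal sketch):
>
>     public class LatinVsUtf8 {
>         public static void main(String[] args) throws Exception {
>             String a = "\u00C4"; // capital letter A with umlaut
>             for (byte b : a.getBytes("ISO-8859-1")) System.out.printf("%02X ", b & 0xFF); // C4
>             System.out.println();
>             for (byte b : a.getBytes("UTF-8")) System.out.printf("%02X ", b & 0xFF);      // C3 84
>             System.out.println();
>             // Decoding the UTF-8 bytes with the wrong charset yields garbage:
>             // \xC3 becomes "A with tilde" and \x84 an unprintable control char.
>             System.out.println(new String(a.getBytes("UTF-8"), "ISO-8859-1"));
>         }
>     }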
>
>
> All of that to say that when some people on this list say things like
> "you should always decode your URLs as if they were Unicode (or UTF-8),
> because it is the same as ASCII or iso-latin-1 anyway", they are talking
> nonsense. The only time you can do that is when the server and all the
> clients have agreed in advance that this is how they were going to
> encode and decode URLs.
> (That we developers wish it were so, and that ultimately we may get
> there, is another matter.)
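>
> For instance, in Java, the same percent-encoded bytes come out as
> different characters depending on which encoding the decoder assumes
> (a minimal sketch):
>
>     import java.net.URLDecoder;
>
>     public class UrlCharsets {
>         public static void main(String[] args) throws Exception {
>             // %C3%84 is "capital A with umlaut" if both sides agreed on UTF-8 ...
>             System.out.println(URLDecoder.decode("%C3%84", "UTF-8"));
>             // ... but two different characters if the server assumes iso-8859-1.
>             System.out.println(URLDecoder.decode("%C3%84", "ISO-8859-1"));
>         }
>     }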
>
> It is also nonsense to say that you should by default consider
> html pages as UTF-8 encoded. The default character set (and encoding,
> because in that case both are the same) for html is iso-8859-1, and
> anything else (including UTF-8 or UTF-16) is non-default.
> (see http://www.ietf.org/rfc/rfc2854.txt, section 6).
> (So if you do output something else, you *must* say so).
> (And hope that IE doesn't second-guess you).
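>
> In a servlet, for example, saying so looks like this (a minimal
> sketch; setContentType must be called before getWriter()):
>
>     import java.io.IOException;
>     import java.io.PrintWriter;
>     import javax.servlet.http.HttpServlet;
>     import javax.servlet.http.HttpServletRequest;
>     import javax.servlet.http.HttpServletResponse;
>
>     public class Utf8PageServlet extends HttpServlet {
>         protected void doGet(HttpServletRequest req, HttpServletResponse resp)
>                 throws IOException {
>             // Declare the charset explicitly, instead of relying on the
>             // iso-8859-1 default.
>             resp.setContentType("text/html; charset=UTF-8");
>             PrintWriter out = resp.getWriter();
>             out.println("<html><body>\u00C4</body></html>");
>         }
>     }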
>
> We probably owe that to Tim Berners-Lee; with all due respect and
> admiration for the guy, it may be an unfortunate historical accident
> that he was born in England and worked in Switzerland (both countries
> quite happy with iso-8859-1), rather than being, say, a Chinese
> national working in Greece, who might have preferred Unicode and
> UTF-8. But hey, he invented it, so he got to choose.
>
> Anyway for the time being we all have to live with it.
> Even the Tomcat guys.
>
---------------------------------------------------------------------
To start a new topic, e-mail: ***@tomcat.apache.org
To unsubscribe, e-mail: users-***@tomcat.apache.org
For additional commands, e-mail: users-***@tomcat.apache.org