Discussion:
Incorrect decoding of encoded HTTP headers
Jean Pierre Urkens
2018-10-03 09:22:02 UTC
Hi everybody,



I am having an issue where Unicode characters (e.g. ł, U+0142, and ą, U+0105) are
passed by the Apache Webserver 2.4 to Tomcat as UTF-8 encoded bytes while
Tomcat seems to evaluate them as ISO-8859-15 encoded.



Having taken a network trace with TCPDUMP I see the following bytes for my
header field (truncated the output after byte ‘72’):

0200 0a 48 54 54 50 5f 56 6f 6f 72 6e 61 61 6d 3a 20  .HTTP_Voornaam:
0210 4d 61 c5 82 67 6f 72                             MaÅ.gor



Here the bytes C5 82 are the UTF-8 encoding of the Unicode character
ł (U+0142).



Now when inspecting the header value in Tomcat using:

String headerValue = request.getHeader("HTTP_Voornaam");



I’m getting the value ‘MaÅ.gor’ which seems to be using the ISO-8859-15
representation for the bytes C5 82. The byte string from the TCPDUMP seems to
match the result of headerValue.getBytes(Charset.forName("ISO-8859-15"))
and not the result of headerValue.getBytes(Charset.forName("UTF-8")).
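
To make the observation above concrete, here is a small sketch that decodes the traced bytes both ways. It assumes the container maps each header byte to one char as in ISO-8859-1 (ISO-8859-15 agrees with ISO-8859-1 for the bytes C5 and 82); the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

final class HeaderBytesDemo {
    // Header value bytes from the trace: "Ma", C5 82, "gor".
    static final byte[] WIRE_BYTES =
            {0x4d, 0x61, (byte) 0xc5, (byte) 0x82, 0x67, 0x6f, 0x72};

    // One char per byte: C5 becomes U+00C5, 82 becomes the control char U+0082.
    static String asLatin1() {
        return new String(WIRE_BYTES, StandardCharsets.ISO_8859_1);
    }

    // C5 82 is read as a single two-byte sequence: U+0142.
    static String asUtf8() {
        return new String(WIRE_BYTES, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(asLatin1().length());                        // 7
        System.out.println(asUtf8().length());                          // 6
        System.out.println(Integer.toHexString(asUtf8().codePointAt(2))); // 142
    }
}
```

The seven-char string is what getHeader() appears to return in this setup; the six-char string is the value that was intended.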



The FAQ (https://wiki.apache.org/tomcat/FAQ/CharacterEncoding) indicates
that ‘headers are always in US-ASCII encoding. Anything outside of that
needs to be encoded’, in this case it seems to be UTF-8 encoded.

The headers are evaluated by a servlet 2.5 web application which has defined
a ‘CharacterEncodingFilter’ as first filter performing the following
actions:

request.setCharacterEncoding("UTF-8");
response.setContentType("text/html; charset=UTF-8");
response.setCharacterEncoding("UTF-8");
filterChain.doFilter(request, response);



Is there a way to tell Tomcat to decode the headers as being UTF-8 encoded
bytes?



I am using Tomcat-version 8.5.32.



Thanks for your support,



J.P.
Mark Thomas
2018-10-03 09:47:00 UTC
Post by Jean Pierre Urkens
Hi everybody,
I am having an issue where Unicode characters (e.g. ł, U+0142, and ą, U+0105) are
passed by the Apache Webserver 2.4 to Tomcat as UTF-8 encoded bytes while
Tomcat seems to evaluate them as ISO-8859-15 encoded.
Having taken a network trace with TCPDUMP I see the following bytes for my
0210 4d 61 c5 82 67 6f 72  MaÅ.gor
Here the bytes C5 82 are the UTF-8 encoding of the Unicode character
ł (U+0142).
String headerValue = request.getHeader("HTTP_Voornaam");
I’m getting the value ‘MaÅ.gor’ which seems to be using the ISO-8859-15
representation for the bytes C5 82. The byte string from the TCPDUMP seems to
match the result of headerValue.getBytes(Charset.forName("ISO-8859-15"))
and not the result of headerValue.getBytes(Charset.forName("UTF-8")).
The FAQ (https://wiki.apache.org/tomcat/FAQ/CharacterEncoding) indicates
that ‘headers are always in US-ASCII encoding. Anything outside of that
needs to be encoded’, in this case it seems to be UTF-8 encoded.
From the HTTP spec:

<quote>
Historically, HTTP has allowed field content with text in the
ISO-8859-1 charset [ISO-8859-1], supporting other charsets only
through use of [RFC2047] encoding. In practice, most HTTP header
field values use only a subset of the US-ASCII charset [USASCII].
Newly defined header fields SHOULD limit their field values to
US-ASCII octets. A recipient SHOULD treat other octets in field
content (obs-text) as opaque data.
</quote>

Sending raw UTF-8 bytes and having them decoded as such has never been
part of the Servlet spec (and is discouraged by the HTTP spec).

Tomcat has never supported the use of RFC2047 encoding. It has been
considered in the past but I'm not aware of any mainstream client that
supports it.

Tomcat does allow raw UTF-8 in the cookie header (although neither the
Cookie nor the HTTP spec allows this) because most (all major?) browsers
send raw UTF-8 in the cookie header.

If you know that the data is always going to be UTF-8 then you can do
the (fairly ugly):

// Recover the raw bytes (ISO-8859-1 maps one char to one byte),
// then decode them as UTF-8.
String utf8Value = new String(
        headerValue.getBytes(StandardCharsets.ISO_8859_1),
        StandardCharsets.UTF_8);

The servlet spec should probably provide a mechanism to obtain the
header data as bytes and/or decode them using a given encoding.
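
Until such a mechanism exists, the re-decoding could be wrapped in a small helper that takes the target charset and refuses to corrupt values that are not valid in it. This is only a sketch: HeaderRedecoder and redecode are hypothetical names, not part of Tomcat or the Servlet API, and it assumes the container decoded the header as ISO-8859-1.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class HeaderRedecoder {
    private HeaderRedecoder() {}

    /**
     * Re-decodes a header value that was decoded as ISO-8859-1.
     * Returns the input unchanged when its bytes are not valid in the
     * requested charset, and null for a missing header.
     */
    static String redecode(String headerValue, Charset charset) {
        if (headerValue == null) {
            return null;
        }
        // ISO-8859-1 maps chars 0..255 to single bytes, which restores
        // the bytes as they appeared on the wire.
        byte[] raw = headerValue.getBytes(StandardCharsets.ISO_8859_1);
        try {
            return charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not valid in the target charset: keep the original value.
            return headerValue;
        }
    }
}
```

Usage would look like HeaderRedecoder.redecode(request.getHeader("HTTP_Voornaam"), StandardCharsets.UTF_8). The strict decoder is a design choice: a header that really was ISO-8859-1 (say "café") is left alone instead of being turned into replacement characters.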
Post by Jean Pierre Urkens
The headers are evaluated by a servlet 2.5 web application which has defined
a ‘CharacterEncodingFilter’ as first filter performing the following
request.setCharacterEncoding("UTF-8");
response.setContentType("text/html; charset=UTF-8");
response.setCharacterEncoding("UTF-8");
filterChain.doFilter(request, response);
None of those apply to HTTP headers.
Post by Jean Pierre Urkens
Is there a way to tell Tomcat to decode the headers as being UTF-8 encoded
bytes?
No.
Post by Jean Pierre Urkens
I am using Tomcat-version 8.5.32.
Thanks for providing that information. A lot of people forget.

Mark

---------------------------------------------------------------------
To unsubscribe, e-mail: users-***@tomcat.apache.org
For additional commands, e-mail: users-***@tomcat.apache.org
Michael Osipov
2018-10-03 10:11:05 UTC
Post by Jean Pierre Urkens
Is there a way to tell Tomcat to decode the headers as being UTF-8 encoded
bytes?
This is not defined, so do not expect it to work properly. The best and
most reliable thing you can do is to encode your values as described in
RFC 5987 (https://tools.ietf.org/html/rfc5987). This is the same approach
used for the Content-Disposition filename parameter. You may want to
evaluate mod_lua for that.
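
As a rough illustration of the RFC 5987 form (charset'language'percent-encoded-value), the sketch below encodes a value to ASCII and decodes it again. The class Rfc5987 and its methods are made-up names, the language tag is left empty, and input validation is minimal; in the setup discussed here the encoding side would live in Apache (for example via mod_lua) and only the decoding side in the web application.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Rfc5987 {
    // attr-char from RFC 5987: characters that may appear unescaped.
    private static final String ATTR_CHARS =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"
            + "!#$&+-.^_`|~";

    private Rfc5987() {}

    // Produces e.g. UTF-8''Ma%C5%82gor, which is pure US-ASCII.
    static String encode(String value) {
        StringBuilder sb = new StringBuilder("UTF-8''");
        for (byte b : value.getBytes(StandardCharsets.UTF_8)) {
            int u = b & 0xff;
            if (u < 0x80 && ATTR_CHARS.indexOf(u) >= 0) {
                sb.append((char) u);
            } else {
                sb.append('%').append(String.format("%02X", u));
            }
        }
        return sb.toString();
    }

    // Parses charset'language'value and percent-decodes the value part.
    static String decode(String extValue) {
        int first = extValue.indexOf('\'');
        int second = extValue.indexOf('\'', first + 1);
        if (first < 0 || second < 0) {
            throw new IllegalArgumentException("not an ext-value: " + extValue);
        }
        Charset charset = Charset.forName(extValue.substring(0, first));
        String encoded = extValue.substring(second + 1);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < encoded.length(); i++) {
            char c = encoded.charAt(i);
            if (c == '%') {
                if (i + 3 > encoded.length()) {
                    throw new IllegalArgumentException("truncated escape");
                }
                out.write(Integer.parseInt(encoded.substring(i + 1, i + 3), 16));
                i += 2;
            } else {
                out.write(c);
            }
        }
        return new String(out.toByteArray(), charset);
    }
}
```

Because the encoded form contains only US-ASCII, it survives whatever charset the container applies to header bytes.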

Everything else will make you suffer as you have seen.

Michael

