Jean Pierre Urkens
2018-10-03 09:22:02 UTC
Hi everybody,
I am having an issue where Unicode characters (e.g. ł and & #105;) are
passed by the Apache HTTP Server 2.4 to Tomcat as UTF-8 encoded bytes, while
Tomcat seems to evaluate them as ISO-8859-15 encoded.
Having taken a network trace with tcpdump, I see the following bytes for my
header field (output truncated after byte 72):
0200  0a 48 54 54 50 5f 56 6f 6f 72 6e 61 61 6d 3a 20  .HTTP_Voornaam: 
0210  4d 61 c5 82 67 6f 72                             MaÅ.gor
Here the bytes C5 82 are the UTF-8 encoding of the Unicode character
ł (U+0142).
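To double-check that reading of the trace, here is a minimal sketch that decodes the captured header bytes under both charsets. The class and helper names (HeaderBytes, asUtf8, asLatin1) are made up for illustration, and ISO-8859-1 is used as a stand-in for ISO-8859-15 on the assumption that it behaves the same here, since both charsets map the bytes C5 and 82 to the same code points.

```java
import java.nio.charset.StandardCharsets;

public class HeaderBytes {
    // Header value bytes copied from the trace (offsets 0x0210 to 0x0216).
    static final byte[] WIRE = {0x4d, 0x61, (byte) 0xc5, (byte) 0x82, 0x67, 0x6f, 0x72};

    // Hypothetical helper: interpret the raw bytes as UTF-8.
    static String asUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Hypothetical helper: interpret the raw bytes one byte per character.
    // Assumption: ISO-8859-1 stands in for ISO-8859-15 (identical for C5 and 82).
    static String asLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String utf8 = asUtf8(WIRE);
        String latin1 = asLatin1(WIRE);
        // UTF-8 folds C5 82 into a single character, U+0142.
        System.out.printf("UTF-8:      %d chars, char[2] = U+%04X%n",
                utf8.length(), (int) utf8.charAt(2));
        // A single-byte charset yields two characters, U+00C5 and U+0082.
        System.out.printf("ISO-8859-1: %d chars, char[2] = U+%04X, char[3] = U+%04X%n",
                latin1.length(), (int) latin1.charAt(2), (int) latin1.charAt(3));
    }
}
```

Decoded as UTF-8 the value has six characters, decoded one byte per character it has seven, which is consistent with the Å followed by an unprintable character in the ASCII column of the trace.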
Now when inspecting the header value in Tomcat using:
String headerValue = request.getHeader("HTTP_Voornaam");
I'm getting the value MaÅ.gor, which seems to be the ISO-8859-15
representation of the bytes C5 82. The byte string from the tcpdump trace seems to
match the result of headerValue.getBytes(Charset.forName("ISO-8859-15"))
and not the result of headerValue.getBytes(Charset.forName("UTF-8")).
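The comparison can be reproduced outside Tomcat with the sketch below. It is only an illustration under assumptions: the string literal stands in for what request.getHeader(...) returned, ISO-8859-1 is used in place of ISO-8859-15 (the two agree for the bytes C5 and 82), and redecodeAsUtf8 is a hypothetical helper, not a Tomcat API.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class HeaderRoundTrip {
    // Hypothetical helper: recover the original bytes of a value that was
    // decoded one byte per character, then interpret those bytes as UTF-8.
    static String redecodeAsUtf8(String headerValue) {
        byte[] raw = headerValue.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Header value bytes as seen on the wire.
        byte[] wire = {0x4d, 0x61, (byte) 0xc5, (byte) 0x82, 0x67, 0x6f, 0x72};

        // Stand-in for the value observed in the servlet: "Ma", U+00C5, U+0082, "gor".
        String headerValue = "Ma\u00c5\u0082gor";

        // Encoding back with the single-byte charset reproduces the wire bytes.
        System.out.println(Arrays.equals(wire,
                headerValue.getBytes(StandardCharsets.ISO_8859_1)));   // true

        // Encoding as UTF-8 does not: U+00C5 and U+0082 each become two bytes.
        System.out.println(Arrays.equals(wire,
                headerValue.getBytes(StandardCharsets.UTF_8)));        // false

        // Re-decoding the recovered bytes as UTF-8 yields the intended text.
        System.out.println(redecodeAsUtf8(headerValue).equals("Ma\u0142gor")); // true
    }
}
```

This only shows that the bytes survive the round trip for this value. Whether re-decoding in the application is an acceptable workaround, rather than a connector setting, is exactly what I would like to find out.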
The FAQ (https://wiki.apache.org/tomcat/FAQ/CharacterEncoding) indicates
that headers are always in US-ASCII encoding. Anything outside of that
needs to be encoded; in this case it seems to be UTF-8 encoded.
The headers are evaluated by a Servlet 2.5 web application which defines
a CharacterEncodingFilter as its first filter. The filter performs the
following actions:
request.setCharacterEncoding("UTF-8");
response.setContentType("text/html; charset=UTF-8");
response.setCharacterEncoding("UTF-8");
filterChain.doFilter(request, response);
Is there a way to tell Tomcat to decode the headers as being UTF-8 encoded
bytes?
I am using Tomcat version 8.5.32.
Thanks for your support,
J.P.