Why can Java display Chinese characters although it is using the wrong encoding?

I have a bunch of Chinese characters stored in, say, a DB or an XML file, using UTF-8 encoding. I need to read this information in my Java code. I read the XML using a DOM parser and stored the Chinese characters in a String object. The String was later displayed in a JSP page and printed to the System.out console. It works fine, and I do not know why.

As I understand it, Java should need the proper encoding (in this case UTF-8) to store the Chinese characters. But when I checked the default encoding used by the JVM, it was not UTF-8 or UTF-16. It was something like Cp1522 (I am not sure this is correct; I cannot recollect the exact value, my apologies).

So it should not be able to print the values, right? Could you please help me understand why this works?


ANSWERS:


The "default" you refer to is probably the "platform default", which is used when no other encoding information is available, and only when character data is read into or written out of the JVM. Once inside the JVM, all characters are represented in UTF-16. The encoding you mentioned is probably Cp1252. It is impossible to represent Chinese characters in that encoding, so that's not what's happening. You'd have to be more specific about what you are doing, but the XML parser you're using is probably detecting the correct encoding and thus not garbling the text.
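A minimal sketch to check this yourself, assuming a JDK that ships the windows-1252 charset (Java's canonical name for Cp1252). The class name DefaultCharsetDemo and the helper canEncode are made up for illustration; the string is written with Unicode escapes so the source file's own encoding doesn't matter.

```java
import java.nio.charset.Charset;

public class DefaultCharsetDemo {
    // Hypothetical helper: can the named charset represent every char in the text?
    static boolean canEncode(String charsetName, String text) {
        return Charset.forName(charsetName).newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        // The platform default only matters at I/O boundaries, not inside a String.
        System.out.println("Platform default: " + Charset.defaultCharset());

        String text = "\u4f60\u597d"; // the two characters 你好

        System.out.println("windows-1252 can encode it: " + canEncode("windows-1252", text));
        System.out.println("UTF-8 can encode it: " + canEncode("UTF-8", text));
    }
}
```

The first check should come out false and the second true, which is why the platform default cannot be what is carrying your Chinese text.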


Assuming everything is working, this is how it'd work:

Your XML parser decodes the XML and converts it to Java's internal representation (effectively UTF-16 -- a Java char is actually a UTF-16 code unit, not a "character").
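Roughly, the parsing step looks like the sketch below. It assumes the parser is handed raw bytes (an InputStream), so it can pick up the encoding from the XML declaration rather than from the platform default. The class name XmlDecodeDemo, the helper parseText and the greeting element are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class XmlDecodeDemo {
    // Hypothetical helper: parse raw XML bytes and return the root element's text.
    static String parseText(byte[] xmlBytes) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xmlBytes));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // Unicode escapes stand for 你好, so the source file encoding is irrelevant.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<greeting>\u4f60\u597d</greeting>";
        byte[] utf8Bytes = xml.getBytes(StandardCharsets.UTF_8);

        String text = parseText(utf8Bytes);
        // Each BMP character is one Java char once decoded.
        System.out.println("chars in parsed text: " + text.length());
    }
}
```

If you instead wrap the file in a Reader created without an explicit charset, the platform default is used for decoding and the parser never gets a chance to detect anything.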

When you render a JSP it's encoding the page based on your Servlet container configuration. The HTTP headers probably include the encoding being used, so your browser can decode it correctly.

Here's where it becomes unclear whether things really are working. What ends up in System.out depends on how you're writing to it. You say "printed", so I'm guessing you're using the print methods, which means the platform's default character encoding is being used. If this encoding really is CP-1252 (the only one I can think of that sounds like Cp1522) and the result looks "right", then actually something is wrong.
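To see what the print methods do with a given encoding, here is a small sketch that prints into an in-memory buffer instead of the real console. The class name PrintEncodingDemo and the helper printWith are made up for illustration, and windows-1252 is assumed to be available in your JDK.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class PrintEncodingDemo {
    // Hypothetical helper: print the text through a PrintStream that uses the
    // named charset and return the bytes that were actually written.
    static byte[] printWith(String charsetName, String text)
            throws UnsupportedEncodingException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        PrintStream out = new PrintStream(sink, true, charsetName);
        out.print(text);
        out.flush();
        return sink.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        String text = "\u4f60\u597d"; // 你好

        // A charset that cannot represent the characters substitutes a replacement byte.
        System.out.println("windows-1252 bytes: " + printWith("windows-1252", text).length);
        // UTF-8 writes three bytes for each of these characters.
        System.out.println("UTF-8 bytes: " + printWith("UTF-8", text).length);
    }
}
```

If you want System.out itself to emit UTF-8, wrapping it as new PrintStream(System.out, true, "UTF-8") is one option, though the console still has to be able to display it.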

CP-1252 is essentially Latin-1 (the two differ only in the 0x80-0x9F range), which is sometimes abused into being treated as "bytes == chars". That would suggest that your multi-byte Chinese characters are actually being converted into multiple Java chars. This would only be correct behavior for characters outside the BMP (plane 0), and in that case each character should become a surrogate pair.

To test what's going on, try putting the two characters 你好 into your XML and checking the length of the parsed String. The length should be 2 (both are BMP characters). If the length is larger (probably 6, one char per UTF-8 byte), then you're decoding incorrectly, and things only seem to work because you're re-encoding the same (wrong) way.
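The sketch below reproduces both outcomes without any XML, just by decoding the UTF-8 bytes of 你好 with the right and the wrong charset. The class name LengthCheckDemo and the helper decode are made up for illustration, and windows-1252 is assumed to be available in your JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class LengthCheckDemo {
    // Hypothetical helper: decode raw bytes with the named charset.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // UTF-8 bytes of 你好: three bytes per character.
        byte[] utf8Bytes = "\u4f60\u597d".getBytes(StandardCharsets.UTF_8);

        String right = decode(utf8Bytes, "UTF-8");
        String wrong = decode(utf8Bytes, "windows-1252");

        System.out.println("decoded as UTF-8: " + right.length() + " chars");
        System.out.println("decoded as windows-1252: " + wrong.length() + " chars");

        // Re-encoding the wrong String the same wrong way restores the original
        // bytes, which is why a double mistake can look like it works.
        byte[] roundTrip = wrong.getBytes(Charset.forName("windows-1252"));
        System.out.println("round trip restores bytes: " + Arrays.equals(utf8Bytes, roundTrip));
    }
}
```

The round trip only survives for bytes that windows-1252 actually defines, so other characters can get corrupted even when these two appear fine.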


I recommend setting your IDE's default workspace encoding to "UTF-8". Otherwise the IDE may change the encoding when you modify the XML files.

Anyway, you seem to be more interested in how DOMParser works. DOMParser can decide its encoding on its own, and it probably falls back to its own default. You can debug into it and see which encoding it is using.


