Updated: 2024-10-28 10:23:52

Is a Java char array always a valid UTF-16 (Big Endian) encoding?

Say I encode a Java character array (char[]) instance as bytes:

using two bytes for each character, in big-endian order (the most significant 8 bits in the leftmost byte and the least significant 8 bits in the rightmost byte)

Would this always create a valid UTF-16BE encoding? If not, which code points will result in an invalid encoding?

This question is very much related to this question about the Java char type and this question about the internal representation of Java strings.

Best answer

No. You can create char instances that contain any 16-bit value you desire; nothing constrains them to be valid UTF-16 code units, nor constrains an array of them to be a valid UTF-16 sequence. Even String does not require that its data be valid UTF-16:

char[] data = {'\uD800', 'b', 'c'}; // Unpaired lead surrogate
String str = new String(data);
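To make the problem concrete, here is a minimal sketch of the naive two-bytes-per-char encoding described in the question, applied to the array above. The class name NaiveUtf16 and the method toBigEndianBytes are hypothetical names chosen for illustration; they are not part of any library.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only (not a library API).
final class NaiveUtf16 {
    // Writes each char as two bytes, high byte first, with no validation at all.
    static byte[] toBigEndianBytes(char[] chars) {
        byte[] out = new byte[chars.length * 2];
        for (int i = 0; i < chars.length; i++) {
            out[2 * i] = (byte) (chars[i] >>> 8); // most significant 8 bits
            out[2 * i + 1] = (byte) chars[i];     // least significant 8 bits
        }
        return out;
    }

    public static void main(String[] args) {
        char[] data = {'\uD800', 'b', 'c'}; // unpaired lead surrogate
        byte[] bytes = toBigEndianBytes(data);
        // prints [-40, 0, 0, 98, 0, 99], i.e. D8 00 00 62 00 63 in hex
        System.out.println(Arrays.toString(bytes));
    }
}
```

The bytes D8 00 are a lead surrogate that is not followed by a trail surrogate, so the output is not well-formed UTF-16BE even though every char was copied faithfully.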

The requirements for valid UTF-16 data are set out in Chapter 3 of the Unicode Standard (basically, everything must be a Unicode scalar value, and all surrogates must be correctly paired). You can test if a char array is a valid UTF-16 sequence, and turn it into a sequence of UTF-16BE (or LE) bytes, by using a CharsetEncoder:

CharsetEncoder encoder = Charset.forName("UTF-16BE").newEncoder();
ByteBuffer bytes = encoder.encode(CharBuffer.wrap(data)); // throws MalformedInputException
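As a self-contained sketch of the same idea, the check can be wrapped in a small helper that returns a boolean instead of throwing. The class name Utf16Validator and the method isWellFormedUtf16 are hypothetical; the sketch relies on a freshly created encoder reporting malformed input by default, and uses StandardCharsets.UTF_16BE in place of Charset.forName.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only (not a library API).
final class Utf16Validator {
    // Returns true if the chars can be encoded as UTF-16BE without error.
    static boolean isWellFormedUtf16(char[] chars) {
        try {
            // A new encoder reports malformed input by default, so an
            // unpaired surrogate raises a CharacterCodingException.
            StandardCharsets.UTF_16BE.newEncoder().encode(CharBuffer.wrap(chars));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormedUtf16(new char[] {'\uD800', 'b', 'c'})); // false
        System.out.println(isWellFormedUtf16(new char[] {'a', 'b', 'c'}));      // true
        System.out.println(isWellFormedUtf16(new char[] {'\uD83D', '\uDE00'})); // true
    }
}
```

The last call passes because the two chars form a correctly paired lead and trail surrogate.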

(And similarly using a CharsetDecoder if you have bytes.)
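For the byte-oriented direction, a comparable sketch might look like the following. The class name Utf16BytesValidator and the method isWellFormedUtf16BE are hypothetical, and the sketch again assumes the default behaviour of a new decoder, which reports malformed input rather than replacing it.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only (not a library API).
final class Utf16BytesValidator {
    // Returns true if the bytes decode as UTF-16BE without error.
    static boolean isWellFormedUtf16BE(byte[] bytes) {
        try {
            StandardCharsets.UTF_16BE.newDecoder().decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] unpaired = {(byte) 0xD8, 0x00, 0x00, 0x62}; // lead surrogate, then 'b'
        byte[] plain = {0x00, 0x61, 0x00, 0x62};           // "ab"
        System.out.println(isWellFormedUtf16BE(unpaired)); // false
        System.out.println(isWellFormedUtf16BE(plain));    // true
    }
}
```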

Published: 2023-04-28 08:39:00

Article link: https://www.elefans.com/category/jswz/34/1331389.html