Converting string to UTF-8 using buffer

I need to convert a (possibly large) string to UTF-8, but I don't want to create a byte array containing the full encoding. My idea was to use a CharsetEncoder for this, but CharsetEncoder only acts on a CharBuffer, which means that supplementary characters (those outside the Unicode range 0x0000 to 0xFFFF) have to be taken into account.

The method I am using now is CharBuffer.wrap(String.substring(start, start + BLOCK_SIZE)), and my ByteBuffer is created with ByteBuffer.allocate((int) Math.ceil(encoder.maxBytesPerChar() * BLOCK_SIZE)). However, as I understand it, the CharBuffer will now contain BLOCK_SIZE code points, not code units (characters); I think the actual number of characters will be at most two times BLOCK_SIZE. That would mean my ByteBuffer is two times too small as well.

How can I calculate the correct number of bytes for my ByteBuffer? I could simply double it in case every single character is a supplementary character, but that seems a bit much. The only other reasonable option seems to be iterating over all code units (characters) or code points, which at the very least looks suboptimal.

Any hints on the most efficient approach to encoding Strings piecemeal? Should I use the buffer, iterate with String.codePointAt(location), or is there an encoding routine that handles code points directly?

Additional requirement: invalid character encodings should result in an exception; default substitution or skipping of invalid characters is not acceptable.
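
A note on the sizing premise above, offered as a sketch and not as part of the original post: String.substring indices, String.length() and CharBuffer positions all count Java chars (UTF-16 code units), not code points, and CharsetEncoder.maxBytesPerChar() is a bound per char. Under that reading, a block of BLOCK_SIZE chars fits into the buffer computed in the question without doubling. The class name Utf8SizingCheck and the helper maxUtf8Bytes are hypothetical names introduced only for this illustration.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

final class Utf8SizingCheck {

    /** Upper bound on the UTF-8 bytes needed for charCount Java chars (UTF-16 code units). */
    static int maxUtf8Bytes(int charCount) {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
        return (int) Math.ceil(encoder.maxBytesPerChar() * charCount);
    }

    public static void main(String[] args) {
        // 'a' followed by U+1F600, a supplementary code point stored as a surrogate pair
        String s = "a\uD83D\uDE00";

        System.out.println("chars (code units): " + s.length());
        System.out.println("code points:        " + s.codePointCount(0, s.length()));
        System.out.println("UTF-8 bytes:        " + s.getBytes(StandardCharsets.UTF_8).length);
        System.out.println("encoder bound:      " + maxUtf8Bytes(s.length()));
    }
}
```

One caveat with fixed-size substring blocks remains: a block boundary can fall between the two halves of a surrogate pair, and an encoder that is told each block is the end of the input would report the dangling half as malformed.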

Best answer

It seems easier to simply wrap the whole string and then keep encoding until no characters remain. There is no need to cut the string into parts; the encoder just reads characters until the output buffer is full:

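    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.CharsetEncoder;
    import java.nio.charset.CoderResult;
    import java.nio.charset.StandardCharsets;

    // input is the String to encode; BUFFER_SIZE should be at least 4,
    // the largest UTF-8 encoding of a single code point
    final CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
    final CharBuffer buffer = CharBuffer.wrap(input);
    final ByteBuffer encodedBuffer = ByteBuffer.allocate(BUFFER_SIZE);
    CoderResult coderResult;
    do {
        coderResult = encoder.encode(buffer, encodedBuffer, false);
        if (coderResult.isError()) {
            throw new IllegalArgumentException(
                    "Invalid code point in input string");
        }
        encodedBuffer.flip();
        // do stuff with encodedBuffer
        encodedBuffer.clear();
        // overflow: the output buffer filled up and input is still pending;
        // underflow: the encoder consumed all it can (a trailing unpaired
        // surrogate may be left over), so looping on hasRemaining() could spin
    } while (coderResult.isOverflow());

    // required by encoder: call encode with true to indicate end
    coderResult = encoder.encode(buffer, encodedBuffer, true);
    if (coderResult.isError()) {
        throw new IllegalArgumentException(
                "Invalid code point in input string");
    }
    // also part of the encoder contract: flush any internal state
    encoder.flush(encodedBuffer);
    encodedBuffer.flip();
    // do stuff with encodedBuffer
    encodedBuffer.clear(); // if still required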
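
The approach in this answer can be packaged as a small self-contained helper. What follows is a minimal sketch under some assumptions, not the answerer's own code: the class name ChunkedUtf8Encoder, the method encodeTo and the choice of an OutputStream as the sink are made up for illustration. It passes true for endOfInput on every call, which works here because the whole string is already wrapped in the CharBuffer, and it surfaces invalid input as a CharacterCodingException via CoderResult.throwException().

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class ChunkedUtf8Encoder {

    /** Encodes input as UTF-8 and writes it to out, at most bufferSize bytes at a time. */
    static void encodeTo(CharSequence input, OutputStream out, int bufferSize)
            throws IOException {
        if (bufferSize < 4) {
            // One code point can need up to 4 UTF-8 bytes; a smaller buffer
            // could never make progress on a supplementary character.
            throw new IllegalArgumentException("bufferSize must be at least 4");
        }
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        CharBuffer chars = CharBuffer.wrap(input);
        ByteBuffer bytes = ByteBuffer.allocate(bufferSize);

        CoderResult result;
        do {
            // endOfInput is true because the complete input is already wrapped.
            result = encoder.encode(chars, bytes, true);
            if (result.isError()) {
                result.throwException(); // CharacterCodingException, an IOException
            }
            drain(bytes, out);
        } while (result.isOverflow()); // overflow: output was full, input remains

        do {
            result = encoder.flush(bytes);
            drain(bytes, out);
        } while (result.isOverflow());
    }

    /** Writes the bytes produced so far to out and resets the buffer for reuse. */
    private static void drain(ByteBuffer bytes, OutputStream out) throws IOException {
        bytes.flip();
        out.write(bytes.array(), bytes.arrayOffset() + bytes.position(), bytes.remaining());
        bytes.clear();
    }

    public static void main(String[] args) throws IOException {
        String text = "h\u00e9llo \uD83D\uDE00 w\u00f6rld";
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        encodeTo(text, sink, 4); // deliberately tiny buffer to force many chunks
        System.out.println(
                Arrays.equals(sink.toByteArray(), text.getBytes(StandardCharsets.UTF_8)));
    }
}
```

Because the encoder only stops at code point boundaries when the output fills up, the buffer size affects throughput, not correctness, as long as it can hold one full code point.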
