How to generate a random Unicode string including supplementary characters?

Updated: 2024-10-18 08:32:31

I'm working on some code for generating random strings. The resulting string appears to contain invalid char combinations. Specifically, I find high surrogates which are not followed by a low surrogate.

Can anyone explain why this is happening? Do I have to explicitly generate a random low surrogate to follow a high surrogate? I had assumed this wasn't needed, as I was using the int variants of the Character class.

Here's the test code, which on a recent run produced the following bad pairings:

    Bad pairing: d928 - d863
    Bad pairing: da02 - 7bb6
    Bad pairing: dbbc - d85c
    Bad pairing: dbc6 - d85c

    public static void main(String[] args) {
        Random r = new Random();
        StringBuilder builder = new StringBuilder();

        int count = 500;
        while (count > 0) {
            int codePoint = r.nextInt(Character.MAX_CODE_POINT + 1);

            if (!Character.isDefined(codePoint)
                    || Character.getType(codePoint) == Character.PRIVATE_USE) {
                continue;
            }

            builder.appendCodePoint(codePoint);
            count--;
        }

        String result = builder.toString();

        // Test the result
        char lastChar = 0;
        for (int i = 0; i < result.length(); i++) {
            char c = result.charAt(i);
            if (Character.isHighSurrogate(lastChar) && !Character.isLowSurrogate(c)) {
                System.out.println(String.format("Bad pairing: %s - %s",
                        Integer.toHexString(lastChar), Integer.toHexString(c)));
            }
            lastChar = c;
        }
    }

Best answer

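It's possible to randomly generate high or low surrogates: r.nextInt(Character.MAX_CODE_POINT + 1) also covers the surrogate range U+D800 to U+DFFF, Character.isDefined returns true for those values, and appendCodePoint appends each of them as a single char. If this results in a lone low surrogate, or a high surrogate not followed by a low surrogate, the resulting string is invalid. The solution is to simply exclude all surrogates: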

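    if (!Character.isDefined(codePoint)
            // Character.isSurrogate only accepts a char, so test the general category of the int code point
            || Character.getType(codePoint) == Character.SURROGATE
            || Character.getType(codePoint) == Character.PRIVATE_USE) {
        continue;
    }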
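
To see the filter in context, here is a minimal self-contained sketch of the generator plus a slightly stricter check than the one in the question (it also flags lone low surrogates). The class name RandomUnicode and the helpers isExcluded, randomString, and countUnpairedSurrogates are hypothetical names introduced for illustration. Surrogates are detected with Character.getType(codePoint) == Character.SURROGATE, which is one way to do the test on an int code point; a range check against Character.MIN_SURROGATE and Character.MAX_SURROGATE should work as well.

```java
import java.util.Random;

public class RandomUnicode {

    // True for code points that should not be appended on their own:
    // unassigned, surrogate, or private-use values.
    static boolean isExcluded(int codePoint) {
        return !Character.isDefined(codePoint)
                || Character.getType(codePoint) == Character.SURROGATE
                || Character.getType(codePoint) == Character.PRIVATE_USE;
    }

    // Builds a string of `count` random code points, retrying on excluded values.
    static String randomString(Random r, int count) {
        StringBuilder builder = new StringBuilder();
        while (count > 0) {
            int codePoint = r.nextInt(Character.MAX_CODE_POINT + 1);
            if (isExcluded(codePoint)) {
                continue;
            }
            builder.appendCodePoint(codePoint);
            count--;
        }
        return builder.toString();
    }

    // Counts surrogate chars that are not part of a well-formed high/low pair.
    static int countUnpairedSurrogates(String s) {
        int bad = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // skip the low half of a valid pair
                } else {
                    bad++; // high surrogate without a following low surrogate
                }
            } else if (Character.isLowSurrogate(c)) {
                bad++; // low surrogate without a preceding high surrogate
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        String result = randomString(new Random(), 500);
        System.out.println("Code points: " + result.codePointCount(0, result.length()));
        System.out.println("Unpaired surrogates: " + countUnpairedSurrogates(result));
    }
}
```

Supplementary characters still appear in the output: appendCodePoint writes any accepted code point above U+FFFF as a proper high/low pair, so only the standalone surrogate values need to be skipped.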

(Technically, you could also allow randomly generated high surrogates and add another random low surrogate, but this would only create other random code points >= 0x10000 which might in turn be undefined or for private use.)
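
The point in the parenthetical can be checked directly: joining a high and a low surrogate just yields some supplementary code point, which then needs the same filtering as any other. The following is a small sketch under that reading; the class SurrogatePairDemo and the helpers combine and classify are hypothetical names, and the sample pairs were picked by hand for illustration.

```java
public class SurrogatePairDemo {

    // Combines a high and a low surrogate into the code point they encode.
    static int combine(char high, char low) {
        if (!Character.isSurrogatePair(high, low)) {
            throw new IllegalArgumentException("not a high/low surrogate pair");
        }
        return Character.toCodePoint(high, low);
    }

    // Reports whether a code point would pass the filter used in the question.
    static String classify(int codePoint) {
        if (!Character.isDefined(codePoint)) {
            return "unassigned";
        }
        if (Character.getType(codePoint) == Character.PRIVATE_USE) {
            return "private use";
        }
        return "usable";
    }

    public static void main(String[] args) {
        char[][] pairs = {
            {'\uD83D', '\uDE00'}, // encodes U+1F600
            {'\uDB80', '\uDC00'}, // encodes U+F0000, in the plane 15 private-use area
            {'\uDBFF', '\uDFFF'}, // encodes U+10FFFF, a noncharacter
        };
        for (char[] p : pairs) {
            int cp = combine(p[0], p[1]);
            System.out.printf("%04x %04x -> U+%X (%s)%n",
                    (int) p[0], (int) p[1], cp, classify(cp));
        }
    }
}
```

Since every valid pair maps to exactly one code point in the range 0x10000 to 0x10FFFF, picking the two halves at random gains nothing over picking the code point itself and letting appendCodePoint produce the pair.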

Published: 2023-07-23 02:32:00
Article link: https://www.elefans.com/category/jswz/34/1226486.html