pcre2 UTF32 usage

I've just spent some time figuring out the pcre2 interface and think I've got it for the most part. I want to support UTF32. pcre2 is already built with UTF32 support, and the code unit width has been set to 32.

The code below is what I have working with the code unit width set to 8. How do I change it to work with UTF32?

    #include "gtest/gtest.h"
    #include <cstdio>
    #include <cstring>
    #include <string>
    // The code unit width must be defined before pcre2.h is included.
    #define PCRE2_CODE_UNIT_WIDTH 8
    #include <pcre2.h>

    using std::string;

    TEST(PCRE2, example) {
        // Iterate over all matches in a string.
        // Named strings keep the buffers alive; c_str() on a temporary would dangle.
        string subject_str("this is it");
        string pattern_str("([a-z]+)|\\s");
        PCRE2_SPTR subject = (PCRE2_SPTR) subject_str.c_str();
        PCRE2_SPTR pattern = (PCRE2_SPTR) pattern_str.c_str();
        int errorcode;
        PCRE2_SIZE erroroffset;
        pcre2_code *re = pcre2_compile(pattern, PCRE2_ZERO_TERMINATED,
                                       PCRE2_ANCHORED | PCRE2_UTF,
                                       &errorcode, &erroroffset, NULL);
        if (re) {
            // PCRE2_INFO_CAPTURECOUNT is the number of capture groups;
            // PCRE2_INFO_BACKREFMAX is only the highest back reference used.
            uint32_t groupcount = 0;
            pcre2_pattern_info(re, PCRE2_INFO_CAPTURECOUNT, &groupcount);
            pcre2_match_data *match_data =
                pcre2_match_data_create_from_pattern(re, NULL);
            uint32_t options_exec = PCRE2_NOTEMPTY;
            PCRE2_SIZE subjectlen = strlen((const char *) subject);
            errorcode = pcre2_match(re, subject, subjectlen, 0, options_exec,
                                    match_data, NULL);
            while (errorcode >= 0) {
                for (uint32_t i = 0; i <= groupcount; i++) {
                    PCRE2_UCHAR *result;
                    PCRE2_SIZE resultlen;
                    // Skip groups that did not take part in this match.
                    if (pcre2_substring_get_bynumber(match_data, i, &result,
                                                     &resultlen) != 0)
                        continue;
                    printf("Matched:%.*s\n", (int) resultlen, (const char *) result);
                    pcre2_substring_free(result);
                }
                // Advance through subject
                PCRE2_SIZE *ovector = pcre2_get_ovector_pointer(match_data);
                errorcode = pcre2_match(re, subject, subjectlen, ovector[1],
                                        options_exec, match_data, NULL);
            }
            pcre2_match_data_free(match_data);
            pcre2_code_free(re);
        } else {
            // Syntax error in the regular expression at erroroffset
            PCRE2_UCHAR error[256];
            pcre2_get_error_message(errorcode, error, sizeof(error));
            printf("PCRE2 compilation failed at offset %d: %s\n",
                   (int) erroroffset, (char *) error);
        }
    }

Presumably subject and pattern need to be converted somehow, and result would be of the same type? I couldn't find anything in the pcre2 header to indicate support for that conversion. And I guess subjectlen would no longer be simply strlen.
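To make the conversion concern concrete, here is a minimal sketch of decoding UTF-8 into 32-bit code units in standard C++. It assumes that the 32-bit PCRE2 library (selected by defining PCRE2_CODE_UNIT_WIDTH as 32 before including pcre2.h) expects pattern and subject as arrays of 32-bit code units, with lengths and offsets counted in code units rather than bytes. The helper name utf8_to_utf32 is hypothetical and not part of PCRE2; as far as I know PCRE2 does not ship an encoding converter of its own.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not a PCRE2 function): decode UTF-8 into UTF-32,
// producing one 32-bit code unit per code point.
// Minimal sketch: it rejects truncated or malformed sequences, but it does
// not reject overlong encodings or surrogate values.
std::vector<uint32_t> utf8_to_utf32(const std::string& in) {
    std::vector<uint32_t> out;
    for (std::size_t i = 0; i < in.size();) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        uint32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80) {
            cp = lead;
        } else if ((lead & 0xE0) == 0xC0) {
            cp = lead & 0x1F;
            extra = 1;
        } else if ((lead & 0xF0) == 0xE0) {
            cp = lead & 0x0F;
            extra = 2;
        } else if ((lead & 0xF8) == 0xF0) {
            cp = lead & 0x07;
            extra = 3;
        } else {
            throw std::invalid_argument("invalid UTF-8 lead byte");
        }
        if (extra > in.size() - i - 1) {
            throw std::invalid_argument("truncated UTF-8 sequence");
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) {
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            }
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

Under those assumptions, the vector's data() pointer would be cast to PCRE2_SPTR and its size() passed as the subject length in place of strlen. Substrings returned by the 32-bit library would also be 32-bit code units, so printing them with "%.*s" would no longer work without converting back to UTF-8 first.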

Finally, I put this example together after going through some of the docs and the header. Is there anything else I should be doing, or that is worth knowing?

Best answer

In the end I moved away from pcre2. After evaluating RE2, PCRE2 and ICU, I chose ICU. Its Unicode support is (from what I've seen so far) much more complete than that of the other two. It also provides a very clean API and lots of utilities for text manipulation. Importantly, like PCRE2, it provides a Perl-style regex engine, which works well with Unicode out of the box.
