我即将实现Boyer-Moore模式匹配算法(具体来说是周日算法)的变体,我在问自己:我的字母大小是多少?
I'm about to implement a variation of the Boyer-Moore pattern matching algorithm (the Sunday algorithm to be specific) and I was asking myself: What is my alphabet size?
这是否取决于编码(=可能的字符数),或者我是否可以仅假设我的字母由256个符号组成(=可以由一个字节表示的符号数)?
Does it depend on the encoding (= number of possible characters) or can I just assume my alphabet consists of 256 symbols (= number of symbols which can be represented by a byte)?
在许多其他情况下,将字符视为字节是一个问题,因为取决于编码,一个字符可以包含多个字节,但是如果在我的情况下,两个字符串都具有相同的编码,则相等的字符由相等的字节序列表示,因此我认为这没关系.
In many other situations treating characters as bytes would be a problem because depending on the encoding a character can consist of multiple bytes, but if in my case both strings have the same encoding then equal characters are represented by equal byte sequences, so I would assume it doesn't matter.
那么:我是否必须考虑编码并假定一个由实际字符组成的字母(对于Unicode,> 90000),还是可以将文本和模式作为字节流来处理?
So: Do I have to take the encoding into account and assume an alphabet consisting of the actual characters (> 90000 for Unicode) or can I just handle the text and the pattern as a stream of bytes?
推荐答案多字节编码可与面向字节的搜索例程 IF 一起使用,它是自我同步.
A multi-byte encoding can be used with a byte-oriented search routine IF it is self-synchronizing.
因此,您可以安全地将Boyer-Moore用于:
So, you can safely use Boyer-Moore with:
- CESU-8
- UTF-8
- UTF-EBCDIC
但可以不能与以下对象一起使用:
But can NOT use it with:
- Big5
- EUC-JP
- GBK/GB18030
- ISO 2022
- Johab
- Punycode
- Shift-JIS
- UTF-7
- UTF-16
- UTF-32
更多推荐
执行Boyer
发布评论