二元序列检测器(binary sequence detector)

有没有人知道一个优化的方式来检测一个二进制数据块中的37位序列是最优的。当然我可以使用窗口进行强力比较（只需比较索引0 +下一个36位，增量和循环直到找到它），但有没有更好的方法？也许一些哈希搜索返回序列位于二进制块内的概率？或者我只是把它从我的屁股中抽出来？无论如何，我正在进行蛮力搜索，但我很好奇是否有更优化的东西。顺便说一下，这在C中。

Does anyone know of an optimized way of detecting a 37 bit sequence in a chunk of binary data that is optimal. Sure I can do a brute force compare using windowing (just compare starting with index 0+next 36 bits, increment and loop until i find it) but is there a better way? Maybe some hashing search that returns a probability that the sequence lies within a binary chunk? Or am I just pulling that out of my butt? Anyway, I'm going ahead with the brute force search, but I was curious if there was something more optimal. This is in C by the way.

最满意答案

有趣的问题。我假设你的37位序列可以在任何一个字节开始。假设你的序列是这样表示的：

ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789@

如果我们有一个字节对齐的算法，我们可以看到这些32位序列字节：

BCDEFGHIJKLMNOPQRSTUVWXYZ0123456 [call this pattern w_A] CDEFGHIJKLMNOPQRSTUVWXYZ01234567 [w_B, etc.] DEFGHIJKLMNOPQRSTUVWXYZ012345678 EFGHIJKLMNOPQRSTUVWXYZ0123456789 FGHIJKLMNOPQRSTUVWXYZ0123456789@ GHIJKLMNOPQRSTUVWXYZ0123456789@x HIJKLMNOPQRSTUVWXYZ0123456789@xx IJKLMNOPQRSTUVWXYZ0123456789@xxx

只有这些字节值 - 不能是其他字节值 - 才能形成包含37位感兴趣的字节序列的第二个第三和第四个字节。

这导致了一个相当明显的实现：

unsigned char *p = ...; // input data size_t n = ...; // bytes available size_t bitpos; --n; p++; bitpos = 0; while (n--) { uint32_t word = *(uint32_t*)p; // nonportable, sorry. bitpos += 8; // compiler should be able to optimise this variable out completely if (word == w_A) { if ((p[4] & 0xF0 == 789@) && (p[-1] & 1 == A)) { // we found the data starting at the 8th bit of p-1 found_at(bitpos-1); } } else if (word == w_B) { if ((p[4] & 0xE0 == 89@) && (p[-1] & 3 == AB)) { // we found the data starting at the 7th bit of p-1 found_at(bitpos-2); } } else if (word == w_C} { ... } ... }

显然这个策略存在问题。首先，它可能需要在循环的第一次评估p [-1]，但这很容易解决。其次，它从奇数地址获取一个单词; 不适用于某些CPU - 例如SPARC和68k。但是这样做是将4个比较合并为一个的简单方法。

kek444的建议将允许您使用像KMP这样的算法在数据流中跳过前进。然而，跳过的最大尺寸并不是很大，所以尽管Turbo Boyer-Moore算法可能会将字节比较的数量减少4个左右，但如果字节比较的成本类似于单词比较的成本。

Interesting question. I assume your 37-bit sequence can begin at any point in a byte. Let's say your sequence is represented by this:

ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789@

If we have a byte aligned algorithm, we could see these 32-bit sequences bytes:

BCDEFGHIJKLMNOPQRSTUVWXYZ0123456 [call this pattern w_A] CDEFGHIJKLMNOPQRSTUVWXYZ01234567 [w_B, etc.] DEFGHIJKLMNOPQRSTUVWXYZ012345678 EFGHIJKLMNOPQRSTUVWXYZ0123456789 FGHIJKLMNOPQRSTUVWXYZ0123456789@ GHIJKLMNOPQRSTUVWXYZ0123456789@x HIJKLMNOPQRSTUVWXYZ0123456789@xx IJKLMNOPQRSTUVWXYZ0123456789@xxx

Only these byte values - no others -could form the second third and fourth byte of a byte sequence containing the 37 bits of interest.

This leads to a reasonably obvious implementation:

unsigned char *p = ...; // input data size_t n = ...; // bytes available size_t bitpos; --n; p++; bitpos = 0; while (n--) { uint32_t word = *(uint32_t*)p; // nonportable, sorry. bitpos += 8; // compiler should be able to optimise this variable out completely if (word == w_A) { if ((p[4] & 0xF0 == 789@) && (p[-1] & 1 == A)) { // we found the data starting at the 8th bit of p-1 found_at(bitpos-1); } } else if (word == w_B) { if ((p[4] & 0xE0 == 89@) && (p[-1] & 3 == AB)) { // we found the data starting at the 7th bit of p-1 found_at(bitpos-2); } } else if (word == w_C} { ... } ... }

Obviously there are problems with this strategy. First, it might want to evaluate p[-1] first time around the loop, but that's easy to fix. Second, it fetches a word from odd addresses; that wont work on some CPUs - SPARC and 68k for example. But doing so is an easy way to roll 4 comparisons into one.

kek444's suggestion would allow you to use a algorithm like KMP to skip forward in the data stream. However, the maximum size of the skip is not huge, so while the Turbo Boyer-Moore algorithm may reduce the number of byte comparisons by 4 or so, that may not be much of a win if the cost of a byte comparison is similar to the cost of a word comparision.

更多推荐

二元序列检测器(binary sequence detector)

最满意答案

发布评论取消回复

最近发表

热门文章

标签列表