有没有人知道一个优化的方式来检测一个二进制数据块中的37位序列是最优的。 当然我可以使用窗口进行强力比较(只需比较索引0 +下一个36位,增量和循环直到找到它),但有没有更好的方法? 也许一些哈希搜索返回序列位于二进制块内的概率? 或者我只是把它从我的屁股中抽出来? 无论如何,我正在进行蛮力搜索,但我很好奇是否有更优化的东西。 顺便说一下,这在C中。
Does anyone know of an optimized way of detecting a 37 bit sequence in a chunk of binary data that is optimal. Sure I can do a brute force compare using windowing (just compare starting with index 0+next 36 bits, increment and loop until i find it) but is there a better way? Maybe some hashing search that returns a probability that the sequence lies within a binary chunk? Or am I just pulling that out of my butt? Anyway, I'm going ahead with the brute force search, but I was curious if there was something more optimal. This is in C by the way.
最满意答案
有趣的问题。 我假设你的37位序列可以在任何一个字节开始。 假设你的序列是这样表示的:
ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789@如果我们有一个字节对齐的算法,我们可以看到这些32位序列字节:
BCDEFGHIJKLMNOPQRSTUVWXYZ0123456 [call this pattern w_A] CDEFGHIJKLMNOPQRSTUVWXYZ01234567 [w_B, etc.] DEFGHIJKLMNOPQRSTUVWXYZ012345678 EFGHIJKLMNOPQRSTUVWXYZ0123456789 FGHIJKLMNOPQRSTUVWXYZ0123456789@ GHIJKLMNOPQRSTUVWXYZ0123456789@x HIJKLMNOPQRSTUVWXYZ0123456789@xx IJKLMNOPQRSTUVWXYZ0123456789@xxx只有这些字节值 - 不能是其他字节值 - 才能形成包含37位感兴趣的字节序列的第二个第三和第四个字节。
这导致了一个相当明显的实现:
unsigned char *p = ...; // input data size_t n = ...; // bytes available size_t bitpos; --n; p++; bitpos = 0; while (n--) { uint32_t word = *(uint32_t*)p; // nonportable, sorry. bitpos += 8; // compiler should be able to optimise this variable out completely if (word == w_A) { if ((p[4] & 0xF0 == 789@) && (p[-1] & 1 == A)) { // we found the data starting at the 8th bit of p-1 found_at(bitpos-1); } } else if (word == w_B) { if ((p[4] & 0xE0 == 89@) && (p[-1] & 3 == AB)) { // we found the data starting at the 7th bit of p-1 found_at(bitpos-2); } } else if (word == w_C} { ... } ... }显然这个策略存在问题。 首先,它可能需要在循环的第一次评估p [-1],但这很容易解决。 其次,它从奇数地址获取一个单词; 不适用于某些CPU - 例如SPARC和68k。 但是这样做是将4个比较合并为一个的简单方法。
kek444的建议将允许您使用像KMP这样的算法在数据流中跳过前进。 然而,跳过的最大尺寸并不是很大,所以尽管Turbo Boyer-Moore算法可能会将字节比较的数量减少4个左右,但如果字节比较的成本类似于单词比较的成本。
Interesting question. I assume your 37-bit sequence can begin at any point in a byte. Let's say your sequence is represented by this:
ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789@If we have a byte aligned algorithm, we could see these 32-bit sequences bytes:
BCDEFGHIJKLMNOPQRSTUVWXYZ0123456 [call this pattern w_A] CDEFGHIJKLMNOPQRSTUVWXYZ01234567 [w_B, etc.] DEFGHIJKLMNOPQRSTUVWXYZ012345678 EFGHIJKLMNOPQRSTUVWXYZ0123456789 FGHIJKLMNOPQRSTUVWXYZ0123456789@ GHIJKLMNOPQRSTUVWXYZ0123456789@x HIJKLMNOPQRSTUVWXYZ0123456789@xx IJKLMNOPQRSTUVWXYZ0123456789@xxxOnly these byte values - no others -could form the second third and fourth byte of a byte sequence containing the 37 bits of interest.
This leads to a reasonably obvious implementation:
unsigned char *p = ...; // input data size_t n = ...; // bytes available size_t bitpos; --n; p++; bitpos = 0; while (n--) { uint32_t word = *(uint32_t*)p; // nonportable, sorry. bitpos += 8; // compiler should be able to optimise this variable out completely if (word == w_A) { if ((p[4] & 0xF0 == 789@) && (p[-1] & 1 == A)) { // we found the data starting at the 8th bit of p-1 found_at(bitpos-1); } } else if (word == w_B) { if ((p[4] & 0xE0 == 89@) && (p[-1] & 3 == AB)) { // we found the data starting at the 7th bit of p-1 found_at(bitpos-2); } } else if (word == w_C} { ... } ... }Obviously there are problems with this strategy. First, it might want to evaluate p[-1] first time around the loop, but that's easy to fix. Second, it fetches a word from odd addresses; that wont work on some CPUs - SPARC and 68k for example. But doing so is an easy way to roll 4 comparisons into one.
kek444's suggestion would allow you to use a algorithm like KMP to skip forward in the data stream. However, the maximum size of the skip is not huge, so while the Turbo Boyer-Moore algorithm may reduce the number of byte comparisons by 4 or so, that may not be much of a win if the cost of a byte comparison is similar to the cost of a word comparision.
更多推荐
发布评论