二元序列检测器(binary sequence detector)

编程入门 行业动态 更新时间:2024-10-25 20:20:46
二元序列检测器(binary sequence detector)

有没有人知道一个优化的方式来检测一个二进制数据块中的37位序列是最优的。 当然我可以使用窗口进行强力比较(只需比较索引0 +下一个36位,增量和循环直到找到它),但有没有更好的方法? 也许一些哈希搜索返回序列位于二进制块内的概率? 或者我只是把它从我的屁股中抽出来? 无论如何,我正在进行蛮力搜索,但我很好奇是否有更优化的东西。 顺便说一下,这在C中。

Does anyone know of an optimized way of detecting a 37 bit sequence in a chunk of binary data that is optimal. Sure I can do a brute force compare using windowing (just compare starting with index 0+next 36 bits, increment and loop until i find it) but is there a better way? Maybe some hashing search that returns a probability that the sequence lies within a binary chunk? Or am I just pulling that out of my butt? Anyway, I'm going ahead with the brute force search, but I was curious if there was something more optimal. This is in C by the way.

最满意答案

有趣的问题。 我假设你的37位序列可以在任何一个字节开始。 假设你的序列是这样表示的:

ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789@

如果我们有一个字节对齐的算法,我们可以看到这些32位序列字节:

BCDEFGHIJKLMNOPQRSTUVWXYZ0123456 [call this pattern w_A] CDEFGHIJKLMNOPQRSTUVWXYZ01234567 [w_B, etc.] DEFGHIJKLMNOPQRSTUVWXYZ012345678 EFGHIJKLMNOPQRSTUVWXYZ0123456789 FGHIJKLMNOPQRSTUVWXYZ0123456789@ GHIJKLMNOPQRSTUVWXYZ0123456789@x HIJKLMNOPQRSTUVWXYZ0123456789@xx IJKLMNOPQRSTUVWXYZ0123456789@xxx

只有这些字节值 - 不能是其他字节值 - 才能形成包含37位感兴趣的字节序列的第二个第三和第四个字节。

这导致了一个相当明显的实现:

unsigned char *p = ...; // input data size_t n = ...; // bytes available size_t bitpos; --n; p++; bitpos = 0; while (n--) { uint32_t word = *(uint32_t*)p; // nonportable, sorry. bitpos += 8; // compiler should be able to optimise this variable out completely if (word == w_A) { if ((p[4] & 0xF0 == 789@) && (p[-1] & 1 == A)) { // we found the data starting at the 8th bit of p-1 found_at(bitpos-1); } } else if (word == w_B) { if ((p[4] & 0xE0 == 89@) && (p[-1] & 3 == AB)) { // we found the data starting at the 7th bit of p-1 found_at(bitpos-2); } } else if (word == w_C} { ... } ... }

显然这个策略存在问题。 首先,它可能需要在循环的第一次评估p [-1],但这很容易解决。 其次,它从奇数地址获取一个单词; 不适用于某些CPU - 例如SPARC和68k。 但是这样做是将4个比较合并为一个的简单方法。

kek444的建议将允许您使用像KMP这样的算法在数据流中跳过前进。 然而,跳过的最大尺寸并不是很大,所以尽管Turbo Boyer-Moore算法可能会将字节比较的数量减少4个左右,但如果字节比较的成本类似于单词比较的成本。

Interesting question. I assume your 37-bit sequence can begin at any point in a byte. Let's say your sequence is represented by this:

ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789@

If we have a byte aligned algorithm, we could see these 32-bit sequences bytes:

BCDEFGHIJKLMNOPQRSTUVWXYZ0123456 [call this pattern w_A] CDEFGHIJKLMNOPQRSTUVWXYZ01234567 [w_B, etc.] DEFGHIJKLMNOPQRSTUVWXYZ012345678 EFGHIJKLMNOPQRSTUVWXYZ0123456789 FGHIJKLMNOPQRSTUVWXYZ0123456789@ GHIJKLMNOPQRSTUVWXYZ0123456789@x HIJKLMNOPQRSTUVWXYZ0123456789@xx IJKLMNOPQRSTUVWXYZ0123456789@xxx

Only these byte values - no others -could form the second third and fourth byte of a byte sequence containing the 37 bits of interest.

This leads to a reasonably obvious implementation:

unsigned char *p = ...; // input data size_t n = ...; // bytes available size_t bitpos; --n; p++; bitpos = 0; while (n--) { uint32_t word = *(uint32_t*)p; // nonportable, sorry. bitpos += 8; // compiler should be able to optimise this variable out completely if (word == w_A) { if ((p[4] & 0xF0 == 789@) && (p[-1] & 1 == A)) { // we found the data starting at the 8th bit of p-1 found_at(bitpos-1); } } else if (word == w_B) { if ((p[4] & 0xE0 == 89@) && (p[-1] & 3 == AB)) { // we found the data starting at the 7th bit of p-1 found_at(bitpos-2); } } else if (word == w_C} { ... } ... }

Obviously there are problems with this strategy. First, it might want to evaluate p[-1] first time around the loop, but that's easy to fix. Second, it fetches a word from odd addresses; that wont work on some CPUs - SPARC and 68k for example. But doing so is an easy way to roll 4 comparisons into one.

kek444's suggestion would allow you to use a algorithm like KMP to skip forward in the data stream. However, the maximum size of the skip is not huge, so while the Turbo Boyer-Moore algorithm may reduce the number of byte comparisons by 4 or so, that may not be much of a win if the cost of a byte comparison is similar to the cost of a word comparision.

更多推荐

本文发布于:2023-08-06 05:06:00,感谢您对本站的认可!
本文链接:https://www.elefans.com/category/jswz/34/1444415.html
版权声明:本站内容均来自互联网,仅供演示用,请勿用于商业和其他非法用途。如果侵犯了您的权益请与我们联系,我们将在24小时内删除。
本文标签:检测器   序列   binary   sequence   detector

发布评论

评论列表 (有 0 条评论)
草根站长

>www.elefans.com

编程频道|电子爱好者 - 技术资讯及电子产品介绍!