VBA 哈希字符串

编程入门 行业动态 更新时间:2024-10-26 21:19:57
本文介绍了VBA 哈希字符串的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

如何使用 Excel VBA 获取长字符串的短哈希

How do I get a short hash of a long string using Excel VBA

给出了什么

输入字符串不超过 80 个字符有效输入字符为: [0..9] [A_Z] ._/有效输出字符为 [0..9] [A_Z] [a_z] (大小写均可)输出哈希不应超过 ~12 个字符(越短越好)根本不需要是唯一的,因为这会导致哈希过长

到目前为止我所做的

我认为 这个 SO 答案 是一个好的开始,因为它生成了一个 4 位十六进制代码 (CRC16).

I thought this SO answer is a good start since it generates a 4-digit Hex-Code (CRC16).

但是 4 位数太少了.在我对 400 个字符串的测试中,20% 在其他地方得到了重复.
产生碰撞的机会太高.

But 4 digits were to little. In my test with 400 strings 20% got a duplicate somewhere else.
The chance to generate a collision is too high.

Sub tester()
    For i = 2 To 433
        Cells(i, 2) = CRC16(Cells(i, 1))
    Next i
End Sub


Function CRC16(txt As String)
Dim x As Long
Dim mask, i, j, nC, Crc As Integer
Dim c As String

Crc = &HFFFF

For nC = 1 To Len(txt)
    j = Val("&H" + Mid(txt, nC, 2))
    Crc = Crc Xor j
    For j = 1 To 8
        mask = 0
        If Crc / 2 <> Int(Crc / 2) Then mask = &HA001
        Crc = Int(Crc / 2) And &H7FFF: Crc = Crc Xor mask
    Next j
Next nC

CRC16 = Hex$(Crc)
End Function

如何复制

您可以从 pastebin 复制这 400 个测试字符串.
将它们粘贴到新 Excel 工作簿中的 A 列并执行上面的代码.

You can copy these 400 test strings from pastebin.
Paste them to column A in a new Excel workbook and execute the code above.

问:我如何获得足够短(12 个字符)和足够长的字符串哈希以获取少量重复项.

Q: How do I get a string hash which is short enough (12 chars) and long enough to get a small percentage of duplicates.

推荐答案

将您的字符串拆分为三个较短的字符串(如果不能被 3 整除,则最后一个将比其他两个长).对每个运行短"算法,并将结果连接起来.

Split your string into three shorter strings (if not divisible by three, the last one will be longer than the other two). Run your "short" algorithm on each, and concatenate the results.

我可以编写代码,但根据问题的质量,我认为您可以从这里获取!

I could write the code but based on the quality of the question I think you can take it from here!

事实证明,这个建议是不够​​的.您的原始 CRC16 代码中存在严重缺陷 - 即以下行:

It turns out that that advice is not enough. There is a serious flaw in your original CRC16 code - namely the line that says:

j = Val("&H" + Mid(txt, nC, 2))

这只处理可以解释为十六进制值的文本:小写和大写字母相同,字母表中 F 之后的任何内容都将被忽略(据我所知).任何好的东西出来都是一个奇迹.如果你用

This only handles text that can be interpreted as hex values: lowercase and uppercase letters are the same, and anything after F in the alphabet is ignored (as far as I can tell). That anything good comes out at all is a miracle. If you replace the line with

j = asc(mid(txt, nC, 1))

事情变得更好了 - 每个 ASCII 代码至少从生命开始都是作为它自己的价值.

Things work better - every ASCII code at least starts out life as its own value.

将此更改与我之前提出的建议相结合,您将获得以下代码:

Combining this change with the proposal I made earlier, you get the following code:

Function hash12(s As String)
' create a 12 character hash from string s

Dim l As Integer, l3 As Integer
Dim s1 As String, s2 As String, s3 As String

l = Len(s)
l3 = Int(l / 3)
s1 = Mid(s, 1, l3)      ' first part
s2 = Mid(s, l3 + 1, l3) ' middle part
s3 = Mid(s, 2 * l3 + 1) ' the rest of the string...

hash12 = hash4(s1) + hash4(s2) + hash4(s3)

End Function

Function hash4(txt)
' copied from the example
Dim x As Long
Dim mask, i, j, nC, crc As Integer
Dim c As String

crc = &HFFFF

For nC = 1 To Len(txt)
    j = Asc(Mid(txt, nC)) ' <<<<<<< new line of code - makes all the difference
    ' instead of j = Val("&H" + Mid(txt, nC, 2))
    crc = crc Xor j
    For j = 1 To 8
        mask = 0
        If crc / 2 <> Int(crc / 2) Then mask = &HA001
        crc = Int(crc / 2) And &H7FFF: crc = crc Xor mask
    Next j
Next nC

c = Hex$(crc)

' <<<<< new section: make sure returned string is always 4 characters long >>>>>
' pad to always have length 4:
While Len(c) < 4
  c = "0" & c
Wend

hash4 = c

End Function

您可以将此代码作为 =hash12("A2") 等放置在电子表格中.为了好玩,您还可以使用新的、改进的"hash4 算法,并查看它们的比较.我创建了一个数据透视表来计算冲突 - hash12 算法没有,而 hash4 只有 3 个.我相信你可以弄清楚如何创建 hash8,...由此.您的问题中的无需唯一"表明也许改进的"hash4 就是您所需要的.

You can place this code in your spreadsheet as =hash12("A2") etc. For fun, you can also use the "new, improved" hash4 algorithm, and see how they compare. I created a pivot table to count collisions - there were none for the hash12 algorithm, and only 3 for the hash4. I'm sure you can figure out how to create hash8, ... from this. The "no need to be unique" from your question suggests that maybe the "improved" hash4 is all you need.

原则上,一个四字符的十六进制应该有 64k 个唯一值——所以两个随机字符串具有相同散列的几率是 64k 中的 1 个.当您有 400 个字符串时,有 400 x 399/2 个可能的碰撞对"~ 80k 机会(假设您有高度随机的字符串).因此,在样本数据集中观察三个碰撞并不是一个不合理的分数.随着字符串数量 N 的增加,冲突的概率与 N 的平方成正比. 有了 hash12 中额外的 32 位信息,您希望在 N > 20 M 左右时看到冲突(handwaving,in-my-头数学).

In principle, a four character hex should have 64k unique values - so the chance of two random strings having the same hash would be 1 in 64k. When you have 400 strings, there are 400 x 399 / 2 "possible collision pairs" ~ 80k opportunities (assuming you had highly random strings). Observing three collisions in the sample dataset is therefore not an unreasonable score. As your number of strings N goes up, the probability of collisions goes as the square of N. With the extra 32 bits of information in the hash12, you expect to see collisions when N > 20 M or so (handwaving, in-my-head-math).

显然,您可以使 hash12 代码更紧凑一些 - 并且应该很容易看到如何将其扩展到任意长度.

You can make the hash12 code a little bit more compact, obviously - and it should be easy to see how to extend it to any length.

哦——还有最后一件事.如果您启用了 RC 寻址,使用 =CRC16("string") 作为电子表格公式会产生难以跟踪的 #REF 错误......这就是为什么我重命名它hash4

Oh - and one last thing. If you have RC addressing enabled, using =CRC16("string") as a spreadsheet formula gives a hard-to-track #REF error... which is why I renamed it hash4

这篇关于VBA 哈希字符串的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

更多推荐

[db:关键词]

本文发布于:2023-03-31 01:16:53,感谢您对本站的认可!
本文链接:https://www.elefans.com/category/jswz/34/793596.html
版权声明:本站内容均来自互联网,仅供演示用,请勿用于商业和其他非法用途。如果侵犯了您的权益请与我们联系,我们将在24小时内删除。
本文标签:字符串   VBA

发布评论

评论列表 (有 0 条评论)
草根站长

>www.elefans.com

编程频道|电子爱好者 - 技术资讯及电子产品介绍!