Computing the MD5 hash of a UTF-8 string

Updated: 2024-10-28 20:17:05
Problem description

I have an SQL table in which I store large string values that must be unique. In order to ensure the uniqueness, I have a unique index on a column in which I store a string representation of the MD5 hash of the large string.

The C# app that saves these records uses the following method to do the hashing:

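using System.Linq;
using System.Security.Cryptography;

public static string CreateMd5HashString(byte[] input)
{
    var hashBytes = MD5.Create().ComputeHash(input);
    // Note: "X" does not zero-pad, so a byte below 0x10 yields a single hex digit.
    return string.Join("", hashBytes.Select(b => b.ToString("X")));
}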

In order to call this, I first convert the string to byte[] using the UTF-8 encoding:

// this is what I use in my app
CreateMd5HashString(Encoding.UTF8.GetBytes("abc"))
// result: 90150983CD24FB0D6963F7D28E17F72
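The result above has 31 characters rather than 32 because the "X" format specifier does not zero-pad each byte. The following is a minimal Python sketch (Python is used here only for illustration, and `create_md5_hash_string` is a hypothetical stand-in for the C# method) that mimics the unpadded formatting and compares it with the conventional padded digest. Any SQL-side reimplementation would presumably have to reproduce the same unpadded format to match the stored values.

```python
import hashlib


def create_md5_hash_string(data: bytes) -> str:
    """Mimic the C# method: uppercase hex per byte, WITHOUT zero padding."""
    digest = hashlib.md5(data).digest()
    return "".join(format(b, "X") for b in digest)


utf8_bytes = "abc".encode("utf-8")

# Unpadded form, as produced by b.ToString("X") in the question.
unpadded = create_md5_hash_string(utf8_bytes)

# Conventional form: two hex digits per byte.
padded = hashlib.md5(utf8_bytes).hexdigest().upper()

print(unpadded, len(unpadded))
print(padded, len(padded))
```

The second byte of the digest is 0x01, which the unpadded format writes as "1"; that is where the missing character goes.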

Now I would like to be able to implement this hashing function in SQL, using the HASHBYTES function, but I get a different value:

print hashbytes('md5', N'abc') -- result: 0xCE1473CF80C6B3FDA8E3DFC006ADC315

This is because SQL computes the MD5 of the UTF-16 representation of the string. I get the same result in C# if I do CreateMd5HashString(Encoding.Unicode.GetBytes("abc")).
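To make the encoding difference concrete, here is a small Python sketch (for illustration only; it assumes that `Encoding.UTF8` corresponds to Python's "utf-8" codec and `Encoding.Unicode` to "utf-16-le"). It shows that the same three characters become different byte sequences, and therefore different MD5 digests. The UTF-16 digest it prints can be compared with the HASHBYTES output quoted above.

```python
import hashlib

text = "abc"

# Bytes hashed by the application (UTF-8): one byte per ASCII character.
utf8_bytes = text.encode("utf-8")

# Bytes hashed when an NVARCHAR value is passed to HASHBYTES (UTF-16 little-endian):
# two bytes per character here.
utf16_bytes = text.encode("utf-16-le")

utf8_md5 = hashlib.md5(utf8_bytes).hexdigest().upper()
utf16_md5 = hashlib.md5(utf16_bytes).hexdigest().upper()

print(utf8_bytes.hex(), utf8_md5)
print(utf16_bytes.hex(), utf16_md5)
```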

I cannot change the way hashing is done in the application.

Is there a way to get SQL Server to compute the MD5 hash of the UTF-8 bytes of the string?

I looked up similar questions and tried using collations, but have had no luck so far.

Solution

You need to create a UDF that converts the NVARCHAR data to its UTF-8 byte representation. If it is called dbo.NCharToUTF8Binary, you can then do:

hashbytes('md5', dbo.NCharToUTF8Binary(N'abc', 1))

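Here is a UDF that does this. The @modified flag selects modified UTF-8, which encodes U+0000 as the two bytes 0xC0 0x80; pass 0 for standard UTF-8.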

create function dbo.NCharToUTF8Binary(@txt nvarchar(max), @modified bit)
returns varbinary(max)
as
begin
    -- Note: This is not the fastest possible routine.
    -- If you want a fast routine, use SQLCLR.
    set @modified = isnull(@modified, 0)

    -- First shred into a table.
    declare @chars table (
        ix int identity primary key,
        codepoint int,
        utf8 varbinary(6)
    )
    declare @ix int
    set @ix = 0
    while @ix < datalength(@txt)/2 -- datalength rather than len, so trailing spaces are counted
    begin
        set @ix = @ix + 1
        insert @chars(codepoint)
        select unicode(substring(@txt, @ix, 1))
    end

    -- Now look for surrogate pairs.
    -- If we find a pair (lead followed by trail) we will pair them.
    -- High surrogate is \uD800 to \uDBFF
    -- Low surrogate is \uDC00 to \uDFFF
    -- Each half carries 10 bits: 0x10000 + (high & 0x3ff) * 0x400 + (low & 0x3ff)
    update c1
    set codepoint = ((c1.codepoint & 0x03ff) * 0x0400) + (c2.codepoint & 0x03ff) + 0x10000
    from @chars c1
    inner join @chars c2 on c1.ix = c2.ix - 1
    where c1.codepoint >= 0xD800 and c1.codepoint <= 0xDBFF
      and c2.codepoint >= 0xDC00 and c2.codepoint <= 0xDFFF

    -- Get rid of the trailing half of the pair where found.
    delete c2
    from @chars c1
    inner join @chars c2 on c1.ix = c2.ix - 1
    where c1.codepoint >= 0x10000

    -- Now we UTF-8 encode each codepoint.
    -- Lone surrogate halves will still be here,
    -- so they will be encoded as if they were not surrogate pairs.
    update c
    set utf8 = case
        -- One-byte encodings (modified UTF-8 outputs zero as a two-byte encoding)
        when codepoint <= 0x7f and (@modified = 0 or codepoint <> 0)
        then cast(substring(cast(codepoint as binary(4)), 4, 1) as varbinary(6))
        -- Two-byte encodings
        when codepoint <= 0x07ff
        then substring(cast((0x00C0 + ((codepoint/0x40) & 0x1f)) as binary(4)), 4, 1)
           + substring(cast((0x0080 + (codepoint & 0x3f)) as binary(4)), 4, 1)
        -- Three-byte encodings
        when codepoint <= 0x0ffff
        then substring(cast((0x00E0 + ((codepoint/0x1000) & 0x0f)) as binary(4)), 4, 1)
           + substring(cast((0x0080 + ((codepoint/0x40) & 0x3f)) as binary(4)), 4, 1)
           + substring(cast((0x0080 + (codepoint & 0x3f)) as binary(4)), 4, 1)
        -- Four-byte encodings
        when codepoint <= 0x1FFFFF
        then substring(cast((0x00F0 + ((codepoint/0x00040000) & 0x07)) as binary(4)), 4, 1)
           + substring(cast((0x0080 + ((codepoint/0x1000) & 0x3f)) as binary(4)), 4, 1)
           + substring(cast((0x0080 + ((codepoint/0x40) & 0x3f)) as binary(4)), 4, 1)
           + substring(cast((0x0080 + (codepoint & 0x3f)) as binary(4)), 4, 1)
        end
    from @chars c

    -- Finally concatenate them all and return.
    declare @ret varbinary(max)
    set @ret = cast('' as varbinary(max))
    select @ret = @ret + utf8 from @chars c order by ix
    return @ret
end
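The same three steps (split into UTF-16 code units, combine surrogate pairs, encode each code point by hand) can be sketched in Python and checked against a built-in UTF-8 encoder. This is only an illustrative model of the approach, not a port of the UDF; the function names are hypothetical, and lone surrogates are passed through as three-byte sequences in the same spirit as the UDF's comment.

```python
def utf16_units(text):
    """Split a string into 16-bit UTF-16 code units (like one row per NCHAR)."""
    data = text.encode("utf-16-le", "surrogatepass")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def combine_surrogates(units):
    """Merge each high/low surrogate pair into one supplementary code point."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        is_pair = (
            0xD800 <= u <= 0xDBFF
            and i + 1 < len(units)
            and 0xDC00 <= units[i + 1] <= 0xDFFF
        )
        if is_pair:
            # Each half carries 10 bits of the value above 0x10000.
            out.append(0x10000 + ((u & 0x3FF) << 10) + (units[i + 1] & 0x3FF))
            i += 2
        else:
            out.append(u)
            i += 1
    return out


def encode_code_point(cp, modified=False):
    """UTF-8 encode one code point; modified UTF-8 writes U+0000 as C0 80."""
    if cp <= 0x7F and not (modified and cp == 0):
        return bytes([cp])
    if cp <= 0x7FF:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        return bytes([0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)])
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


def nchar_to_utf8(text, modified=False):
    """Model of the UDF: code units -> code points -> concatenated UTF-8 bytes."""
    code_points = combine_surrogates(utf16_units(text))
    return b"".join(encode_code_point(cp, modified) for cp in code_points)


for sample in ["abc", "\u00e9", "\u20ac", "\U0001F600"]:
    print(sample.encode("unicode_escape"), nchar_to_utf8(sample) == sample.encode("utf-8"))
```

A character outside the Basic Multilingual Plane, such as U+1F600, exercises the surrogate-pair step, so it is a useful test input for the SQL function as well.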

Published: 2023-10-22 22:49:30
Link: https://www.elefans.com/category/jswz/34/1518959.html
Tags: strings, hash values
