Another class instead of SHA1Managed for making checksums shorter than 128 bytes

Updated: 2024-10-22 16:25:56

I have a table with one column (AbsoluteUrl NVARCHAR(2048)) that I want to query, and comparing every record against my own string takes a long time. The table has at least 1,000,000 records.

I think a better solution would be to generate a checksum for each AbsoluteUrl and compare checksums instead of comparing the AbsoluteUrl column, so I use the method below to generate the checksum. However, I would like another class that produces checksums shorter than 128 bytes.

    public static byte[] GenerateChecksumAsByte(string content)
    {
        var buffer = Encoding.UTF8.GetBytes(content);
        return new SHA1Managed().ComputeHash(buffer);
    }

Is this approach good for my work?

UPDATE

Following the answers, I want to explain in more depth. I'm working on a very simple web search engine. Briefly: once all the URLs on a web page have been extracted (the collection of found URLs), I index them into the Urls table.

    UrlId        uniqueidentifier  Not Null  Primary Key (Clustered Index)
    AbsoluteUrl  nvarchar(2048)    Not Null
    Checksum     varbinary(128)    Not Null

So I first search the table to see whether the same URL has already been indexed. If not, I create a new record.

    public Url Get(byte[] checksum)
    {
        // Or query by the AbsoluteUrl field instead
        return _dataContext.Urls.SingleOrDefault(url => url.Checksum == checksum);
    }

And the Save method:

    public void Save(Url url)
    {
        if (url == null)
            throw new ArgumentNullException("url");

        var origin = _dataContext.Urls.GetOriginalEntityState(url);
        if (origin == null)
        {
            _dataContext.Urls.Attach(url);
            _dataContext.Refresh(RefreshMode.KeepCurrentValues, url);
        }
        else
            _dataContext.Urls.InsertOnSubmit(url);

        _dataContext.SubmitChanges();
    }

For example, if I find 2000 URLs on one page, I have to search 2000 times.

Best answer

No, this is not a good approach.

A million records is no big deal for an indexed field. On the other hand, any checksum or hash you generate can produce false positives: by the pigeonhole principle, distinct URLs must sometimes share a checksum, and the birthday paradox means this happens sooner than intuition suggests. A bigger checksum reduces this chance but does not eliminate it, and it slows things down to the point where there is no speed gain.
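
To put rough numbers on that trade-off, the standard birthday-bound approximation estimates the probability of at least one collision among n distinct URLs hashed to b bits. This is a back-of-the-envelope sketch that assumes the hash behaves like a uniform random function. Here n = 10^6 is the table size from the question, and the bit widths are illustrative choices (160 bits being the size of a SHA-1 digest).

```latex
\begin{align*}
p(n, b) &\approx 1 - \exp\!\left(-\frac{n(n-1)}{2^{\,b+1}}\right) \\
p(10^{6}, 32) &\approx 1 - e^{-116} \approx 1 \\
p(10^{6}, 64) &\approx 2.7 \times 10^{-8} \\
p(10^{6}, 160) &\approx 3.4 \times 10^{-37}
\end{align*}
```

In other words, widening the checksum shrinks the chance of a collision rapidly without ever bringing it to zero, so a match on the checksum alone is not proof that two URLs are equal.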

Just slap an index on the field and see what happens.
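
On the literal size question, the method in the question does not produce 128 bytes. A SHA-1 digest is 20 bytes (160 bits) and an MD5 digest is 16 bytes, so the varbinary(128) column is wider than either needs. The following is a minimal sketch that only prints the digest lengths, assuming .NET's System.Security.Cryptography. The class and method names (ChecksumLengths, Sha1Of, Md5Of) and the example URL are hypothetical and not from the original post.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper for illustration only; not part of the original post.
public static class ChecksumLengths
{
    // SHA-1 digest of the UTF-8 bytes of the input: always 20 bytes.
    public static byte[] Sha1Of(string content)
    {
        using (var sha1 = SHA1.Create())
        {
            return sha1.ComputeHash(Encoding.UTF8.GetBytes(content));
        }
    }

    // MD5 digest of the UTF-8 bytes of the input: always 16 bytes.
    public static byte[] Md5Of(string content)
    {
        using (var md5 = MD5.Create())
        {
            return md5.ComputeHash(Encoding.UTF8.GetBytes(content));
        }
    }

    public static void Main()
    {
        const string url = "https://example.com/some/page"; // made-up URL
        Console.WriteLine(Sha1Of(url).Length); // 20
        Console.WriteLine(Md5Of(url).Length);  // 16
    }
}
```

A shorter digest does not remove the false-positive caveat in the answer. If a checksum column were used at all, equal checksums would still have to be confirmed by comparing AbsoluteUrl itself.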

Published: 2023-08-07 09:25:00

Source: https://www.elefans.com/category/jswz/34/1463783.html