Language: VB.NET. File size: 1 GB or so.
Encoding of the text file: UTF-8 (so each character is represented by a varying number of bytes).
Collation: UnicodeCI (when several characters are essentially the same, the most popular version is treated as the unique one). I think I know how to handle this one.
Because each character takes a different number of bytes and each line has a different number of characters, the number of bytes per line also varies.
I suppose we have to compute a hash for each line. We also need to store the buffer location where each line starts. Then we have to compare the buffers to check whether the same line shows up more than once.
Are there any built-in functions best suited to this?
Solution: Depending on how long the lines are, you may be able to compute an MD5 hash value for each line and store that in a HashSet:
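    Imports System.IO
    Imports System.Linq
    Imports System.Text

    Using sr As New StreamReader("myFile", Encoding.UTF8)
        Dim lines As New HashSet(Of String)
        Dim md5 As New Security.Cryptography.MD5Cng()
        ' StreamReader buffers its input, so BaseStream.Position runs ahead of
        ' the lines actually returned. Test ReadLine's result (Nothing at end
        ' of file) instead of comparing Position with Length.
        Dim l As String = sr.ReadLine()
        While l IsNot Nothing
            ' Hex-encode the MD5 digest of the line's UTF-8 bytes.
            Dim hash As String = String.Join(String.Empty, md5.ComputeHash(Encoding.UTF8.GetBytes(l)).Select(Function(x) x.ToString("x2")))
            ' HashSet.Add returns False when the value is already present.
            If Not lines.Add(hash) Then
                'Lines are not unique
                Exit While
            End If
            l = sr.ReadLine()
        End While
    End Using

Untested, but this may be fast enough for your needs. I can't think of anything much faster that still maintains some semblance of conciseness :)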
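The question also asks for case-insensitive matching, which hashing the raw UTF-8 bytes does not give you. Below is a rough sketch of one way to fold that in, not a tested implementation. It assumes the invariant culture with CompareOptions.IgnoreCase is a close enough stand-in for the "UnicodeCI" collation (the exact rules are not given in the question), and the names UniqueLineCheck and HasDuplicateLines are made up for illustration. It hashes each line's sort key instead of its raw bytes, and it stores the 16-byte MD5 digest as a Guid rather than a 32-character hex string to keep the set smaller.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Globalization
Imports System.IO
Imports System.Security.Cryptography
Imports System.Text

Module UniqueLineCheck
    ' Hypothetical helper: returns True as soon as a line repeats,
    ' treating lines that differ only by case as the same line.
    Function HasDuplicateLines(filePath As String) As Boolean
        ' Assumption: invariant-culture, ignore-case comparison approximates
        ' the collation described in the question.
        Dim comparer As CompareInfo = CultureInfo.InvariantCulture.CompareInfo
        Dim seen As New HashSet(Of Guid)()

        Using sr As New StreamReader(filePath, Encoding.UTF8)
            Using hasher As MD5 = MD5.Create()
                Dim currentLine As String = sr.ReadLine()
                While currentLine IsNot Nothing
                    ' Strings that compare equal under these options
                    ' produce identical sort-key bytes.
                    Dim keyBytes As Byte() = comparer.GetSortKey(currentLine, CompareOptions.IgnoreCase).KeyData
                    ' An MD5 digest is 16 bytes, exactly the size a Guid holds.
                    Dim digest As New Guid(hasher.ComputeHash(keyBytes))
                    ' Add returns False when the digest was already seen.
                    If Not seen.Add(digest) Then Return True
                    currentLine = sr.ReadLine()
                End While
            End Using
        End Using

        Return False
    End Function
End Module
```

Calling HasDuplicateLines("myFile") would then replace the loop above. Building a sort key for every line costs extra CPU time, so if plain byte-for-byte uniqueness is enough, hashing the UTF-8 bytes directly is likely the cheaper choice.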