是否有用于 Delphi 2010 字符串 (UnicodeString) 的 Boyer

编程入门 行业动态 更新时间:2024-10-13 12:18:26
本文介绍了是否有用于 Delphi 2010 字符串 (UnicodeString) 的 Boyer-Moore 字符串搜索和快速搜索和替换功能以及快速字符串计数?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧! 问题描述

我需要三个快速处理大字符串的函数:快速搜索、快速搜索和替换以及快速计算字符串中的子字符串.

I need three fast-on-large-strings functions: fast search, fast search and replace, and fast count of substrings in a string.

我在 C++ 和 Python 中遇到过 Boyer-Moore 字符串搜索,但我发现的唯一用于实现快速搜索和替换的 Delphi Boyer-Moore 算法是 Peter Morris 的 FastStrings 的一部分,以前是 DroopyEyes 软件,他的网站和电子邮件不再有效.

I have run into Boyer-Moore string searches in C++ and Python, but the only Delphi Boyer-Moore algorithm used to implement fast search and replace that I have found is part of the FastStrings by Peter Morris, formerly of DroopyEyes software, and his website and email are no longer working.

我已经移植了 FastStrings 以便在 Delphi 2009/2010 中为 AnsiStrings 工作得很好,其中一个字节等于一个 AnsiChar,但使它们也与 Delphi 2010 中的字符串 (UnicodeString) 一起使用似乎很重要.

I have already ported FastStrings forward to work great for AnsiStrings in Delphi 2009/2010, where a byte is equal to one AnsiChar, but making them also work with the String (UnicodeString) in Delphi 2010 appears non-trivial.

使用这种 Boyer-Moore 算法,应该可以轻松进行不区分大小写的搜索,以及不区分大小写的搜索和替换,无需任何临时字符串(使用 StrUpper 等),并且无需调用较慢的 Pos()当需要对同一文本进行重复搜索时,比 Boyer-Moore 搜索要好.

Using this Boyer-Moore algorithm, it should be possible to easily do case insensitive searches, as well as case-insensitive search and replace, without any temporary string (using StrUpper etc), and without calling Pos() which is slower than Boyer-Moore searching when repeated searches over the same text are required.

(我有一个部分解决方案,作为这个问题的答案,它几乎 100% 完成,它甚至有一个快速的字符串替换功能.我相信它一定有错误,特别是因为它假装要成为 Unicode 能力,一定是由于未履行 Unicode 承诺而出现故障.)

( I have a partial solution, written as an answer to this question, it is almost 100% complete, it even has a fast string replace function. I believe it MUST have bugs, and especially think that since it pretends to be Unicode capable that it must be that there are glitches due to unfulfilled Unicode promises. )

(Edit2:有趣且出乎意料的结果;堆栈上 unicode 代码点表的大堆栈大小 - 下面代码中的 SkipTable 严重阻碍了您可以在此处进行的双赢优化的数量unicode string boyer-moore 字符串搜索.感谢 Florent Ouchet 指出我应该立即注意到的内容.)

( Interesting and unexpected result; The large stack size of a unicode code-point table on the stack - SkipTable in the code below puts a serious damper on the amount of win-win-optimization you can do here in a unicode string boyer-moore string search. Thanks to Florent Ouchet for pointing out what I should have noticed immediately.)

推荐答案

这个答案现在已经完成并且适用于区分大小写的模式,但不适用于不区分大小写的模式,并且可能还有其他错误,因为它不是很好的单元经过测试,可能会进一步优化,例如我重复了本地函数 __SameChar 而不是使用比较函数回调,这本来会更快,实际上,允许用户为所有这些传递比较函数对 Unicode 来说非常棒想要提供一些额外逻辑的用户(某些语言的等效 Unicode 字形集).

This answer is now complete and works for case sensitive mode, but does not work for case insensitive mode, and probably has other bugs too, since it's not well unit tested, and could probably be optimized further, for example I repeated the local function __SameChar instead of using a comparison function callback which would have been faster, and actually, allowing the user to pass in a comparison function for all these would be great for Unicode users who want to provide some extra logic (equivalent sets of Unicode glyphs for some languages).

基于 Dorin Dominica 的代码,我构建了以下内容.

Based on Dorin Dominica's code, I built the following.

{ _FindStringBoyer: Boyer-Moore search algorith using regular String instead of AnsiSTring, and no ASM. Credited to Dorin Duminica. } function _FindStringBoyer(const sString, sPattern: string; const bCaseSensitive: Boolean = True; const fromPos: Integer = 1): Integer; function __SameChar(StringIndex, PatternIndex: Integer): Boolean; begin if bCaseSensitive then Result := (sString[StringIndex] = sPattern[PatternIndex]) else Result := (CompareText(sString[StringIndex], sPattern[PatternIndex]) = 0); end; // function __SameChar(StringIndex, PatternIndex: Integer): Boolean; var SkipTable: array [Char] of Integer; LengthPattern: Integer; LengthString: Integer; Index: Integer; kIndex: Integer; LastMarker: Integer; Large: Integer; chPattern: Char; begin if fromPos < 1 then raise Exception.CreateFmt('Invalid search start position: %d.', [fromPos]); LengthPattern := Length(sPattern); LengthString := Length(sString); for chPattern := Low(Char) to High(Char) do SkipTable[chPattern] := LengthPattern; for Index := 1 to LengthPattern -1 do SkipTable[sPattern[Index]] := LengthPattern - Index; Large := LengthPattern + LengthString + 1; LastMarker := SkipTable[sPattern[LengthPattern]]; SkipTable[sPattern[LengthPattern]] := Large; Index := fromPos + LengthPattern -1; Result := 0; while Index <= LengthString do begin repeat Index := Index + SkipTable[sString[Index]]; until Index > LengthString; if Index <= Large then Break else Index := Index - Large; kIndex := 1; while (kIndex < LengthPattern) and __SameChar(Index - kIndex, LengthPattern - kIndex) do Inc(kIndex); if kIndex = LengthPattern then begin // Found, return. Result := Index - kIndex + 1; Index := Index + LengthPattern; exit; end else begin if __SameChar(Index, LengthPattern) then Index := Index + LastMarker else Index := Index + SkipTable[sString[Index]]; end; // if kIndex = LengthPattern then begin end; // while Index <= LengthString do begin end; { Written by Warren, using the above code as a starter, we calculate the SkipTable once, and then count the number of instances of a substring inside the main string, at a much faster rate than we could have done otherwise. Another thing that would be great is to have a function that returns an array of find-locations, which would be way faster to do than repeatedly calling Pos. } function _StringCountBoyer(const aSourceString, aFindString : String; Const CaseSensitive : Boolean = TRUE) : Integer; var foundPos:Integer; fromPos:Integer; Limit:Integer; guard:Integer; SkipTable: array [Char] of Integer; LengthPattern: Integer; LengthString: Integer; Index: Integer; kIndex: Integer; LastMarker: Integer; Large: Integer; chPattern: Char; function __SameChar(StringIndex, PatternIndex: Integer): Boolean; begin if CaseSensitive then Result := (aSourceString[StringIndex] = aFindString[PatternIndex]) else Result := (CompareText(aSourceString[StringIndex], aFindString[PatternIndex]) = 0); end; // function __SameChar(StringIndex, PatternIndex: Integer): Boolean; begin result := 0; foundPos := 1; fromPos := 1; Limit := Length(aSourceString); guard := Length(aFindString); Index := 0; LengthPattern := Length(aFindString); LengthString := Length(aSourceString); for chPattern := Low(Char) to High(Char) do SkipTable[chPattern] := LengthPattern; for Index := 1 to LengthPattern -1 do SkipTable[aFindString[Index]] := LengthPattern - Index; Large := LengthPattern + LengthString + 1; LastMarker := SkipTable[aFindString[LengthPattern]]; SkipTable[aFindString[LengthPattern]] := Large; while (foundPos>=1) and (fromPos < Limit) and (Index<Limit) do begin Index := fromPos + LengthPattern -1; if Index>Limit then break; kIndex := 0; while Index <= LengthString do begin repeat Index := Index + SkipTable[aSourceString[Index]]; until Index > LengthString; if Index <= Large then Break else Index := Index - Large; kIndex := 1; while (kIndex < LengthPattern) and __SameChar(Index - kIndex, LengthPattern - kIndex) do Inc(kIndex); if kIndex = LengthPattern then begin // Found, return. //Result := Index - kIndex + 1; Index := Index + LengthPattern; fromPos := Index; Inc(Result); break; end else begin if __SameChar(Index, LengthPattern) then Index := Index + LastMarker else Index := Index + SkipTable[aSourceString[Index]]; end; // if kIndex = LengthPattern then begin end; // while Index <= LengthString do begin end; end;

这真是一个不错的算法,因为:

This is really a nice Algorithm, because:

  • 以这种方式计算字符串 Y 中子字符串 X 的实例要快得多,非常棒.
  • 仅仅替换 Pos(),_FindStringBoyer() 比 FastCode 项目人员贡献给 Delphi 的 Pos() 的纯 asm 版本更快,目前用于 Pos,如果你需要不区分大小写,你可以想象一下当我们不必在 100 兆字节的字符串上调用 UpperCase 时的性能提升.(好吧,你的字符串不会那么大.但是,高效的算法仍然是一种美.)

好吧,我写了一个 Boyer-Moore 风格的字符串替换:

Okay I wrote a String Replace in Boyer-Moore style:

function _StringReplaceBoyer(const aSourceString, aFindString,aReplaceString : String; Flags: TReplaceFlags) : String; var errors:Integer; fromPos:Integer; Limit:Integer; guard:Integer; SkipTable: array [Char] of Integer; LengthPattern: Integer; LengthString: Integer; Index: Integer; kIndex: Integer; LastMarker: Integer; Large: Integer; chPattern: Char; CaseSensitive:Boolean; foundAt:Integer; lastFoundAt:Integer; copyStartsAt:Integer; copyLen:Integer; function __SameChar(StringIndex, PatternIndex: Integer): Boolean; begin if CaseSensitive then Result := (aSourceString[StringIndex] = aFindString[PatternIndex]) else Result := (CompareText(aSourceString[StringIndex], aFindString[PatternIndex]) = 0); end; // function __SameChar(StringIndex, PatternIndex: Integer): Boolean; begin result := ''; lastFoundAt := 0; fromPos := 1; errors := 0; CaseSensitive := rfIgnoreCase in Flags; Limit := Length(aSourceString); guard := Length(aFindString); Index := 0; LengthPattern := Length(aFindString); LengthString := Length(aSourceString); for chPattern := Low(Char) to High(Char) do SkipTable[chPattern] := LengthPattern; for Index := 1 to LengthPattern -1 do SkipTable[aFindString[Index]] := LengthPattern - Index; Large := LengthPattern + LengthString + 1; LastMarker := SkipTable[aFindString[LengthPattern]]; SkipTable[aFindString[LengthPattern]] := Large; while (fromPos>=1) and (fromPos <= Limit) and (Index<=Limit) do begin Index := fromPos + LengthPattern -1; if Index>Limit then break; kIndex := 0; foundAt := 0; while Index <= LengthString do begin repeat Index := Index + SkipTable[aSourceString[Index]]; until Index > LengthString; if Index <= Large then Break else Index := Index - Large; kIndex := 1; while (kIndex < LengthPattern) and __SameChar(Index - kIndex, LengthPattern - kIndex) do Inc(kIndex); if kIndex = LengthPattern then begin foundAt := Index - kIndex + 1; Index := Index + LengthPattern; //fromPos := Index; fromPos := (foundAt+LengthPattern); if lastFoundAt=0 then begin copyStartsAt := 1; copyLen := foundAt-copyStartsAt; end else begin copyStartsAt := lastFoundAt+LengthPattern; copyLen := foundAt-copyStartsAt; end; if (copyLen<=0)or(copyStartsAt<=0) then begin Inc(errors); end; Result := Result + Copy(aSourceString, copyStartsAt, copyLen ) + aReplaceString; lastFoundAt := foundAt; if not (rfReplaceAll in Flags) then fromPos := 0; // break out of outer while loop too! break; end else begin if __SameChar(Index, LengthPattern) then Index := Index + LastMarker else Index := Index + SkipTable[aSourceString[Index]]; end; // if kIndex = LengthPattern then begin end; // while Index <= LengthString do begin end; if (lastFoundAt=0) then begin // nothing was found, just return whole original string Result := aSourceString; end else if (lastFoundAt+LengthPattern < Limit) then begin // the part that didn't require any replacing, because nothing more was found, // or rfReplaceAll flag was not specified, is copied at the // end as the final step. copyStartsAt := lastFoundAt+LengthPattern; copyLen := Limit; { this number can be larger than needed to be, and it is harmless } Result := Result + Copy(aSourceString, copyStartsAt, copyLen ); end; end;

好的,问题:这个的堆栈占用:

Okay, problem: Stack footprint of this:

var skiptable : array [Char] of Integer; // 65536*4 bytes stack usage on Unicode delphi

再见 CPU 地狱,你好堆栈地狱.如果我使用动态数组,那么我必须在运行时调整它的大小.所以这个东西基本上很快,因为你计算机上的虚拟内存系统不会在堆栈上的 256K 时闪烁,但这并不总是最佳的代码段.尽管如此,我的电脑不会在像这样的大堆栈东西上闪烁.它不会成为 Delphi 标准库的默认设置,也不会在未来赢得任何 fastcode 挑战,因为它有点占用空间.我认为重复搜索是上面代码应该写成一个类的情况,而skiptable应该是那个类中的一个数据字段.然后您可以构建一次boyer-moore 表,随着时间的推移,如果字符串不变,则重复使用该对象进行快速查找.

Goodbye CPU hell, Hello stack hell. If I go for a dynamic array, then I have to resize it at runtime. So this thing is basically fast, because the Virtual Memory system on your computer doesn't blink at 256K going on the stack, but this is not always an optimal piece of code. Nevertheless my PC doesn't blink at big stack stuff like this. It's not going to become a Delphi standard library default or win any fastcode challenge in the future, with that kinda footprint. I think that repeated searches are a case where the above code should be written as a class, and the skiptable should be a data field in that class. Then you can build the boyer-moore table once, and over time, if the string is invariant, repeatedly use that object to do fast lookups.

更多推荐

是否有用于 Delphi 2010 字符串 (UnicodeString) 的 Boyer

本文发布于:2023-11-30 09:08:19,感谢您对本站的认可!
本文链接:https://www.elefans.com/category/jswz/34/1649466.html
版权声明:本站内容均来自互联网,仅供演示用,请勿用于商业和其他非法用途。如果侵犯了您的权益请与我们联系,我们将在24小时内删除。
本文标签:字符串   Delphi   UnicodeString   Boyer

发布评论

评论列表 (有 0 条评论)
草根站长

>www.elefans.com

编程频道|电子爱好者 - 技术资讯及电子产品介绍!