有没有一个Boyer

编程入门行业动态更新时间:2024-10-13 10:24:50

本文介绍了有没有一个Boyer-Moore字符串搜索和快速搜索和替换功能以及Delphi 2010 String（UnicodeString）的快速字符串计数？的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！问题描述

我需要三个快速的大字符串功能：快速搜索，快速搜索和替换，以及字符串中子字符串的快速计数。

我已经运行使用C ++和Python中的Boyer-Moore字符串搜索，但是用于实现快速搜索和替换的唯一Delphi Boyer-Moore算法是Peter Morris以前的DroopyEyes软件的FastStrings的一部分，他的网站和电子邮件是不再工作

我已经将 FastStrings 转发在Delphi 2009/2010中为AnsiStrings工作很好，其中一个字节等于一个AnsiChar，但是它们也与Delphi 2010中的String（UnicodeString）一起工作似乎并不平凡。

使用这个Boyer-Moore算法，应该可以轻松地进行不区分大小写的搜索，以及不区分大小写的搜索和替换，而不需要任何临时字符串（使用StrUpper等等），并且不需要调用比Boyer-Moore搜索慢的Pos（）。当需要在同一文本上重复搜索时。

（编辑：我有一个部分解决方案作为这个问题的答案，它几乎完成了100％，甚至有一个快速的字符串替换功能，我相信它必须有错误，特别是认为，因为它假装是Unicode的能力，它必须是有（/）

（Edit2：有意思和意想不到的结果;堆栈上的unicode代码表的大堆栈大小 - 代码中的SkipTable下面给出了一个双重优化的数量，在这里你可以在一个unicode字符串boyer-moore字符串搜索中做得很好。感谢Florent Ouchet指出wh我应该立即注意到。）

解决方案

这个答案现在完成，除了代码不好（Unit）测试，并且可能会进一步优化，例如，我重复本地函数__SameChar，而不是使用比较快的函数回调，实际上允许用户通过比较函数对所有这些将是非常好的Unicode用户想提供一些额外的逻辑（某些语言的等效的Unicode字形集）。然而，我有一个基于Boyer-Moore的不区分大小写的Pos解决方案，并且更快的子串计数，现在也可以搜索和替换：

根据Dorin Dominica的代码，我建立以下内容。

{_FindStringBoyer：使用常规String而不是AnsiSTring的Boyer-Moore搜索算法，没有ASM 。归功于Dorin Duminica。 } 函数_FindStringBoyer（const sString，sPattern：string; const bCaseSensitive：Boolean = True; const fromPos：Integer = 1）：Integer; 函数__SameChar（StringIndex，PatternIndex：Integer）：Boolean; begin 如果bCaseSensitive然后结果：=（sString [StringIndex] = sPattern [PatternIndex]） else 结果：=（CompareText（sString [StringIndex] sPattern [PatternIndex]）= 0）; 结束// function __SameChar（StringIndex，PatternIndex：Integer）：Boolean; var SkipTable：Array [Char] of Integer; LengthPattern：Integer; LengthString：Integer; 索引：整数; kIndex：整数; LastMarker：Integer; 大：整数; chPattern：Char; begin 如果fromPos< 1 then raise Exception.CreateFmt（'Invalid search start position：％d。'，[fromPos]）; LengthPattern：= Length（sPattern）; LengthString：= Length（sString）; for chPattern：= Low（Char）to High（Char）do SkipTable [chPattern]：= LengthPattern; for Index：= 1 to LengthPattern -1 do SkipTable [sPattern [Index]]：= LengthPattern - Index; Large：= LengthPattern + LengthString + 1; LastMarker：= SkipTable [sPattern [LengthPattern]]; SkipTable [sPattern [LengthPattern]]：= Large; 索引：= fromPos + LengthPattern -1; 结果：= 0; while Index< = LengthString do begin repeat 索引：= Index + SkipTable [sString [Index]]; until Index> LengthString; 如果Index< = Large then Break else 索引：=索引 - 大; kIndex：= 1; while（kIndex< LengthPattern）和__SameChar（Index - kIndex，LengthPattern - kIndex）do Inc（kIndex）; 如果kIndex = LengthPattern然后开始 //找到，返回。结果：=索引 - kIndex + 1; 索引：= Index + LengthPattern; 退出; end else begin 如果__SameChar（Index，LengthPattern）然后索引：=索引+ LastMarker else 索引：=索引+ SkipTable [sString [索引]] ; 结束//如果kIndex = LengthPattern然后开始 end; // while Index< = LengthString do begin end; {由Warren编写，使用上述代码作为起始点，我们计算一次SkipTable，然后计算主字符串内a子串的实例数，速度快得多比我们可以这样做。另外一件事就是有一个返回一个find-locations数组的函数，这样做比反复调用Pos要快。 } 函数_StringCountBoyer（const aSourceString，aFindString：String; Const CaseSensitive：Boolean = TRUE）：Integer; var foundPos：Integer; fromPos：Integer; 限制：整数; guard：整数; SkipTable：Array [Char] of Integer; LengthPattern：Integer; LengthString：Integer; 索引：整数; kIndex：整数; LastMarker：Integer; 大：整数; chPattern：Char; 函数__SameChar（StringIndex，PatternIndex：Integer）：Boolean; begin 如果CaseSensitive然后结果：=（aSourceString [StringIndex] = aFindString [PatternIndex]） else 结果：=（CompareText（aSourceString [StringIndex] aFindString [PatternIndex]）= 0）; 结束// function __SameChar（StringIndex，PatternIndex：Integer）：Boolean; begin result：= 0; foundPos：= 1; fromPos：= 1; 限制：=长度（aSourceString）; guard：= Length（aFindString）; 索引：= 0; LengthPattern：= Length（aFindString）; LengthString：= Length（aSourceString）; for chPattern：= Low（Char）to High（Char）do SkipTable [chPattern]：= LengthPattern; for Index：= 1 to LengthPattern -1 do SkipTable [aFindString [Index]]：= LengthPattern - Index; Large：= LengthPattern + LengthString + 1; LastMarker：= SkipTable [aFindString [LengthPattern]]; SkipTable [aFindString [LengthPattern]]：= Large; while（foundPos> = 1）和（fromPos< Limit）和（Index< Limit）开始索引：= fromPos + LengthPattern -1; if Index> Limit then break; kIndex：= 0; while Index< = LengthString do begin repeat 索引：= Index + SkipTable [aSourceString [Index]]; until Index> LengthString; 如果Index< = Large then Break else 索引：=索引 - 大; kIndex：= 1; while（kIndex< LengthPattern）和__SameChar（Index - kIndex，LengthPattern - kIndex）do Inc（kIndex）; 如果kIndex = LengthPattern然后开始 //找到，返回。 //结果：=索引 - kIndex + 1; 索引：= Index + LengthPattern; fromPos：= Index; Inc（Result）; break; end else begin 如果__SameChar（Index，LengthPattern）然后索引：=索引+ LastMarker else 索引：=索引+ SkipTable [aSourceString [索引]] ; 结束//如果kIndex = LengthPattern然后开始 end; // while Index< = LengthString do begin end; 结束

这是一个很好的算法，因为：

为了更好地计算字符串Y中的子字符串X的实例，这个方法是非常好的。

只有替换Pos（）_FindStringBoyer（）比纯ASM版本的Pos（）由FastCode项目人员贡献给Delphi，目前用于Pos，如果您需要不区分大小写，您可以想像当我们不必调用UpperCase时，性能提升一个100兆字节的字符串。（好的，你的字符串不会很大，但仍然有效的算法是美的事情。）

好的我在Boyer-Moore风格中写了一个String Replace：

function _StringReplaceBoyer（const aSourceString，aFindString，aReplaceString：String; Flags：TReplaceFlags ）：String; var errors：Integer; fromPos：Integer; 限制：整数; guard：整数; SkipTable：Array [Char] of Integer; LengthPattern：Integer; LengthString：Integer; 索引：整数; kIndex：整数; LastMarker：Integer; 大：整数; chPattern：Char CaseSensitive：Boolean; foundAt：Integer; lastFoundAt：Integer; copyStartsAt：Integer; copyLen：整数; 函数__SameChar（StringIndex，PatternIndex：Integer）：Boolean; begin 如果CaseSensitive然后结果：=（aSourceString [StringIndex] = aFindString [PatternIndex]） else 结果：=（CompareText（aSourceString [StringIndex] aFindString [PatternIndex]）= 0）; 结束// function __SameChar（StringIndex，PatternIndex：Integer）：Boolean; begin result：=''; lastFoundAt：= 0; fromPos：= 1; errors：= 0; CaseSensitive：= rfIgnoreCase in Flags; 限制：=长度（aSourceString）; guard：= Length（aFindString）; 索引：= 0; LengthPattern：= Length（aFindString）; LengthString：= Length（aSourceString）; for chPattern：= Low（Char）to High（Char）do SkipTable [chPattern]：= LengthPattern; for Index：= 1 to LengthPattern -1 do SkipTable [aFindString [Index]]：= LengthPattern - Index; Large：= LengthPattern + LengthString + 1; LastMarker：= SkipTable [aFindString [LengthPattern]]; SkipTable [aFindString [LengthPattern]]：= Large; while（fromPos> = 1）和（fromPos <= Limit）和（Index< = Limit）开始索引：= fromPos + LengthPattern -1; if Index> Limit then break; kIndex：= 0; foundAt：= 0; while Index< = LengthString do begin repeat 索引：= Index + SkipTable [aSourceString [Index]]; until Index> LengthString; 如果Index< = Large then Break else 索引：=索引 - 大; kIndex：= 1; while（kIndex< LengthPattern）和__SameChar（Index - kIndex，LengthPattern - kIndex）do Inc（kIndex）; 如果kIndex = LengthPattern然后开始 foundAt：=索引 - kIndex + 1; 索引：= Index + LengthPattern; // fromPos：= Index; fromPos：=（foundAt + LengthPattern）; 如果lastFoundAt = 0，则开始 copyStartsAt：= 1; copyLen：= foundAt-copyStartsAt; end else begin copyStartsAt：= lastFoundAt + LengthPattern; copyLen：= foundAt-copyStartsAt; 结束如果（copyLen <= 0）或（copyStartsAt <= 0）则开始 Inc（errors）; 结束结果：= Result + Copy（aSourceString，copyStartsAt，copyLen）+ aReplaceString; lastFoundAt：= foundAt; 如果没有（rfReplaceAll在Flags）然后 fromPos：= 0; //打破外圈while循环！ break; end else begin 如果__SameChar（Index，LengthPattern）然后索引：=索引+ LastMarker else 索引：=索引+ SkipTable [aSourceString [索引]] ; 结束//如果kIndex = LengthPattern然后开始 end; // while Index< = LengthString do begin end; if（lastFoundAt = 0）then begin //没有找到，只返回整个原始字符串结果：= aSourceString; end else if（lastFoundAt + LengthPattern< Limit）然后开始 //不需要替换的部分，因为没有找到更多的 //或rfReplaceAll标志未被指定，复制在 // end作为最后一步。 copyStartsAt：= lastFoundAt + LengthPattern; copyLen：=限制; {这个数字可以大于需要的，它是无害的} 结果：= Result + Copy（aSourceString，copyStartsAt，copyLen）; 结束 end;

好的，问题：堆栈的占用：

var skiptable：Array [Char] of Integer; // 65536 * 4字节堆栈使用Unicode delphi

再见CPU地狱，你好堆栈地狱。如果我去一个动态数组，那么我必须在运行时调整它的大小。所以这个东西基本上是快的，因为你的计算机上的虚拟内存系统不会在256K的堆栈上闪烁，但这并不总是最好的代码。然而，我的电脑不会像这样的大堆东西闪烁。这将不会成为Delphi标准库默认或将来赢得任何快速编码的挑战，具有这种足迹。我认为重复的搜索是上述代码应该写成一个类的情况，而skiptable应该是该类中的一个数据字段。然后，您可以一次构建boyer-moore表，并且随着时间的推移，如果字符串不变，则反复使用该对象进行快速查找。

I need three fast-on-large-strings functions: fast search, fast search and replace, and fast count of substrings in a string.

I have run into Boyer-Moore string searches in C++ and Python, but the only Delphi Boyer-Moore algorithm used to implement fast search and replace that I have found is part of the FastStrings by Peter Morris, formerly of DroopyEyes software, and his website and email are no longer working.

I have already ported FastStrings forward to work great for AnsiStrings in Delphi 2009/2010, where a byte is equal to one AnsiChar, but making them also work with the String (UnicodeString) in Delphi 2010 appears non-trivial.

Using this Boyer-Moore algorithm, it should be possible to easily do case insensitive searches, as well as case-insensitive search and replace, without any temporary string (using StrUpper etc), and without calling Pos() which is slower than Boyer-Moore searching when repeated searches over the same text are required.

(Edit: I have a partial solution, written as an answer to this question, it is almost 100% complete, it even has a fast string replace function. I believe it MUST have bugs, and especially think that since it pretends to be Unicode capable that it must be that there are glitches due to unfulfilled Unicode promises. )

(Edit2: Interesting and unexpected result; The large stack size of a unicode code-point table on the stack - SkipTable in the code below puts a serious damper on the amount of win-win-optimization you can do here in a unicode string boyer-moore string search. Thanks to Florent Ouchet for pointing out what I should have noticed immediately.)

解决方案

This answer is now complete except that the code is not well (Unit) tested, and could probably be optimized further, for example I repeated the local function __SameChar instead of using a comparison function callback which would have been faster, and actually, allowing the user to pass in a comparison function for all these would be great for Unicode users who want to provide some extra logic (equivalent sets of Unicode glyphs for some languages). I do however have a Boyer-Moore based solution for case-insensitive Pos, and faster substring counting, and now search and replace too:

Based on Dorin Dominica's code, I built the following.

{ _FindStringBoyer: Boyer-Moore search algorith using regular String instead of AnsiSTring, and no ASM. Credited to Dorin Duminica. } function _FindStringBoyer(const sString, sPattern: string; const bCaseSensitive: Boolean = True; const fromPos: Integer = 1): Integer; function __SameChar(StringIndex, PatternIndex: Integer): Boolean; begin if bCaseSensitive then Result := (sString[StringIndex] = sPattern[PatternIndex]) else Result := (CompareText(sString[StringIndex], sPattern[PatternIndex]) = 0); end; // function __SameChar(StringIndex, PatternIndex: Integer): Boolean; var SkipTable: array [Char] of Integer; LengthPattern: Integer; LengthString: Integer; Index: Integer; kIndex: Integer; LastMarker: Integer; Large: Integer; chPattern: Char; begin if fromPos < 1 then raise Exception.CreateFmt('Invalid search start position: %d.', [fromPos]); LengthPattern := Length(sPattern); LengthString := Length(sString); for chPattern := Low(Char) to High(Char) do SkipTable[chPattern] := LengthPattern; for Index := 1 to LengthPattern -1 do SkipTable[sPattern[Index]] := LengthPattern - Index; Large := LengthPattern + LengthString + 1; LastMarker := SkipTable[sPattern[LengthPattern]]; SkipTable[sPattern[LengthPattern]] := Large; Index := fromPos + LengthPattern -1; Result := 0; while Index <= LengthString do begin repeat Index := Index + SkipTable[sString[Index]]; until Index > LengthString; if Index <= Large then Break else Index := Index - Large; kIndex := 1; while (kIndex < LengthPattern) and __SameChar(Index - kIndex, LengthPattern - kIndex) do Inc(kIndex); if kIndex = LengthPattern then begin // Found, return. Result := Index - kIndex + 1; Index := Index + LengthPattern; exit; end else begin if __SameChar(Index, LengthPattern) then Index := Index + LastMarker else Index := Index + SkipTable[sString[Index]]; end; // if kIndex = LengthPattern then begin end; // while Index <= LengthString do begin end; { Written by Warren, using the above code as a starter, we calculate the SkipTable once, and then count the number of instances of a substring inside the main string, at a much faster rate than we could have done otherwise. Another thing that would be great is to have a function that returns an array of find-locations, which would be way faster to do than repeatedly calling Pos. } function _StringCountBoyer(const aSourceString, aFindString : String; Const CaseSensitive : Boolean = TRUE) : Integer; var foundPos:Integer; fromPos:Integer; Limit:Integer; guard:Integer; SkipTable: array [Char] of Integer; LengthPattern: Integer; LengthString: Integer; Index: Integer; kIndex: Integer; LastMarker: Integer; Large: Integer; chPattern: Char; function __SameChar(StringIndex, PatternIndex: Integer): Boolean; begin if CaseSensitive then Result := (aSourceString[StringIndex] = aFindString[PatternIndex]) else Result := (CompareText(aSourceString[StringIndex], aFindString[PatternIndex]) = 0); end; // function __SameChar(StringIndex, PatternIndex: Integer): Boolean; begin result := 0; foundPos := 1; fromPos := 1; Limit := Length(aSourceString); guard := Length(aFindString); Index := 0; LengthPattern := Length(aFindString); LengthString := Length(aSourceString); for chPattern := Low(Char) to High(Char) do SkipTable[chPattern] := LengthPattern; for Index := 1 to LengthPattern -1 do SkipTable[aFindString[Index]] := LengthPattern - Index; Large := LengthPattern + LengthString + 1; LastMarker := SkipTable[aFindString[LengthPattern]]; SkipTable[aFindString[LengthPattern]] := Large; while (foundPos>=1) and (fromPos < Limit) and (Index<Limit) do begin Index := fromPos + LengthPattern -1; if Index>Limit then break; kIndex := 0; while Index <= LengthString do begin repeat Index := Index + SkipTable[aSourceString[Index]]; until Index > LengthString; if Index <= Large then Break else Index := Index - Large; kIndex := 1; while (kIndex < LengthPattern) and __SameChar(Index - kIndex, LengthPattern - kIndex) do Inc(kIndex); if kIndex = LengthPattern then begin // Found, return. //Result := Index - kIndex + 1; Index := Index + LengthPattern; fromPos := Index; Inc(Result); break; end else begin if __SameChar(Index, LengthPattern) then Index := Index + LastMarker else Index := Index + SkipTable[aSourceString[Index]]; end; // if kIndex = LengthPattern then begin end; // while Index <= LengthString do begin end; end;

This is really a nice Algorithm, because:

it's way faster to count instances of substring X in string Y this way, magnificently so.
For merely replacing Pos() the _FindStringBoyer() is faster than the pure asm version of Pos() contributed to Delphi by FastCode project people, that is currently used for Pos, and if you need the case-insensitivity, you can imagine the performance boost when we don't have to call UpperCase on a 100 megabyte string. (Okay, your strings aren't going to be THAT big. But still, Efficient Algorithms are a Thing of Beauty.)

Okay I wrote a String Replace in Boyer-Moore style:

function _StringReplaceBoyer(const aSourceString, aFindString,aReplaceString : String; Flags: TReplaceFlags) : String; var errors:Integer; fromPos:Integer; Limit:Integer; guard:Integer; SkipTable: array [Char] of Integer; LengthPattern: Integer; LengthString: Integer; Index: Integer; kIndex: Integer; LastMarker: Integer; Large: Integer; chPattern: Char; CaseSensitive:Boolean; foundAt:Integer; lastFoundAt:Integer; copyStartsAt:Integer; copyLen:Integer; function __SameChar(StringIndex, PatternIndex: Integer): Boolean; begin if CaseSensitive then Result := (aSourceString[StringIndex] = aFindString[PatternIndex]) else Result := (CompareText(aSourceString[StringIndex], aFindString[PatternIndex]) = 0); end; // function __SameChar(StringIndex, PatternIndex: Integer): Boolean; begin result := ''; lastFoundAt := 0; fromPos := 1; errors := 0; CaseSensitive := rfIgnoreCase in Flags; Limit := Length(aSourceString); guard := Length(aFindString); Index := 0; LengthPattern := Length(aFindString); LengthString := Length(aSourceString); for chPattern := Low(Char) to High(Char) do SkipTable[chPattern] := LengthPattern; for Index := 1 to LengthPattern -1 do SkipTable[aFindString[Index]] := LengthPattern - Index; Large := LengthPattern + LengthString + 1; LastMarker := SkipTable[aFindString[LengthPattern]]; SkipTable[aFindString[LengthPattern]] := Large; while (fromPos>=1) and (fromPos <= Limit) and (Index<=Limit) do begin Index := fromPos + LengthPattern -1; if Index>Limit then break; kIndex := 0; foundAt := 0; while Index <= LengthString do begin repeat Index := Index + SkipTable[aSourceString[Index]]; until Index > LengthString; if Index <= Large then Break else Index := Index - Large; kIndex := 1; while (kIndex < LengthPattern) and __SameChar(Index - kIndex, LengthPattern - kIndex) do Inc(kIndex); if kIndex = LengthPattern then begin foundAt := Index - kIndex + 1; Index := Index + LengthPattern; //fromPos := Index; fromPos := (foundAt+LengthPattern); if lastFoundAt=0 then begin copyStartsAt := 1; copyLen := foundAt-copyStartsAt; end else begin copyStartsAt := lastFoundAt+LengthPattern; copyLen := foundAt-copyStartsAt; end; if (copyLen<=0)or(copyStartsAt<=0) then begin Inc(errors); end; Result := Result + Copy(aSourceString, copyStartsAt, copyLen ) + aReplaceString; lastFoundAt := foundAt; if not (rfReplaceAll in Flags) then fromPos := 0; // break out of outer while loop too! break; end else begin if __SameChar(Index, LengthPattern) then Index := Index + LastMarker else Index := Index + SkipTable[aSourceString[Index]]; end; // if kIndex = LengthPattern then begin end; // while Index <= LengthString do begin end; if (lastFoundAt=0) then begin // nothing was found, just return whole original string Result := aSourceString; end else if (lastFoundAt+LengthPattern < Limit) then begin // the part that didn't require any replacing, because nothing more was found, // or rfReplaceAll flag was not specified, is copied at the // end as the final step. copyStartsAt := lastFoundAt+LengthPattern; copyLen := Limit; { this number can be larger than needed to be, and it is harmless } Result := Result + Copy(aSourceString, copyStartsAt, copyLen ); end; end;

Okay, problem: Stack footprint of this:

var skiptable : array [Char] of Integer; // 65536*4 bytes stack usage on Unicode delphi

Goodbye CPU hell, Hello stack hell. If I go for a dynamic array, then I have to resize it at runtime. So this thing is basically fast, because the Virtual Memory system on your computer doesn't blink at 256K going on the stack, but this is not always an optimal piece of code. Nevertheless my PC doesn't blink at big stack stuff like this. It's not going to become a Delphi standard library default or win any fastcode challenge in the future, with that kinda footprint. I think that repeated searches are a case where the above code should be written as a class, and the skiptable should be a data field in that class. Then you can build the boyer-moore table once, and over time, if the string is invariant, repeatedly use that object to do fast lookups.

更多推荐

有没有一个Boyer

本文发布于:2023-11-30 09:08:34，感谢您对本站的认可！

本文链接:https://www.elefans.com/category/jswz/34/1649467.html