Efficient way to find a string in a stream in Delphi

I have come up with this function to return the number of occurrences of a string in a Delphi stream. However, I suspect there is a more efficient way to achieve this, since I am using "higher level" constructs (char) and not working at the lower byte/pointer level (which I am not that familiar with).

function ReadStream(const S: AnsiString; Stream: TMemoryStream): Integer;
var
  Arr: array of AnsiChar;  // sliding window holding the last Length(S) characters read
  Buf: AnsiChar;

  procedure AddChar(const C: AnsiChar);
  var
    I: Integer;
  begin
    // Shift the window left by one and append C (dynamic arrays are 0-based)
    for I := 0 to High(Arr) - 1 do
      Arr[I] := Arr[I + 1];
    Arr[High(Arr)] := C;
  end;

  function IsEqual: Boolean;
  var
    I: Integer;
  begin
    Result := True;
    for I := 1 to Length(S) do
      if S[I] <> Arr[I - 1] then
      begin
        Result := False;
        Break;
      end;
  end;

begin
  Result := 0;
  if S = '' then
    Exit;
  Stream.Position := 0;
  SetLength(Arr, Length(S));
  // Stop as soon as a read returns no data, so the last character is not processed twice
  while Stream.Read(Buf, 1) = 1 do
  begin
    AddChar(Buf);
    if IsEqual then
      Inc(Result);
  end;
end;

Can someone supply a procedure that is more efficient?

Stream has a method that will let you get into the internal buffer.

You can get a pointer to the internal buffer using the Memory property.
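As a rough sketch of what scanning that buffer in place could look like, assuming the stream holds single-byte (Ansi) data and is smaller than 2 GB: the helper name CountInMemory is made up for illustration, and it relies only on TMemoryStream.Memory, TMemoryStream.Size and CompareMem from SysUtils. Like the code in the question, it counts overlapping matches.

```pascal
program MemoryScanDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

// Hypothetical helper, not part of the RTL: counts (possibly overlapping)
// occurrences of S by reading the stream's buffer in place.
// Assumes the stream is smaller than 2 GB.
function CountInMemory(const S: AnsiString; Stream: TMemoryStream): Integer;
var
  Buf: PAnsiChar;
  I, Len, Last: Integer;
begin
  Result := 0;
  Len := Length(S);
  Last := Integer(Stream.Size) - Len;  // last index where a match can start
  if (Len = 0) or (Last < 0) then
    Exit;
  Buf := PAnsiChar(Stream.Memory);
  for I := 0 to Last do
    // Cheap first-byte test before comparing the whole candidate
    if (Buf[I] = S[1]) and CompareMem(Buf + I, PAnsiChar(S), Len) then
      Inc(Result);
end;

var
  Data: AnsiString;
  Stream: TMemoryStream;
begin
  Data := 'aa b ca';
  Stream := TMemoryStream.Create;
  try
    Stream.WriteBuffer(Data[1], Length(Data));
    Writeln(CountInMemory('a', Stream));  // prints 3
  finally
    Stream.Free;
  end;
end.
```

This avoids the per-character Stream.Read calls and the window shifting, but it is still a naive scan: in the worst case every position is compared against the whole search string.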

If you are working in 32 bit and you are willing to let go of the older TMemoryStream and use TBytesStream instead, you can abuse the fact that a dynamic array and an AnsiString share the same structure in 32 bit.
Unfortunately Emba broke that compatibility in X64, which means that for no good reason whatsoever you cannot have strings > 2GB in X64.

Note that this trick will break in 64 bit! (See fix below)

You can use Boyer-Moore string searching.

This allows you to write code like this:

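function CountOccurrances(const Needle: AnsiString; const Haystack: TBytesStream): Integer;
var
  Start: Integer;  // signed, so a "not found" result <= 0 can end the loop
  Count: Integer;
begin
  Start := 1;
  Count := 0;
  repeat
    // Both search routines are Boyer-Moore variants you have to write yourself;
    // they are assumed to return the 1-based match position, or 0 when nothing is found.
    {$ifdef CPUX86}
    Start := _FindStringBoyerAnsiString(AnsiString(Haystack.Memory), Needle, False, Start);
    {$else}
    Start := __FindStringBoyerAnsiStringIn64BitTArrayByte(TArray<Byte>(Haystack.Memory), Needle, False, Start);
    {$endif}
    if Start >= 1 then
    begin
      Inc(Start, Length(Needle));  // continue after this match
      Inc(Count);
    end;
  until Start <= 0;
  Result := Count;
end;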

For 32 bit you'll have to rewrite the BoyerMoore code to use AnsiString; a trivial rewrite.
For 64 bit you'll have to rewrite the BoyerMoore code to use a TArray<byte> as a first parameter; a relatively simple task.
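For orientation, here is a sketch of a byte-array search that takes the haystack as TBytes, so it does not depend on the 32-bit layout trick. Note that it uses the simpler Boyer-Moore-Horspool variant (bad-character table only), not full Boyer-Moore, and that the names FindBytesHorspool and CountOccurrences are invented for this example. It assumes a stream smaller than 2 GB and counts non-overlapping matches, like the loop above.

```pascal
program HorspoolDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

// Hypothetical helper: returns the 0-based index of the first match of Needle
// in Haystack[Start..Count-1], or -1 if there is none.
function FindBytesHorspool(const Haystack: TBytes; Count: Integer;
  const Needle: AnsiString; Start: Integer): Integer;
var
  Skip: array[Byte] of Integer;
  I, J, N: Integer;
begin
  Result := -1;
  N := Length(Needle);
  if (N = 0) or (Start < 0) then
    Exit;
  // Bad-character table: how far the window may shift for each byte value
  for I := 0 to 255 do
    Skip[I] := N;
  for I := 1 to N - 1 do
    Skip[Ord(Needle[I])] := N - I;
  I := Start;
  while I <= Count - N do
  begin
    // Compare the window against Needle from right to left
    J := N;
    while (J > 0) and (Haystack[I + J - 1] = Ord(Needle[J])) do
      Dec(J);
    if J = 0 then
    begin
      Result := I;
      Exit;
    end;
    // Shift according to the byte under the last position of the window
    Inc(I, Skip[Haystack[I + N - 1]]);
  end;
end;

// Hypothetical wrapper: counts non-overlapping matches in a TBytesStream.
function CountOccurrences(const Needle: AnsiString; Haystack: TBytesStream): Integer;
var
  Idx: Integer;
begin
  Result := 0;
  Idx := FindBytesHorspool(Haystack.Bytes, Integer(Haystack.Size), Needle, 0);
  while Idx >= 0 do
  begin
    Inc(Result);
    Idx := FindBytesHorspool(Haystack.Bytes, Integer(Haystack.Size), Needle,
      Idx + Length(Needle));
  end;
end;

var
  Data: AnsiString;
  Stream: TBytesStream;
begin
  Data := 'aa b ca';
  Stream := TBytesStream.Create(BytesOf(Data));
  try
    Writeln(CountOccurrences('a', Stream));  // prints 3
  finally
    Stream.Free;
  end;
end.
```

The skip table is rebuilt on every call here to keep the sketch short; for long needles or many matches it would make sense to build it once and reuse it.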

If you are looking for efficiency, please try to avoid WinAPI calls that use PChars. C-style strings are a horrible idea, because they do not have a length prefix.

Johan has given you a good answer about Boyer-Moore searching. BM is fine if you are content to use it as a "black box", but if you want to understand what's going on, there is a bit of a gulf between the complexity of your own code and a BM implementation.

You might find it helpful to explore searching that's more efficient than your own code but not so complex as BM. There is one ultra-simple way to do what you want without getting involved with pointers, PChars, etc.

Let's leave aside for a moment the fact that you want to work with a TMemoryStream, and consider finding the number of occurrences of a string SubStr in another string Target.

For efficiency, things you want to avoid are a) repeatedly scanning the same characters over and over and b) copying one or both strings.

Since D7, Delphi has included a PosEx function:

function PosEx(const SubStr, S: string; Offset: Cardinal = 1): Integer;

Description: PosEx returns the index of SubStr in S, beginning the search at Offset. If Offset is 1 (default), PosEx is equivalent to Pos. PosEx returns 0 if SubStr is not found, if Offset is greater than the length of S, or if Offset is less than 1.

So what you can do is repeatedly call PosEx, starting with Offset = 1, and each time it finds SubStr in Target you increment Offset to skip over it, like this (in a console application):

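program CountDemo;

{$APPTYPE CONSOLE}

uses
  StrUtils;  // PosEx is declared here (System.StrUtils in newer Delphi versions)

function ContainsCount(const SubStr, Target : String) : Integer;
var
  i : Integer;
begin
  Result := 0;
  i := 1;
  repeat
    // Search from position i onwards; PosEx returns 0 when there is no further match
    i := PosEx(SubStr, Target, i);
    if i > 0 then begin
      Inc(Result);
      i := i + Length(SubStr);  // skip over the match just found
    end;
  until i <= 0;
end;

var
  Count : Integer;
  Target : String;
begin
  Target := 'aa b ca';
  Count := ContainsCount('a', Target);
  writeln(Count);  // prints 3
  readln;
end.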

The fact that PosEx and ContainsCount both pass SubStr and Target as consts means that no string copying is involved, and it should be obvious that ContainsCount never scans the same characters more than once.

Once you've satisfied yourself that this works, you might care to trace into PosEx to see how it does its stuff.

You can do something which works in a similar way on PChars using the RTL functions StrPos / AnsiStrPos.
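A minimal sketch of the same counting loop built on StrPos, assuming null-terminated input: ContainsCountP is a made-up name, and the code uses only StrPos and StrLen from SysUtils. Because StrPos stops at the first #0, this only suits text data; a stream with embedded zero bytes would be cut short.

```pascal
program StrPosDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: counts non-overlapping occurrences of SubStr in Target.
// Both arguments must be null-terminated.
function ContainsCountP(SubStr, Target: PChar): Integer;
var
  P: PChar;
  Len: Integer;
begin
  Result := 0;
  Len := StrLen(SubStr);
  if Len = 0 then
    Exit;
  P := StrPos(Target, SubStr);      // nil when SubStr does not occur
  while P <> nil do
  begin
    Inc(Result);
    P := StrPos(P + Len, SubStr);   // resume just past the match
  end;
end;

var
  SubStr, Target: string;
begin
  SubStr := 'a';
  Target := 'aa b ca';
  Writeln(ContainsCountP(PChar(SubStr), PChar(Target)));  // prints 3
end.
```

Advancing the pointer past each match plays the role that the Offset parameter plays with PosEx.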

To convert your memory stream to a string, you could use this code from Rob Kennedy's answer to the question "Converting TMemoryStream to 'String' in Delphi 2009":

function MemoryStreamToString(M: TMemoryStream): string;
begin
  SetString(Result, PChar(M.Memory), M.Size div SizeOf(Char));
end;

(Note what he says about the alternative version later in his answer.)

Btw, if you look through the VCL + RTL code, you'll see that quite a lot of the string-parsing and processing code (e.g. in TParser, TStringList, TExpressionParser) does its work with PChars. There's a reason for that of course, because it minimizes character copying and means that most scanning operations can be done by changing pointer (PChar) values.
