
Faster way to split text in Delphi TStringList

I have an app that needs to do heavy text manipulation in a TStringList. Basically, I need to split text by a delimiter; for instance, if I have a single line with 1000 chars and this delimiter occurs 3 times in this line, then I need to split it into 3 lines. The delimiter can contain more than one char; it can be a tag like '[test]', for example.

I've written two functions to do this task with 2 different approaches, but both are slow on large amounts of text (usually more than 2 MB).

How can I achieve this goal in a faster way?

Here are both functions. Both receive 2 parameters: 'lines', which is the original TStringList, and 'q', which is the delimiter.

function splitlines(lines : tstringlist; q: string) : integer;
var
  s, aux, ant : string;
  i,j : integer;
  flag : boolean;
  m2 : tstringlist;
begin
  m2 := tstringlist.create;
  try
    m2.BeginUpdate;
    result := 0;
    for i := 0 to lines.count-1 do
    begin
      s := lines[i];
      for j := 1 to length(s) do
      begin
        flag := lowercase(copy(s,j,length(q))) = lowercase(q);
        if flag then
        begin
          inc(result);
          m2.add(aux);
          aux := s[j];
        end
        else
          aux := aux + s[j];
      end;
      m2.add(aux);
      aux := '';
    end;
    m2.EndUpdate;
    lines.text := m2.text;
  finally
    m2.free;
  end;
end;


function splitLines2(lines : tstringlist; q: string) : integer;
var
  aux, p : string;
  i : integer;
  flag : boolean;
begin
  //maux1 and maux2 are already instanced in the parent class
  try
    maux2.text := lines.text;
    p := '';
    i := 0;
    flag := false;
    maux1.BeginUpdate;
    maux2.BeginUpdate;
    while (pos(lowercase(q),lowercase(maux2.text)) > 0) and (i < 5000) do
    begin
      flag := true;
      aux := p+copy(maux2.text,1,pos(lowercase(q),lowercase(maux2.text))-1);
      maux1.add(aux);
      maux2.text := copy(maux2.text,pos(lowercase(q),lowercase(maux2.text)),length(maux2.text));
      p := copy(maux2.text,1,1);
      maux2.text := copy(maux2.text,2,length(maux2.text));
      inc(i);
    end;
  finally
    result := i;
    maux1.EndUpdate;
    maux2.EndUpdate;
    if flag then
    begin
      maux1.add(p+maux2.text);
      lines.text := maux1.text;
    end;
  end;
end;

I've not tested the speed, but for academic purposes, here's an easy way to split the strings:

myStringList.Text :=
  StringReplace(myStringList.Text, myDelimiter, #13#10, [rfReplaceAll]);
// Use [rfReplaceAll, rfIgnoreCase] if you want to ignore case

When you set the Text property of TStringList, it parses the string for line breaks and splits there. So converting the list to a string, replacing the delimiter with line breaks, and then assigning the result back to the Text property works.
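
Two side effects of this approach are worth checking before relying on it: the delimiter itself is removed from the output, and any line breaks already present in the text also produce splits. A minimal sketch that makes both visible, assuming a hypothetical helper name (CountPiecesViaReplace is not part of the RTL):

```pascal
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}  // lets the sketch build with Free Pascal too
uses
  SysUtils, Classes;

// Hypothetical helper: returns how many lines Source yields once every
// occurrence of Delim (ignoring case) has been turned into a line break.
function CountPiecesViaReplace(const Source, Delim: string): Integer;
var
  List: TStringList;
begin
  List := TStringList.Create;
  try
    // Assigning Text re-parses the whole string on line breaks.
    List.Text := StringReplace(Source, Delim, sLineBreak,
      [rfReplaceAll, rfIgnoreCase]);
    Result := List.Count;
  finally
    List.Free;
  end;
end;
```

For 'a[test]b[TEST]c' with the delimiter '[test]' this gives three lines ('a', 'b', 'c'), and an input that already contains a line break is split there as well.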

The problems with your code (at least the second approach) are:

  • You are constantly calling LowerCase, which is slow when called that many times.
  • If I read it correctly, you are copying the whole remaining text back to the original source on every iteration. This is sure to be extra slow for large strings (e.g. files).
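
Both points can be avoided by lowercasing the text once and scanning it in a single pass. The following is only a sketch under some assumptions: the function names are hypothetical, LowerCase handles ASCII letters only, and the delimiter is kept at the start of the next piece, as the functions in the question do.

```pascal
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}  // lets the sketch build with Free Pascal too
uses
  SysUtils, Classes, StrUtils;

// Appends the pieces of Source to Dest, starting a new piece at every
// case-insensitive occurrence of Delim. Returns the number of delimiters found.
function SplitBeforeDelimiter(const Source, Delim: string; Dest: TStrings): Integer;
var
  LowerSource, LowerDelim: string;
  PieceStart, Hit: Integer;
begin
  Result := 0;
  if Delim = '' then
  begin
    Dest.Add(Source);
    Exit;
  end;
  // Lowercase once; LowerCase keeps the length, so positions still line up.
  LowerSource := LowerCase(Source);
  LowerDelim := LowerCase(Delim);
  PieceStart := 1;
  Hit := PosEx(LowerDelim, LowerSource, 1);
  while Hit > 0 do
  begin
    Dest.Add(Copy(Source, PieceStart, Hit - PieceStart));
    PieceStart := Hit;  // the delimiter stays at the start of the next piece
    Inc(Result);
    Hit := PosEx(LowerDelim, LowerSource, Hit + Length(LowerDelim));
  end;
  Dest.Add(Copy(Source, PieceStart, MaxInt));
end;

// Splits every line of Text and returns the joined result, so it can be
// used as Lines.Text := SplitAllLines(Lines.Text, q).
function SplitAllLines(const Text, Delim: string): string;
var
  Source, Pieces: TStringList;
  I: Integer;
begin
  Source := TStringList.Create;
  try
    Pieces := TStringList.Create;
    try
      Source.Text := Text;
      for I := 0 to Source.Count - 1 do
        SplitBeforeDelimiter(Source[I], Delim, Pieces);
      Result := Pieces.Text;
    finally
      Pieces.Free;
    end;
  finally
    Source.Free;
  end;
end;
```

Each character is visited a bounded number of times, and no intermediate copy of the remaining text is made. To drop the delimiter instead of keeping it, set PieceStart to Hit + Length(LowerDelim).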

I have a tokenizer in my library. It's not the fastest or the best, but it should do (you can get it from the Cromis Library; just use the units Cromis.StringUtils and Cromis.Unicode):

type
  TTokens = array of ustring;

  TTextTokenizer = class
  private
    FTokens: TTokens;
    FDelimiters: array of ustring;
  public
    constructor Create;
    procedure Tokenize(const Text: ustring);
    procedure AddDelimiters(const Delimiters: array of ustring);
    property Tokens: TTokens read FTokens;
  end;

{ TTextTokenizer }

procedure TTextTokenizer.AddDelimiters(const Delimiters: array of ustring);
var
  I: Integer;
begin
  if Length(Delimiters) > 0 then
  begin
    SetLength(FDelimiters, Length(Delimiters));

    for I := 0 to Length(Delimiters) - 1 do
      FDelimiters[I] := Delimiters[I];
  end;
end;

constructor TTextTokenizer.Create;
begin
  SetLength(FTokens, 0);
  SetLength(FDelimiters, 0);
end;

procedure TTextTokenizer.Tokenize(const Text: ustring);
var
  I, K: Integer;
  Counter: Integer;
  NewToken: ustring;
  Position: Integer;
  CurrToken: ustring;
begin
  SetLength(FTokens, 100);
  CurrToken := '';
  Counter := 0;

  for I := 1 to Length(Text) do
  begin
    CurrToken := CurrToken + Text[I];

    for K := 0 to Length(FDelimiters) - 1 do
    begin
      Position := Pos(FDelimiters[K], CurrToken);

      if Position > 0 then
      begin
        NewToken := Copy(CurrToken, 1, Position - 1);

        if NewToken <> '' then
        begin
          if Counter >= Length(FTokens) then  // grow before writing past the end
            SetLength(FTokens, Length(FTokens) * 2);

          FTokens[Counter] := Trim(NewToken);
          Inc(Counter)
        end;

        CurrToken := '';
      end;
    end;
  end;

  if CurrToken <> '' then
  begin
    if Counter >= Length(FTokens) then  // grow before writing past the end
      SetLength(FTokens, Length(FTokens) * 2);

    FTokens[Counter] := Trim(CurrToken);
    Inc(Counter)
  end;

  SetLength(FTokens, Counter);
end;

How about just using StrTokens from the JCL library?

procedure StrTokens(const S: string; const List: TStrings);

It's open source: http://sourceforge.net/projects/jcl/

As an additional option, you can use regular expressions. Recent versions of Delphi (XE4 and XE5) come with built-in regular expression support; for older versions, a free regex library is available for download (zip file) at Regular-Expressions.info.

For the built-in regex support (uses the generic TArray<string>):

var
  RegexObj: TRegEx;
  SplitArray: TArray<string>;
begin
  SplitArray := nil;
  try
    RegexObj := TRegEx.Create('\[test\]'); // Your sample expression. Replace with q
    SplitArray := RegexObj.Split(Lines.Text, 0);
  except
    on E: ERegularExpressionError do begin
    // Syntax error in the regular expression
    end;
  end;
  // Use SplitArray
end;

For using TPerlRegEx in earlier Delphi versions:

var
  Regex: TPerlRegEx;
  m2: TStringList;
begin
  m2 := TStringList.Create;
  try
    Regex := TPerlRegEx.Create;
    try
      Regex.RegEx := '\[test\]';  //  Using your sample expression - replace with q
      Regex.Options := [];
      Regex.State := [preNotEmpty];
      Regex.Subject := Lines.Text;
      Regex.SplitCapture(m2, 0);
    finally
      Regex.Free;
    end;
    // Work with m2
  finally
    m2.Free;
  end;
end;

(For those unaware, the backslashes (\) in the sample expression are there because the [ and ] characters are meaningful in regular expressions and need to be escaped to be matched literally. In ordinary text, they need no escaping.)
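
Since the delimiter q arrives as plain text, it has to be escaped before it is used as the pattern. A minimal sketch of one way to do that by hand; EscapeRegexLiteral is a hypothetical helper, not part of any library, and the set of metacharacters it covers is an assumption based on common PCRE syntax:

```pascal
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}  // lets the sketch build with Free Pascal too

// Hypothetical helper: prefixes each regex metacharacter in S with a
// backslash so that S is matched literally when used as a pattern.
function EscapeRegexLiteral(const S: string): string;
const
  MetaChars = '\^$.|?*+()[]{}';
var
  I: Integer;
begin
  Result := '';
  for I := 1 to Length(S) do
  begin
    if Pos(S[I], MetaChars) > 0 then
      Result := Result + '\';
    Result := Result + S[I];
  end;
end;
```

With this, EscapeRegexLiteral('[test]') yields the sample expression \[test\]. Where the built-in TRegEx is available, its Escape class function is probably the better choice; the manual version is mainly of use with older Delphi versions.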
