Faster way to split text in Delphi TStringList
I have an app that needs to do heavy text manipulation in a TStringList. Basically I need to split text by a delimiter; for instance, if I have a single line with 1000 chars and this delimiter occurs 3 times in this line, then I need to split it into 3 lines. The delimiter can contain more than one char; it can be a tag like '[test]', for example.
I've written two functions to do this task with 2 different approaches, but both are slow on large amounts of text (usually more than 2 MB).
How can I achieve this goal in a faster way?
Here are both functions; both receive 2 parameters: 'lines', which is the original TStringList, and 'q', which is the delimiter.
function splitlines(lines : tstringlist; q: string) : integer;
var
  s, aux, ant : string;
  i,j : integer;
  flag : boolean;
  m2 : tstringlist;
begin
  try
    m2 := tstringlist.create;
    m2.BeginUpdate;
    result := 0;
    for i := 0 to lines.count-1 do
    begin
      s := lines[i];
      for j := 1 to length(s) do
      begin
        flag := lowercase(copy(s,j,length(q))) = lowercase(q);
        if flag then
        begin
          inc(result);
          m2.add(aux);
          aux := s[j];
        end
        else
          aux := aux + s[j];
      end;
      m2.add(aux);
      aux := '';
    end;
    m2.EndUpdate;
    lines.text := m2.text;
  finally
    m2.free;
  end;
end;
function splitLines2(lines : tstringlist; q: string) : integer;
var
  aux, p : string;
  i : integer;
  flag : boolean;
begin
  //maux1 and maux2 are already instanced in the parent class
  try
    maux2.text := lines.text;
    p := '';
    i := 0;
    flag := false;
    maux1.BeginUpdate;
    maux2.BeginUpdate;
    while (pos(lowercase(q),lowercase(maux2.text)) > 0) and (i < 5000) do
    begin
      flag := true;
      aux := p+copy(maux2.text,1,pos(lowercase(q),lowercase(maux2.text))-1);
      maux1.add(aux);
      maux2.text := copy(maux2.text,pos(lowercase(q),lowercase(maux2.text)),length(maux2.text));
      p := copy(maux2.text,1,1);
      maux2.text := copy(maux2.text,2,length(maux2.text));
      inc(i);
    end;
  finally
    result := i;
    maux1.EndUpdate;
    maux2.EndUpdate;
    if flag then
    begin
      maux1.add(p+maux2.text);
      lines.text := maux1.text;
    end;
  end;
end;
I haven't tested the speed, but for academic purposes, here's an easy way to split the strings:
myStringList.Text :=
StringReplace(myStringList.Text, myDelimiter, #13#10, [rfReplaceAll]);
// Use [rfReplaceAll, rfIgnoreCase] if you want to ignore case
When you set the Text property of TStringList, it parses on new lines and splits there, so converting to a string, replacing the delimiter with new lines, then assigning it back to the Text property works.
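The replace-and-reassign idea can be wrapped in a function shaped like the ones in the question. This is only a sketch: the name SplitByDelimiter and the choice to return the number of added lines are assumptions, not part of the original answer. Note that, unlike the functions in the question, this version drops the delimiter from the output.

```pascal
program SplitDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

// Hypothetical helper: replaces every occurrence of Delim with a line break
// and lets the Text setter re-split the result. Returns how many lines were
// added to the list.
function SplitByDelimiter(Lines: TStrings; const Delim: string): Integer;
var
  OldCount: Integer;
begin
  OldCount := Lines.Count;
  // rfIgnoreCase mirrors the case-insensitive matching used in the question.
  Lines.Text := StringReplace(Lines.Text, Delim, sLineBreak,
    [rfReplaceAll, rfIgnoreCase]);
  Result := Lines.Count - OldCount;
end;

var
  L: TStringList;
  I: Integer;
begin
  L := TStringList.Create;
  try
    L.Add('one[test]two[TEST]three');
    SplitByDelimiter(L, '[test]');
    for I := 0 to L.Count - 1 do
      Writeln(L[I]);
  finally
    L.Free;
  end;
end.
```

Whether this is fast enough for inputs of several megabytes depends on the StringReplace implementation in the Delphi version at hand, so it is worth timing on real data.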
The problems with your code (at least the second approach) are
I have a tokenizer in my library. It's not the fastest or the best, but it should do (you can get it from the Cromis Library; just use the units Cromis.StringUtils and Cromis.Unicode):
type
  TTokens = array of ustring;
  TTextTokenizer = class
  private
    FTokens: TTokens;
    FDelimiters: array of ustring;
  public
    constructor Create;
    procedure Tokenize(const Text: ustring);
    procedure AddDelimiters(const Delimiters: array of ustring);
    property Tokens: TTokens read FTokens;
  end;

{ TTextTokenizer }

procedure TTextTokenizer.AddDelimiters(const Delimiters: array of ustring);
var
  I: Integer;
begin
  if Length(Delimiters) > 0 then
  begin
    SetLength(FDelimiters, Length(Delimiters));
    for I := 0 to Length(Delimiters) - 1 do
      FDelimiters[I] := Delimiters[I];
  end;
end;

constructor TTextTokenizer.Create;
begin
  SetLength(FTokens, 0);
  SetLength(FDelimiters, 0);
end;

procedure TTextTokenizer.Tokenize(const Text: ustring);
var
  I, K: Integer;
  Counter: Integer;
  NewToken: ustring;
  Position: Integer;
  CurrToken: ustring;
begin
  SetLength(FTokens, 100);
  CurrToken := '';
  Counter := 0;
  for I := 1 to Length(Text) do
  begin
    CurrToken := CurrToken + Text[I];
    for K := 0 to Length(FDelimiters) - 1 do
    begin
      Position := Pos(FDelimiters[K], CurrToken);
      if Position > 0 then
      begin
        NewToken := Copy(CurrToken, 1, Position - 1);
        if NewToken <> '' then
        begin
          // Grow before writing: Counter = Length(FTokens) is already out of range
          if Counter >= Length(FTokens) then
            SetLength(FTokens, Length(FTokens) * 2);
          FTokens[Counter] := Trim(NewToken);
          Inc(Counter)
        end;
        CurrToken := '';
      end;
    end;
  end;
  if CurrToken <> '' then
  begin
    // Same bounds check for the trailing token
    if Counter >= Length(FTokens) then
      SetLength(FTokens, Length(FTokens) * 2);
    FTokens[Counter] := Trim(CurrToken);
    Inc(Counter)
  end;
  // Shrink the array to the number of tokens actually found
  SetLength(FTokens, Counter);
end;
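For the single-delimiter, case-insensitive case in the question, a leaner alternative is to lowercase the text once and walk it with PosEx from StrUtils, copying each piece exactly once. The sketch below is not taken from the Cromis Library; the name SplitAtDelimiter is hypothetical, and it assumes the delimiter itself should be dropped from the output.

```pascal
program PosExSplitDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils, Classes, StrUtils;

// Hypothetical helper: appends to Dest the pieces of S found between
// case-insensitive occurrences of Delim. Returns the number of occurrences.
function SplitAtDelimiter(const S, Delim: string; Dest: TStrings): Integer;
var
  LowerS, LowerDelim: string;
  Start, Hit: Integer;
begin
  Result := 0;
  if Delim = '' then
  begin
    Dest.Add(S);
    Exit;
  end;
  // Lowercase once instead of once per character. LowerCase only changes
  // ASCII letters, so character offsets in LowerS match those in S.
  LowerS := LowerCase(S);
  LowerDelim := LowerCase(Delim);
  Start := 1;
  Dest.BeginUpdate;
  try
    Hit := PosEx(LowerDelim, LowerS, Start);
    while Hit > 0 do
    begin
      Dest.Add(Copy(S, Start, Hit - Start));
      Inc(Result);
      // Continue searching right after the delimiter just found
      Start := Hit + Length(Delim);
      Hit := PosEx(LowerDelim, LowerS, Start);
    end;
    // Whatever follows the last delimiter (possibly an empty string)
    Dest.Add(Copy(S, Start, MaxInt));
  finally
    Dest.EndUpdate;
  end;
end;

var
  Output: TStringList;
  I: Integer;
begin
  Output := TStringList.Create;
  try
    SplitAtDelimiter('one[test]two[TEST]three', '[test]', Output);
    for I := 0 to Output.Count - 1 do
      Writeln(Output[I]);
  finally
    Output.Free;
  end;
end.
```

To process a whole TStringList, the function could be called once per line with a shared destination list, which avoids rebuilding the Text property inside a loop.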
How about just using StrTokens from the JCL library?
procedure StrTokens(const S: string; const List: TStrings);
It's open source: http://sourceforge.net/projects/jcl/
As an additional option, you can use regular expressions. Recent versions of Delphi (XE4 and XE5) come with built-in regular expression support; for older versions, a free regex library is available as a zip download from Regular-Expressions.info.
For the built-in regex support (uses the generic TArray<string>):
var
  RegexObj: TRegEx;
  SplitArray: TArray<string>;
begin
  SplitArray := nil;
  try
    RegexObj := TRegEx.Create('\[test\]'); // Your sample expression. Replace with q
    // Split expects a string, so pass the list's Text rather than the list itself
    SplitArray := RegexObj.Split(Lines.Text, 0);
  except
    on E: ERegularExpressionError do begin
      // Syntax error in the regular expression
    end;
  end;
  // Use SplitArray
end;
For using TPerlRegEx in earlier Delphi versions:
var
  Regex: TPerlRegEx;
  m2: TStringList;
begin
  m2 := TStringList.Create;
  try
    Regex := TPerlRegEx.Create;
    try
      Regex.RegEx := '\[test\]'; // Using your sample expression - replace with q
      Regex.Options := [];
      Regex.State := [preNotEmpty];
      Regex.Subject := Lines.Text;
      Regex.SplitCapture(m2, 0);
    finally
      Regex.Free;
    end;
    // Work with m2
  finally
    m2.Free;
  end;
end;
(For those unaware, the \ characters in the sample expression are there because [ and ] have a special meaning in regular expressions and need to be escaped to be matched literally. Typically, they're not required in the text.)
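When the delimiter arrives in a variable such as q, the escaping can be done in code instead of by hand. The sketch below assumes Delphi XE2 or later (for the System.* unit names) and uses the class functions TRegEx.Escape and TRegEx.Split; the roIgnoreCase option is added on the assumption that matching should be case-insensitive, as in the question.

```pascal
program RegexSplitDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.RegularExpressions;

var
  Q: string;
  Parts: TArray<string>;
  Part: string;
begin
  Q := '[test]';
  // TRegEx.Escape backslash-escapes the regex metacharacters in Q,
  // so the delimiter is matched as literal text.
  Parts := TRegEx.Split('one[test]two[TEST]three', TRegEx.Escape(Q),
    [roIgnoreCase]);
  for Part in Parts do
    Writeln(Part);
end.
```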