How to search for many variations of a substring in a string

I am trying to search for a substring in a string, but I figure there has to be a more efficient way than this:

      //search for volume
     if AnsiContainsStr(SearchString, 'v1') then
         Volume := '1';
     if AnsiContainsStr(SearchString, 'V1') then
         Volume := '1';
     if AnsiContainsStr(SearchString, 'Volume1') then
         Volume := '1';
     if AnsiContainsStr(SearchString, 'Volume 1') then
         Volume := '1';
     if AnsiContainsStr(SearchString, 'Vol1') then
         Volume := '1';
     if AnsiContainsStr(SearchString, 'vol1') then
         Volume := '1';
     if AnsiContainsStr(SearchString, 'Vol 1') then
         Volume := '1';
     if AnsiContainsStr(SearchString, 'vol 1') then
         Volume := '1';
     if AnsiContainsStr(SearchString, 'Vol.1') then
         Volume := '1';
     if AnsiContainsStr(SearchString, 'vol.1') then
         Volume := '1';
     if AnsiContainsStr(SearchString, 'Vol. 1') then
         Volume := '1';
     if AnsiContainsStr(SearchString, 'vol. 1') then
         Volume := '1';


     if AnsiContainsStr(SearchString, 'v2') then
         Volume := '2';
     if AnsiContainsStr(SearchString, 'V2') then
         Volume := '2';
     if AnsiContainsStr(SearchString, 'Volume2') then
         Volume := '2';
     if AnsiContainsStr(SearchString, 'Volume 2') then
         Volume := '2';
     if AnsiContainsStr(SearchString, 'Vol2') then
         Volume := '2';
     if AnsiContainsStr(SearchString, 'vol2') then
         Volume := '2';
     if AnsiContainsStr(SearchString, 'Vol 2') then
         Volume := '2';
     if AnsiContainsStr(SearchString, 'vol 2') then
         Volume := '2';
     if AnsiContainsStr(SearchString, 'Vol.2') then
         Volume := '2';
     if AnsiContainsStr(SearchString, 'vol.2') then
         Volume := '2';
     if AnsiContainsStr(SearchString, 'Vol. 2') then
         Volume := '2';
     if AnsiContainsStr(SearchString, 'vol. 2') then
         Volume := '2';

Since you tagged this with XE2, you can use a regular expression to do this match easily:

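  // Requires System.RegularExpressions in the uses clause.
  var
     Regex: String;
  begin
     // ^ and $ anchor the pattern, so SearchString must consist of the volume token only.
     Regex := '^[v](ol\.?|olume)?\s*(1|\.\s*1)$';
     if TRegEx.IsMatch(SearchString, Regex, [roIgnoreCase]) then
        Volume := '1';
     Regex := '^[v](ol\.?|olume)?\s*(2|\.\s*2)$';
     if TRegEx.IsMatch(SearchString, Regex, [roIgnoreCase]) then
        Volume := '2';
  end;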

Now, I'm not the best at devising regular expressions, but I tested the one above and it seems to match all your variations (maybe someone else can come up with one that is more succinct).

For a lot of strings and frequent searches, a suffix tree would be your best bet. Otherwise, a regular expression is an easier route that could also help; your strings look regular enough.

Building on @user582118's answer:

If you use ^v(ol\.?|olume)?\s*([0-9]+)$ as the RegEx pattern, you don't have to test each and every possible numerical value. It matches one or more numeric characters at the end. You can then use TMatch's Value and Groups properties to extract the number from the string.

var
  RegEx: TRegEx; // This is a record, not a class, and doesn't need to be freed!
  Match: TMatch;
  i: Integer;
begin
  RegEx := TRegEx.Create('^v(ol\.?|olume)?\s*([0-9]+)$');
  Match := RegEx.Match('vol.3456');
  WriteLn('Value: ' + Match.Value);
  for i := 0 to Match.Groups.Count - 1 do
    WriteLn('Group', i, ': ', Match.Groups[i].Value);
end;

Gives:

Value: vol.3456
Group0: vol.3456
Group1: ol.
Group2: 3456
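
The question used AnsiContainsStr, so the volume marker may sit anywhere inside a longer string. A minimal sketch of that case follows, assuming the anchors are dropped, a word boundary (\b) is added so that the "v" cannot match in the middle of a word, and roIgnoreCase is used. ExtractVolume is a hypothetical helper name, not something from the answers above, and the \b restriction is an assumption that is stricter than the original Contains checks.

```pascal
program VolumeRegexSketch;

{$APPTYPE CONSOLE}

uses
  System.RegularExpressions;

// Hypothetical helper: returns the digits that follow a v / vol / vol. / volume
// prefix anywhere in Input, or '' when nothing matches.
function ExtractVolume(const Input: string): string;
var
  Match: TMatch;
begin
  // No ^ or $ here, so the token may appear anywhere in Input.
  // \b is an added assumption: it keeps "v" from matching inside a word.
  Match := TRegEx.Match(Input, '\bv(ol\.?|olume)?\s*([0-9]+)', [roIgnoreCase]);
  if Match.Success then
    Result := Match.Groups[2].Value   // group 2 holds the digits
  else
    Result := '';
end;

begin
  WriteLn(ExtractVolume('My Book Vol. 2 (2011)'));
  WriteLn(ExtractVolume('no volume here'));
end.
```

Returning the digits as a string keeps the sketch close to the question, where Volume is assigned '1' or '2'.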

Try something like this:

const
  // Longest prefixes first, so 'VOLUME ' is tried before 'VOL' and 'V'.
  Prefixes: array[0..6] of String = (
    'VOLUME ',
    'VOLUME',
    'VOL. ',
    'VOL ',
    'VOL.',
    'VOL',
    'V'
  );

var
  S: String;
  P: PChar;
  J, Len: Integer;
  Volume: Char;
begin
  Volume := #0;
  S := UpperCase(SearchString);
  P := PChar(S);
  Len := Length(S);
  while (Len > 0) and (Volume = #0) do
  begin
    if (P^ <> 'V') then begin
      Inc(P);
      Dec(Len);
      Continue;
    end;
    for J := Low(Prefixes) to High(Prefixes) do
    begin
      if AnsiStrLComp(P, PChar(Prefixes[J]), Length(Prefixes[J])) = 0 then
      begin
        Inc(P, Length(Prefixes[J]));
        Dec(Len, Length(Prefixes[J]));
        if (Len > 0) then begin
          if (P^ >= '1') and (P^ <= '7') then
            Volume := P^;
        end;
        Break;
      end;
    end;
  end;
end;

I had to do something similar once for comparing mailing addresses. I stripped out white space and punctuation. Then I used CompareText so it was case insensitive.

A lot of your If statements deal with strings that may or may not have a period or space between "Vol" or "Volume" and the number. Remove the period and whitespace and you are left with two If statements per volume number: one for VOL and one for VOLUME. You might even be able to whittle that down to one If statement per volume by replacing "volume" with "vol".
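
A minimal sketch of that normalization, assuming it is acceptable to upper-case the whole search string and to drop every space and period in it. NormalizeTitle is a hypothetical helper name chosen for illustration.

```pascal
program NormalizeSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper: upper-cases S, removes spaces and periods,
// and shortens VOLUME to VOL so that one check per volume remains.
function NormalizeTitle(const S: string): string;
begin
  Result := UpperCase(S);
  Result := StringReplace(Result, ' ', '', [rfReplaceAll]);
  Result := StringReplace(Result, '.', '', [rfReplaceAll]);
  Result := StringReplace(Result, 'VOLUME', 'VOL', [rfReplaceAll]);
end;

begin
  // 'Vol. 1', 'vol.1' and 'Volume 1' all normalize to text containing 'VOL1'.
  if Pos('VOL1', NormalizeTitle('Some Title Volume 1')) > 0 then
    WriteLn('Volume 1');
end.
```

Two caveats with this shortcut: 'VOL1' is also contained in 'VOL10', and removing spaces can join neighbouring words into an accidental match, so it suits short, predictable strings best. The bare "v1" form from the question would still need its own check.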

Make your search string upper case first (once), and then do each check against that upper-case version only. That halves the number of checks without requiring case-insensitive searches (which may change the case of both strings every time).

You could go a step further and use one of the wildcard match functions in the JCL, such as StrMatches. However, while this would reduce the number of lines of code, it would not be as fast as the specific matches.

If you expect to handle many different values for Volume, write your own function to search for the alphabetic part of the string, then do a separate check for the number that comes after it.

If you want it easy but slow, go the RegExp way.

If you want it fast, then read the answer by @LeleDumbo.

BUT! Before the real search, make an all-uppercase copy of the string (AnsiUpperCase function). A case-insensitive search pays a cost on every character. It is better to make uppercase copies of both the string and the search patterns. (Oh, @RobMcDonell already told you that :-) )

You should convert the prefixes into a tree. In this simple example it fits into a list (array): "V", "OL", "UME". In a more complex case you could be searching for V-OL-UME or V-ER-SION, which share the same start and then split into different tails.

Then read about finite-state machines (http://en.wikipedia.org/wiki/Finite-state_machine); that is what you have to build.

A simple draft (not covering all possible use cases, for example "Vol . 2.2") would be:

Start in the search-txt-1 state, looking at character #1. On each loop iteration you have a current state and a current character position (current-number) to consider; everything to the left has already been scanned:

  1. If state is search-txt-1, search for txt-1 (namely "V") at the current character and anywhere to the right (System.StrUtils.PosEx function)

    1.1. If not found: exit the loop, no text found

    1.2. If found: set current-number to the position just after the match, state := search-txt-2, next loop

  2. If state is search-txt-2, check for txt-2 ("OL") at the current character only! (lazy: System.Copy(txt, current-char, System.Length(txt-2)) = txt-2; fast: a comparison routine that takes length and offset, from the Jedi Code Library)

    2.1. If found: inc(current-number, length(txt-2)), state := search-txt-3, next loop

    2.2. If not found: do NOT change current-number, state := skip-dot, next loop

  3. If state is search-txt-3, check for txt-3 ("UME") as above

    3.1. If found: inc(current-number, length(txt-3)), state := skip-dot, next loop

    3.2. If not found: do NOT change current-number, state := skip-dot, next loop

  4. If state is skip-dot, check whether current-char is a dot

    4.1. If it is: inc(current-number), state := skip-few-blanks, next loop

    4.2. If it is not: do NOT change current-number, state := skip-few-blanks, next loop

  5. If state is skip-few-blanks, check whether current-char is " "

    5.1. If it is: inc(current-number), state := skip-few-blanks, next loop (there may be more blanks)

    5.2. If it is not: do NOT change current-number, state := maybe-number, next loop

  6. If state is maybe-number, test System.Character.IsDigit(current-char)

    6.1. If it is not a digit: no number, this attempt failed, try again; do NOT change current-number, state := search-txt-1, next loop

    6.2. If it is: remember where the number started, state := reading-number, inc(current-number), next loop

  7. If state is reading-number, test System.Character.IsDigit(current-char)

    7.1. If it is: one more digit; state := reading-number, inc(current-number), next loop

    7.2. If it is not: the number is over; take the slice of the string from the first digit to the previous character (the last digit), convert it (StrToInt(Copy(string, number-start, number-length))) and exit the loop (you do not search for several numbers in one string, do you?)

For more complex grammars there are tools like Yacc/Bison. But for such a simple one you can make your own custom FSM; it is not hard, and it is the fastest way. Just be very attentive and do not make errors in the state transitions and current-char position shifts.

I hope I did not make any, but you have to test it.
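
The draft above can be sketched in Delphi roughly as follows. This is an illustration under stated assumptions, not a tested production routine: the state names, the helpers MatchesAt and IsDigitAt, and the function name FindVolume are all hypothetical, the prefix pieces are hard-coded as "V", "OL", "UME", and the function returns -1 when no volume number is found.

```pascal
program VolumeFsmSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.StrUtils;

type
  TState = (stFindV, stTryOL, stTryUME, stSkipDot, stSkipBlanks,
            stMaybeNumber, stReadNumber);

// True when Sub occurs in S exactly at 1-based position Index (the "lazy" Copy check).
function MatchesAt(const S, Sub: string; Index: Integer): Boolean;
begin
  Result := Copy(S, Index, Length(Sub)) = Sub;
end;

// True when position Index exists in S and holds an ASCII digit.
function IsDigitAt(const S: string; Index: Integer): Boolean;
begin
  Result := (Index <= Length(S)) and (S[Index] >= '0') and (S[Index] <= '9');
end;

// Hypothetical helper: returns the first volume number found in Text, or -1.
function FindVolume(const Text: string): Integer;
var
  S: string;
  State: TState;
  Cur, NumStart: Integer;
begin
  Result := -1;
  S := UpperCase(Text);          // upper-case once, then compare case-sensitively
  State := stFindV;
  Cur := 1;
  NumStart := 0;
  while True do
    case State of
      stFindV:                   // step 1: look for "V" here or further right
        begin
          Cur := PosEx('V', S, Cur);
          if Cur = 0 then Exit;  // 1.1: no "V" left, nothing found
          Inc(Cur);              // 1.2: continue just after the "V"
          State := stTryOL;
        end;
      stTryOL:                   // step 2: optional "OL" at the current position
        if MatchesAt(S, 'OL', Cur) then
        begin
          Inc(Cur, 2);
          State := stTryUME;
        end
        else
          State := stSkipDot;
      stTryUME:                  // step 3: optional "UME"
        begin
          if MatchesAt(S, 'UME', Cur) then Inc(Cur, 3);
          State := stSkipDot;
        end;
      stSkipDot:                 // step 4: optional single dot
        begin
          if MatchesAt(S, '.', Cur) then Inc(Cur);
          State := stSkipBlanks;
        end;
      stSkipBlanks:              // step 5: any number of blanks
        if MatchesAt(S, ' ', Cur) then
          Inc(Cur)
        else
          State := stMaybeNumber;
      stMaybeNumber:             // step 6: a digit must follow, else start over
        if IsDigitAt(S, Cur) then
        begin
          NumStart := Cur;
          Inc(Cur);
          State := stReadNumber;
        end
        else
          State := stFindV;
      stReadNumber:              // step 7: consume digits, then convert and stop
        if IsDigitAt(S, Cur) then
          Inc(Cur)
        else
        begin
          Result := StrToIntDef(Copy(S, NumStart, Cur - NumStart), -1);
          Exit;
        end;
    end;
end;

begin
  WriteLn(FindVolume('Some Book vol. 12 extra'));
  WriteLn(FindVolume('nothing to see'));
end.
```

The loop terminates because every pass through the first state moves the position past the "V" it found. StrToIntDef is used instead of StrToInt so that an overlong digit run yields -1 rather than raising an exception; that choice is an addition to the draft.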
