
How to search for many variations of a substring in string

I am trying to search for a substring in a string, but I figure there has to be a more efficient way than this:

      //search for volume
     if AnsiContainsStr(SearchString, 'v1') then
         Volume := '1';
     if AnsiContainsStr(SearchString, 'V1') then
         Volume := '1';
     if AnsiContainsStr(SearchString, 'Volume1') then
         Volume := '1';
     if AnsiContainsStr(SearchString, 'Volume 1') then
         Volume := '1';
     if AnsiContainsStr(SearchString, 'Vol1') then
         Volume := '1';
     if AnsiContainsStr(SearchString, 'vol1') then
         Volume := '1';
     if AnsiContainsStr(SearchString, 'Vol 1') then
         Volume := '1';
     if AnsiContainsStr(SearchString, 'vol 1') then
         Volume := '1';
     if AnsiContainsStr(SearchString, 'Vol.1') then
         Volume := '1';
     if AnsiContainsStr(SearchString, 'vol.1') then
         Volume := '1';
     if AnsiContainsStr(SearchString, 'Vol. 1') then
         Volume := '1';
     if AnsiContainsStr(SearchString, 'vol. 1') then
         Volume := '1';


     if AnsiContainsStr(SearchString, 'v2') then
         Volume := '2';
     if AnsiContainsStr(SearchString, 'V2') then
         Volume := '2';
     if AnsiContainsStr(SearchString, 'Volume2') then
         Volume := '2';
     if AnsiContainsStr(SearchString, 'Volume 2') then
         Volume := '2';
     if AnsiContainsStr(SearchString, 'Vol2') then
         Volume := '2';
     if AnsiContainsStr(SearchString, 'vol2') then
         Volume := '2';
     if AnsiContainsStr(SearchString, 'Vol 2') then
         Volume := '2';
     if AnsiContainsStr(SearchString, 'vol 2') then
         Volume := '2';
     if AnsiContainsStr(SearchString, 'Vol.2') then
         Volume := '2';
     if AnsiContainsStr(SearchString, 'vol.2') then
         Volume := '2';
     if AnsiContainsStr(SearchString, 'Vol. 2') then
         Volume := '2';
     if AnsiContainsStr(SearchString, 'vol. 2') then
         Volume := '2';

Since you tagged this with XE2, you can use a regular expression to make this match easily:

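  // Requires System.RegularExpressions in the uses clause.
  var
     Regex: String;
  begin
     // ^ and $ anchor the pattern, so SearchString must consist of the volume marker only.
     Regex := '^[v](ol\.?|olume)?\s*(1|\.\s*1)$';
     if TRegEx.IsMatch(SearchString, Regex, [roIgnoreCase]) then
        Volume := '1';
     Regex := '^[v](ol\.?|olume)?\s*(2|\.\s*2)$';
     if TRegEx.IsMatch(SearchString, Regex, [roIgnoreCase]) then
        Volume := '2';
  end;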

Now, I'm not the best at devising a regular expression, but I tested the one above and it seems to match all your variations (maybe someone else can come up with one that is more succinct).

If you have a lot of strings and search them frequently, a suffix tree would be your best bet. Otherwise a regular expression is the easier way; your strings look regular enough.

Building on @user582118's answer:

If you use ^v(ol\.?|olume)?\s*([0-9]+)$ as the RegEx pattern, you don't have to try each and every possible numerical value. It will match one or more numeric characters at the end. You can then use TMatch's Value and Groups properties to extract the number from the string.

var
  RegEx: TRegEx; // This is a record, not a class, and doesn't need to be freed!
  Match: TMatch;
  i: Integer;
begin
  RegEx := TRegEx.Create('^v(ol\.?|olume)?\s*([0-9]+)$');
  Match := RegEx.Match('vol.3456');
  WriteLn('Value: ' + Match.Value);
  for i := 0 to Match.Groups.Count - 1 do
    WriteLn('Group', i, ': ', Match.Groups[i].Value);
end;

Gives:

Value: vol.3456
Group0: vol.3456
Group1: ol.
Group2: 3456
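
Since the question is about finding the marker inside a longer string, here is a minimal sketch of the same idea without the ^ and $ anchors. ExtractVolume is a hypothetical helper name, and the \b word boundary and roIgnoreCase option are my additions, not part of the pattern above. It assumes the volume marker starts a word.

```pascal
program ExtractVolumeDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.RegularExpressions;

// Hypothetical helper: returns the digits that follow the first volume
// marker found anywhere in S, or an empty string if there is none.
function ExtractVolume(const S: string): string;
var
  Match: TMatch;
begin
  // No ^/$ anchors, so the marker may appear anywhere in S.
  // \b keeps the "v" from matching in the middle of another word.
  Match := TRegEx.Match(S, '\bv(ol\.?|olume)?\s*([0-9]+)', [roIgnoreCase]);
  if Match.Success then
    Result := Match.Groups[2].Value   // group 2 holds the digits
  else
    Result := '';
end;

begin
  WriteLn(ExtractVolume('Some Title Vol. 12 (2nd ed.)'));  // prints 12
  WriteLn(ExtractVolume('no marker here'));                // prints an empty line
end.
```

With this, Volume := ExtractVolume(SearchString) would replace the whole chain of if statements, at the cost of running a regular expression per string.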

Try something like this:

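const
  // Longest prefixes first, so 'VOLUME ' wins over 'VOL' and 'V'.
  Prefixes: array[0..6] of String = (
    'VOLUME ',
    'VOLUME',
    'VOL. ',
    'VOL ',
    'VOL.',
    'VOL',
    'V'
  );

var
  S: String;
  P: PChar;
  J, Len: Integer;
  Volume: Char;
begin
  Volume := #0;
  // Upper-case once so that all comparisons below can be case-sensitive.
  S := UpperCase(SearchString);
  P := PChar(S);
  Len := Length(S);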
  while (Len > 0) and (Volume = #0) do
  begin
    if (P^ <> 'V') then begin
      Inc(P);
      Dec(Len);
      Continue;
    end;
    for J := Low(Prefixes) to High(Prefixes) do
    begin
      if AnsiStrLComp(P, PChar(Prefixes[J]), Length(Prefixes[J])) = 0 then
      begin
        Inc(P, Length(Prefixes[J]));
        Dec(Len, Length(Prefixes[J]));
        if (Len > 0) then begin
          if (P^ >= '1') and (P^ <= '7') then
            Volume := P^;
        end;
        Break;
      end;
    end;
  end;
end;

I had to do something similar once for comparing mailing addresses. I stripped out white space and punctuation. Then I used CompareText so it was case insensitive.

A lot of your If statements deal with comparing strings that may or may not have a period or space between "Vol" or "Volume" and the number. Remove the period and whitespace and you are left with two If statements per volume number: one for VOL and one for VOLUME. You might even be able to whittle that down to one If statement per volume by replacing "volume" with "vol".

Make your search string upper case first (once), and then do each check just against an upper case version of the search string. That reduces the number of checks by half without requiring case-insensitive searches (which may change case of both strings every time).
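
A minimal sketch of the normalize-then-compare idea from the answers above, assuming that only spaces and periods need to be stripped. NormalizeForSearch and HasVolume are hypothetical names introduced for illustration.

```pascal
program NormalizeDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper: upper-cases S once, drops spaces and periods,
// and folds 'VOLUME' into 'VOL' so fewer comparisons are needed.
function NormalizeForSearch(const S: string): string;
var
  C: Char;
begin
  Result := '';
  for C in UpperCase(S) do
    if (C <> ' ') and (C <> '.') then
      Result := Result + C;
  Result := StringReplace(Result, 'VOLUME', 'VOL', [rfReplaceAll]);
end;

// Hypothetical helper: True if S contains a marker for the given volume.
// Two checks remain per volume: the 'VOL' form and the bare 'V' form.
function HasVolume(const S, Number: string): Boolean;
var
  N: string;
begin
  N := NormalizeForSearch(S);
  Result := (Pos('VOL' + Number, N) > 0) or (Pos('V' + Number, N) > 0);
end;

begin
  WriteLn(BoolToStr(HasVolume('Book vol. 1', '1'), True));
  WriteLn(BoolToStr(HasVolume('Book Volume 2', '1'), True));
end.
```

Note that stripping spaces can create false positives (for example 'Nov 1' becomes 'NOV1', which contains 'V1'), and that 'VOL1' is also found inside 'VOL10', just as in the original chain of if statements.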

You could go a step further and use one of the wildcard match functions in the JCL, such as StrMatches. However, while this would reduce the number of lines of code, it would probably not be as fast as having the specific matches.

If you expect many different values for Volume, write your own function to search for the alphabetic part of the string, then do a separate check for what number comes after it.

If you want it easy but slow, go the RegEx way.

If you want it fast, then read the answer by @LeleDumbo.

But before the real search, make an all-uppercase copy of the string with the AnsiUpperCase function. A case-insensitive search pays for the case conversion on every character, so it is better to upper-case both the string and the search patterns once. (Oh, @RobMcDonell already told you that :-) )

You should convert the prefixes into a tree. In this simple example it fits into a list (array): "V", "OL", "UME". In a more complex case you could search for V-OL-UME or V-ER-SION, which share the same start and split into different tails.

Then read about http://en.wikipedia.org/wiki/Finite-state_machine, because that is what you have to build.

A simple draft (not covering all possible use cases, for example "Vol . 2.2") would be:

Start in the search-txt-1 state, looking at character #1. On each loop iteration you have the current state and the current character index (everything to the left has already been scanned):

  1. if state is search-txt-1, then search for txt-1 (namely "V") at current character and anywhere to the right ( System.StrUtils.PosEx function)

    1.1. If not found - exit the loop, no text found

    1.2. If found - set current-number to the position just after the match, state := search-txt-2, next loop

  2. if state is search-txt-2, then search for txt-2 ("OL") at current character only! (lazy: System.Copy(txt, current-char, system.length(txt-2)) = txt-2; fast: special comparison with length and offset from Jedi CodeLibrary)

    2.1 if found, inc(current-number, length(txt-2)), state := search-txt-3, next loop

    2.2 if not found, do NOT change current-number, state := skip-dot, next loop

  3. if state is search-txt-3, then search for txt-3 like above

    3.1 if found, inc(current-number, length(txt-3)), state := skip-dot, next loop

    3.2 if not found, do NOT change current-number, state := skip-dot, next loop

  4. if state is skip-dot, look if current-char is dot

    4.1 if it is, inc (current-number), state := skip-few-blanks, next loop

    4.2 if it is not do NOT change current-number, state := skip-few-blanks, next loop

  5. if skip-few-blanks then look if current-char is " "

    5.1 if it is, inc (current-number), state := skip-few-blanks, next loop (there may be more blanks)

    5.2 if it is not do NOT change current-number, state := maybe-number, next loop

  6. if state is maybe-number, check System.Character.IsDigit(current-char)

    6.1 if it is not - no number, this attempt failed, try again - do NOT change current-number, state := search-txt-1, next loop

    6.2 if it is, remember where the number started, state := reading-number, inc(current-number), next loop

  7. if state is reading-number, check System.Character.IsDigit(current-char)

    7.1 if it is - one more digit - state := reading-number, inc(current-number), next loop

    7.2 if it is not - the number is over - take the slice of the string from the first digit to the previous character (the last digit), convert it with StrToInt(Copy(string, number-start, number-length)) and exit the loop (you do not search for several numbers in one string, do you?)
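
The draft above could be sketched roughly as follows. This is only one possible reading of the steps, not a tested production routine: FindVolume, TScanState and the state names are hypothetical, the digits are returned as a string instead of being converted, CharInSet stands in for IsDigit, and the end of the string is treated as the end of the number.

```pascal
program VolumeFsmDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.StrUtils;

type
  // One state per numbered step of the draft.
  TScanState = (ssFindV, ssMatchOL, ssMatchUME, ssSkipDot, ssSkipBlanks,
                ssMaybeNumber, ssReadNumber);

// Hypothetical helper: returns the digits after the first volume marker
// in Text, or an empty string if no marker with a number is found.
function FindVolume(const Text: string): string;
var
  S: string;
  State: TScanState;
  Cur, NumStart: Integer;

  // True if Part occurs in S exactly at Cur (the "lazy" Copy comparison).
  function At(const Part: string): Boolean;
  begin
    Result := Copy(S, Cur, Length(Part)) = Part;
  end;

begin
  Result := '';
  S := UpperCase(Text);  // upper-case once, compare case-sensitively after
  State := ssFindV;
  Cur := 1;
  NumStart := 0;
  while Cur <= Length(S) do
    case State of
      ssFindV:                              // step 1
        begin
          Cur := PosEx('V', S, Cur);
          if Cur = 0 then
            Exit;                           // 1.1: no marker left
          Inc(Cur);                         // 1.2: continue after the 'V'
          State := ssMatchOL;
        end;
      ssMatchOL:                            // step 2
        if At('OL') then
        begin
          Inc(Cur, 2);
          State := ssMatchUME;
        end
        else
          State := ssSkipDot;
      ssMatchUME:                           // step 3
        begin
          if At('UME') then
            Inc(Cur, 3);
          State := ssSkipDot;
        end;
      ssSkipDot:                            // step 4
        begin
          if S[Cur] = '.' then
            Inc(Cur);
          State := ssSkipBlanks;
        end;
      ssSkipBlanks:                         // step 5
        if S[Cur] = ' ' then
          Inc(Cur)
        else
          State := ssMaybeNumber;
      ssMaybeNumber:                        // step 6
        if CharInSet(S[Cur], ['0'..'9']) then
        begin
          NumStart := Cur;
          Inc(Cur);
          State := ssReadNumber;
        end
        else
          State := ssFindV;                 // 6.1: retry from this position
      ssReadNumber:                         // step 7
        if CharInSet(S[Cur], ['0'..'9']) then
          Inc(Cur)
        else
          Break;                            // 7.2: number is over
    end;
  // Reached either by Break or by running off the end while reading digits.
  if State = ssReadNumber then
    Result := Copy(S, NumStart, Cur - NumStart);
end;

begin
  WriteLn(FindVolume('Some Book Vol. 12 extra'));  // prints 12
  WriteLn(FindVolume('nothing to see'));           // prints an empty line
end.
```

Like the draft, this sketch has no notion of word boundaries, so any 'V' followed by optional blanks and a digit counts as a marker (for example 'Nov 3').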

For more complex grammars there are tools like Yacc/Bison. But for such a simple one you can make your own custom FSM; it would not be hard, and it would be the fastest way. Just be very careful not to make errors in the state transitions and in the current-char index shifts.

I hope I did not make any, but you have to test it.


 