Delphi: removing a large number of partially matching strings from a large array

I have two string arrays: array #1 contains about 2.5 million strings and array #2 about 4.5 million. I need to check whether any string from array #2 occurs inside a string of array #1 and, if so, remove that string from array #1.

Because of the "string contains another string" requirement, I cannot use a binary search or similar, and the process currently takes 30+ hours.

What I mean by "string contains another string" is this: array #1 contains the string "houseboat" and array #2 contains "house" somewhere. "house" is in "houseboat", so I have to remove "houseboat" from array #1.

Example code (not the actual code, and not working either) to explain it better:

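for i := 0 to High(array1) do
begin
  for j := 0 to High(array2) do
  begin
    // case-insensitive test: does array1[i] contain array2[j]?
    if AnsiContainsText(array1[i], array2[j]) then
    begin
      martrecordtoremove; // placeholder: flag array1[i] for removal
      break;              // one match is enough, skip the remaining patterns
    end;
  end;
end;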

This will take about 30 hours for all strings.

So my question is, is there any way to do this faster?

To avoid a naive string search, use a string-search algorithm designed to look for a whole set of patterns in a text at once (wiki).

Rabin-Karp is the simplest to implement.
Aho-Corasick has the best worst-case complexity.

The average case is similar for both algorithms, so it is worth checking first whether Rabin-Karp is fast enough for your purposes.
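To make the multi-pattern idea concrete, here is a minimal Aho-Corasick sketch. It is only an illustration under stated assumptions: all names (TNode, AddPattern, BuildLinks, ContainsAnyPattern) are made up for this example, patterns are assumed to consist of lowercase 'a'..'z' only, and the fixed 26-slot node layout is chosen for brevity rather than for memory efficiency with millions of patterns.

```pascal
program AhoCorasickSketch;
{$IFDEF FPC}{$mode delphi}{$ENDIF}
{$APPTYPE CONSOLE}
uses SysUtils;

const
  AlphabetSize = 26; // assumption: patterns contain only 'a'..'z'
type
  TNode = record
    Next: array[0..AlphabetSize - 1] of Integer; // child per letter, -1 = none
    Fail: Integer;  // node for the longest proper suffix that is a trie prefix
    IsEnd: Boolean; // a pattern ends here (directly or via failure links)
  end;
var
  Nodes: array of TNode;
  NodeCount: Integer = 0;

function NewNode: Integer;
var c: Integer;
begin
  if NodeCount = Length(Nodes) then
    SetLength(Nodes, NodeCount * 2 + 16); // grow geometrically
  for c := 0 to AlphabetSize - 1 do
    Nodes[NodeCount].Next[c] := -1;
  Nodes[NodeCount].Fail := 0;
  Nodes[NodeCount].IsEnd := False;
  Result := NodeCount;
  Inc(NodeCount);
end;

procedure AddPattern(const P: string);
var i, c, cur, n: Integer;
begin
  if P = '' then Exit; // an empty pattern would match everything
  cur := 0;
  for i := 1 to Length(P) do
  begin
    c := Ord(P[i]) - Ord('a');
    if Nodes[cur].Next[c] < 0 then
    begin
      n := NewNode; // may reallocate Nodes, so store the result first
      Nodes[cur].Next[c] := n;
    end;
    cur := Nodes[cur].Next[c];
  end;
  Nodes[cur].IsEnd := True;
end;

// Breadth-first pass: sets failure links and fills in missing transitions.
procedure BuildLinks;
var
  Queue: array of Integer;
  Head, Tail, c, u, v, f, t: Integer;
begin
  SetLength(Queue, NodeCount);
  Queue[0] := 0; Head := 0; Tail := 1;
  while Head < Tail do
  begin
    u := Queue[Head]; Inc(Head);
    f := Nodes[u].Fail;
    if Nodes[f].IsEnd then Nodes[u].IsEnd := True; // a suffix is a pattern
    for c := 0 to AlphabetSize - 1 do
    begin
      if u = 0 then t := 0 else t := Nodes[f].Next[c];
      v := Nodes[u].Next[c];
      if v < 0 then
        Nodes[u].Next[c] := t
      else
      begin
        Nodes[v].Fail := t;
        Queue[Tail] := v; Inc(Tail);
      end;
    end;
  end;
end;

// One pass over S, independent of the number of patterns.
function ContainsAnyPattern(const S: string): Boolean;
var i, c, cur: Integer;
begin
  Result := False;
  cur := 0;
  for i := 1 to Length(S) do
  begin
    c := Ord(S[i]) - Ord('a');
    if (c < 0) or (c >= AlphabetSize) then cur := 0 // outside alphabet: restart
    else cur := Nodes[cur].Next[c];
    if Nodes[cur].IsEnd then
    begin Result := True; Exit; end;
  end;
end;

const
  Texts: array[0..2] of string = ('houseboat', 'lighthouse', 'garden');
var i: Integer;
begin
  NewNode; // node 0 is the root
  AddPattern('house'); AddPattern('oat'); BuildLinks;
  for i := 0 to High(Texts) do
    WriteLn(Texts[i], ' -> ', ContainsAnyPattern(LowerCase(Texts[i])));
end.
```

In this sketch "houseboat" and "lighthouse" are flagged and "garden" is not. All patterns from array #2 would be added once, and each string of array #1 is then scanned a single time, so the cost no longer grows with the product of the two array sizes.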


Another possible issue: how is martrecordtoremove implemented? To remove items efficiently, you should avoid repeated memory reallocations.
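One way to avoid repeated reallocations, sketched below, is to only set a flag during the search and then compact the array in a single pass at the end. The names TStrArray, TBoolArray and RemoveMarked are hypothetical; the question does not show how its arrays are declared, so plain dynamic arrays of string are assumed here.

```pascal
program CompactSketch;
{$IFDEF FPC}{$mode delphi}{$ENDIF}
{$APPTYPE CONSOLE}

type
  TStrArray = array of string;   // assumed layout of array #1
  TBoolArray = array of Boolean; // Marked[i] = True means "remove Items[i]"

// Moves every unmarked string towards the front, then shrinks the array once.
procedure RemoveMarked(var Items: TStrArray; const Marked: TBoolArray);
var
  Src, Dst: Integer;
begin
  Dst := 0;
  for Src := 0 to High(Items) do
    if not Marked[Src] then
    begin
      if Dst <> Src then
        Items[Dst] := Items[Src];
      Inc(Dst);
    end;
  SetLength(Items, Dst); // the only reallocation
end;

var
  Items: TStrArray;
  Marked: TBoolArray;
  i: Integer;
begin
  SetLength(Items, 4);
  Items[0] := 'houseboat'; Items[1] := 'garden';
  Items[2] := 'lighthouse'; Items[3] := 'window';
  SetLength(Marked, Length(Items)); // new elements start as False
  Marked[0] := True;
  Marked[2] := True;
  RemoveMarked(Items, Marked);
  for i := 0 to High(Items) do
    WriteLn(Items[i]); // prints garden, then window
end.
```

Deleting each matching element individually shifts the rest of the array every time, which becomes quadratic with millions of entries; the mark-then-compact pass touches each element once.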

You can use a binary search, but the index will be quite big (around 50 million entries, which is not a disaster). The easiest thing to try is to create a sphinxsearch index; it has a parameter that enables matching a word inside a word (internally, this means that for houseboat Sphinx adds these keywords to its index):

houseboat
       at
      oat
     boat
etc.

Search results will come back almost immediately, and creating the index should be quite fast.
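The same idea can be tried without Sphinx. The sketch below builds the suffix index in memory with a TStringList and looks patterns up by binary search. It is an illustration only: OrdinalCompare and LowerBound are hypothetical helpers, the input is assumed to be lowercased already, and the comparison is ordinal (CompareStr) so that all suffixes sharing a prefix end up next to each other.

```pascal
program SuffixIndexSketch;
{$IFDEF FPC}{$mode delphi}{$ENDIF}
{$APPTYPE CONSOLE}
uses SysUtils, Classes;

// Ordinal order keeps every suffix that starts with a given prefix contiguous.
function OrdinalCompare(List: TStringList; A, B: Integer): Integer;
begin
  Result := CompareStr(List[A], List[B]);
end;

// Index of the first entry >= P in ordinal order, or Index.Count if none.
function LowerBound(Index: TStringList; const P: string): Integer;
var
  Lo, Hi, Mid: Integer;
begin
  Lo := 0;
  Hi := Index.Count;
  while Lo < Hi do
  begin
    Mid := (Lo + Hi) div 2;
    if CompareStr(Index[Mid], P) < 0 then
      Lo := Mid + 1
    else
      Hi := Mid;
  end;
  Result := Lo;
end;

const
  Words: array[0..2] of string = ('houseboat', 'lighthouse', 'garden');
  Patterns: array[0..1] of string = ('house', 'oat');
var
  Index: TStringList;
  Remove: array[0..2] of Boolean;
  w, k, p, idx: Integer;
begin
  Index := TStringList.Create;
  try
    // one entry per suffix; Objects[] carries the position of the owning word
    for w := 0 to High(Words) do
      for k := 1 to Length(Words[w]) do
        Index.AddObject(Copy(Words[w], k, MaxInt), TObject(NativeInt(w)));
    Index.CustomSort(OrdinalCompare);

    for w := 0 to High(Remove) do
      Remove[w] := False;
    for p := 0 to High(Patterns) do
    begin
      idx := LowerBound(Index, Patterns[p]);
      // walk the contiguous run of suffixes that start with the pattern
      while (idx < Index.Count) and
            (Copy(Index[idx], 1, Length(Patterns[p])) = Patterns[p]) do
      begin
        Remove[NativeInt(Index.Objects[idx])] := True;
        Inc(idx);
      end;
    end;

    for w := 0 to High(Words) do
      if Remove[w] then
        WriteLn(Words[w], ': remove')
      else
        WriteLn(Words[w], ': keep');
  finally
    Index.Free;
  end;
end.
```

A word contains a pattern exactly when one of its suffixes starts with that pattern, which is why a prefix lookup in the sorted suffix list is enough. The price is memory: one index entry per character of every string in array #1.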

It seems to me that part of the issue is that you are looping 2.5M * 4.5M times. Have you tried using TStringList instead of arrays? If your arrays were TStringList instead (say SA1, SA2), you could write code such as this:

var
  i, j: integer;
begin
  SA1.CaseSensitive := false;
  SA2.CaseSensitive := false;
  SA1.Sort;
  SA2.Sort;
  for i := 0 to SA2.Count - 1 do
  begin
    while true do
    begin
      // if we have deleted all the SA1 items, no more processing is required
      if SA1.Count = 0 then
        exit;
      // locate SA2[i] in SA1; without an exact match, j is the insertion point
      SA1.Find(SA2[i], j);
      // j can equal SA1.Count (insertion point past the end), so guard the access
      // AnsiContainsText (unit StrUtils) keeps the test case-insensitive
      if (j < SA1.Count) and AnsiContainsText(SA1[j], SA2[i]) then
        SA1.Delete(j) // line j contains the text of SA2[i]
      else if (j - 1 >= 0) and AnsiContainsText(SA1[j - 1], SA2[i]) then
        SA1.Delete(j - 1) // the previous line contains the text
      else if (j + 1 < SA1.Count) and AnsiContainsText(SA1[j + 1], SA2[i]) then
        SA1.Delete(j + 1) // the next line contains the text
      else // no neighbouring line matches: move on to the next SA2 item
        break;
    end;
  end;
end;

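We are using Find on a case-insensitive, sorted list of strings. This runs through the 4.5M-item list just once and deletes items from the 2.5M-item list as it goes (i.e. SA1 shrinks during the loop). At the end of the loop, SA1 should contain only the strings you want, with one caveat: Find locates SA2[i] by sort order and only the lines next to that position are checked, so a string that starts with SA2[i] (such as "houseboat" for "house") is found, but a string that contains it further in (such as "lighthouse") can be missed. Perhaps you should give this a try and see if it works for you (and hopefully improves performance)?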

Hope this helps.

UPDATE 20170221: I updated the code to use Find to locate the index of the SA2 string in SA1 before deleting it. I also updated the code to allow for the SA2 string occurring more than once in SA1, and to make the TStringList operations case-insensitive.


 