
Looping through an ArrayList with another ArrayList in Java

I have a large array list of sentences and another array list of words.

My program loops through the sentence list and removes a sentence from it if that sentence contains any of the words from the other list.

The sentences array list can be very large and I coded a quick and dirty nested for loop. While this works when there are not many sentences, in cases where there are, the time it takes to finish this operation is ridiculously long.

for (int i = 0; i < SENTENCES.size(); i++) {
    for (int k = 0; k < WORDS.size(); k++) {
        if (SENTENCES.get(i).contains(" " + WORDS.get(k) + " ") == true) {
            // Do something
        }
    }
}

Is there a more efficient way of doing this than a nested for loop?

There are a few inefficiencies in your code, but at the end of the day, if you've got to search sentences for words then there's no getting away from loops.

That said, there are a couple of things to try.

First, make WORDS a HashSet; its contains method will be far quicker than ArrayList's because it does a hash look-up rather than a linear scan.
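
If changing the declaration of WORDS isn't convenient, a one-time copy into a set gives the same benefit (a sketch; wordSet is just an illustrative name):

// One-time copy of the word list into a HashSet (java.util imports assumed).
// HashSet.contains is an average O(1) hash look-up; ArrayList.contains is a linear scan.
Set<String> wordSet = new HashSet<>(WORDS);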

Second, switch the logic about a bit like this:

Iterator<String> sentenceIterator = SENTENCES.iterator();

sentenceLoop:
while (sentenceIterator.hasNext())
{
  String sentence = sentenceIterator.next();

  // Strip punctuation, lower-case, and split on whitespace to get the individual words
  for (String word : sentence.replaceAll("\\p{P}", " ").toLowerCase().split("\\s+"))
  {
    if (WORDS.contains(word))
    {
      sentenceIterator.remove();
      continue sentenceLoop;
    }
  }
}

This code (which assumes you're trying to remove sentences that contain certain words) uses an Iterator and avoids the string concatenation and parsing logic you had in your original code (replacing it with a single regex), both of which should be quicker.

But bear in mind, as with all things performance-related, you'll need to test these changes to see whether they improve the situation.

I would say no, but what you must change is the way you handle the removal of the data. This is pointed out by this part of the description of your problem:

The sentences array list can be very large (...). While this works when there are not many sentences, in cases where there are, the time it takes to finish this operation is ridiculously long.

The cause of this is that removal from an ArrayList takes O(N), and since you're doing it inside a loop, the whole operation takes at least O(N^2).

I recommend using a LinkedList rather than an ArrayList to store the sentences, and an Iterator rather than your naive List#get, since Iterator#remove runs in O(1) on a LinkedList.

In case you cannot change the design to a LinkedList, I recommend storing the valid sentences in a new List and, at the end, replacing the contents of your original List with this new List, thus saving a lot of time.
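
A minimal sketch of that idea (containsForbiddenWord is a hypothetical predicate standing in for whatever word check you use):

List<String> kept = new ArrayList<>(SENTENCES.size());
for (String sentence : SENTENCES) {
    if (!containsForbiddenWord(sentence)) {  // hypothetical helper: true if the sentence has a word to remove
        kept.add(sentence);
    }
}
// Replace the contents of the original list in one step instead of many O(N) removals
SENTENCES.clear();
SENTENCES.addAll(kept);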

Apart from this big improvement, you can improve the algorithm even more by storing the words to look up in a Set rather than another List, since lookup in a Set is O(1).

I will create a set of words from the second ArrayList:

Set<String> listOfWords = new HashSet<String>();
listOfWords.add("one");
listOfWords.add("two");

I will then iterate over the set and the first ArrayList and use contains:

for (String word : listOfWords) {
    for (String sentence : Sentences) {
        if (sentence.contains(word)) {
            // do something
        }
    }
}

Also, if you are free to use any open source jar, check this out:

searching string in another string

First, your program has a bug: it would not match words at the beginning or at the end of a sentence (for example, searching for " fox " misses a sentence that starts with "Fox ...").

Your current program has runtime complexity of O(s*w), where s is the length, in characters, of all sentences, and w is the length of all words, also in characters.

If WORDS is relatively small (a few hundred items or so) you could use a regex to speed things up considerably: construct a pattern like this, and use it in a loop:

StringBuilder regex = new StringBuilder();
boolean first = true;
// Let's say WORDS={"quick", "brown", "fox"}
regex.append("\\b(?:");
for (String w : WORDS) {
    if (!first) {
        regex.append('|');
    } else {
        first = false;
    }
    regex.append(w);
}
regex.append(")\\b");
// Now regex is "\b(?:quick|brown|fox)\b", i.e. your list of words
// separated by OR signs, enclosed in non-capturing groups
// anchored to word boundaries by '\b's on both sides.
Pattern p = Pattern.compile(regex.toString());
for (int i = 0; i < SENTENCES.size(); i++) {
    if (p.matcher(SENTENCES.get(i)).find()) {
        // Do something
    }
}
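
One defensive tweak worth considering (not something the original code needs if the words are plain letters): if any word could contain regex metacharacters, escape it when appending, e.g.

regex.append(Pattern.quote(w));  // Pattern.quote escapes any regex metacharacters in the word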

Since the regex gets pre-compiled into a structure more suitable for fast searches, your program would run in O(s*max(w)), where s is the length, in characters, of all sentences, and max(w) is the length of the longest word. Given that the number of words in your collection is about 200 or 300, this could give you an order of magnitude decrease in running time.

What you could do is put all your words into a HashSet. This allows you to check if a word is in the set very quickly. See https://docs.oracle.com/javase/8/docs/api/java/util/HashSet.html for documentation.

HashSet<String> wordSet = new HashSet<>();
for (String word : WORDS) {
    wordSet.add(word);
}

Then it's just a matter of splitting each sentence into the words that make it up, and checking if any of those words are in the set.

for (String sentence : SENTENCES) {
    String[] sentenceWords = sentence.split(" "); // You probably want to use a regex here instead of just splitting on a " ", but this is just an example.
    for (String word : sentenceWords) {
        if (wordSet.contains(word)) {
            // The sentence contains one of the special words.
            // DO SOMETHING
            break;
        }
    }
}

If you have enough memory, you can tokenize each sentence and keep its words in a Set. That would perform better and also be more correct than the current implementation.
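
A rough sketch of that idea, assuming wordSet is a HashSet of the words to look for (as in the snippet above; the split regex is illustrative):

for (String sentence : SENTENCES) {
    // Tokenize the sentence once and keep its tokens in a set
    Set<String> tokens = new HashSet<>(Arrays.asList(sentence.split("\\W+")));
    tokens.retainAll(wordSet);   // intersection with the words we care about
    if (!tokens.isEmpty()) {
        // the sentence contains at least one of the words
    }
}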

Well, looking at your code I would suggest two things that will improve the performance of each iteration:

  1. Remove "== true". The contains method already returns a boolean, which is enough for the if; comparing it with true adds an unneeded extra operation on every iteration.
  2. Do not concatenate Strings inside the loop (" " + WORDS.get(k) + " "), as it is quite an expensive operation: the + operator creates new objects on every pass. Better to use a StringBuilder and clear it after each iteration with stringBuilder.setLength(0) (a sketch follows this list).
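
A minimal sketch of point 2 with the original loops, reusing a single StringBuilder (String.contains accepts any CharSequence, so the builder can be passed directly):

StringBuilder pattern = new StringBuilder();
for (int i = 0; i < SENTENCES.size(); i++) {
    for (int k = 0; k < WORDS.size(); k++) {
        pattern.setLength(0);                                  // clear and reuse the same builder
        pattern.append(' ').append(WORDS.get(k)).append(' ');  // build " word " without String concatenation
        if (SENTENCES.get(i).contains(pattern)) {
            // Do something
        }
    }
}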

Besides that, I do not know any other approach for this case; maybe you can use regular expressions if you can abstract a pattern out of the words you want to remove and then have only one loop.

Hope it helps!

If you are concerned about efficiency, I think the most effective way to do this is the Aho-Corasick algorithm. Your current solution has two nested loops plus a contains() call (which I think takes, at best, time proportional to sentence length + word length), whereas Aho-Corasick needs only one loop over the sentences: checking a sentence against all the words takes time proportional to the sentence length alone, i.e. roughly a factor of the word length faster (plus a relatively small preprocessing cost to build the finite state machine).
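
If pulling in a library is acceptable, one option is the open-source org.ahocorasick:ahocorasick artifact; a rough usage sketch, assuming its Trie builder API (type and method names may differ between versions):

// Build the automaton once from the word list (the preprocessing step)
Trie.TrieBuilder builder = Trie.builder().onlyWholeWords();
for (String word : WORDS) {
    builder.addKeyword(word);
}
Trie trie = builder.build();

// One pass per sentence; remove it if any keyword is found
Iterator<String> it = SENTENCES.iterator();
while (it.hasNext()) {
    if (!trie.parseText(it.next()).isEmpty()) {
        it.remove();
    }
}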

I'll approach this from a more theoretical view. If you don't have a memory limitation, you can try to mimic the logic of counting sort.

Say M1 = sentences.size(), M2 = the number of words per sentence, and N = words.size().
Assume all sentences have the same number of words, just for simplicity.
Your current approach's complexity is O(M1·M2·N).

We can create a mapping of words to their positions in the sentences. Loop through your ArrayList of sentences and turn it into a two-dimensional jagged array of words. Loop through the new array and build a HashMap whose key is a word and whose value is an ArrayList of the positions where that word occurs (say of length X).
That's O(2·M1·M2·X) = O(M1·M2·X).

Then loop through your words ArrayList, look each word up in the HashMap, loop through its list of positions and remove each one. That's O(N·X).

Say you need to return the result as an ArrayList of Strings; we need another loop to concatenate everything back together. That's O(M1·M2).

Total complexity is O(M1·M2·X) + O(N·X) + O(M1·M2).
Assuming X is much smaller than N, you'll probably get better performance.
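
A loose sketch of that plan, assuming the goal is to drop whole sentences that contain any of the words (names and the tokenizing regex are illustrative):

// Build the index: token -> indices of sentences containing it  (O(M1·M2))
Map<String, List<Integer>> index = new HashMap<>();
for (int i = 0; i < SENTENCES.size(); i++) {
    for (String token : SENTENCES.get(i).split("\\W+")) {
        index.computeIfAbsent(token, k -> new ArrayList<>()).add(i);
    }
}

// Look each word up and collect the sentence positions to drop  (O(N·X))
Set<Integer> toRemove = new HashSet<>();
for (String word : WORDS) {
    List<Integer> positions = index.get(word);
    if (positions != null) {
        toRemove.addAll(positions);
    }
}

// Rebuild the result list in one pass  (O(M1))
List<String> result = new ArrayList<>();
for (int i = 0; i < SENTENCES.size(); i++) {
    if (!toRemove.contains(i)) {
        result.add(SENTENCES.get(i));
    }
}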
