
Simplest and most efficient ways of filtering a list in Java?

I want to filter irrelevant words from each phrase in an incoming stream of tweets.

I can do so using the ArrayList like this:

import java.util.ArrayList;

// Example Tweet
String tweetText = "Awful glad vaccine is coming at last! #COVID19";

// First convert tweet text to array of words
String text = tweetText
                .replaceAll("\\p{Punct}", "")
                .replaceAll("\\r|\\n", "")
                .toLowerCase();

String[] words = text.split(" ");

// We define an array of irrelevant words to be filtered out
String[] irrelevantWords = {"is", "at", "http", "https", "football"};

// first we create an extensible ArrayList to add filtered words to
ArrayList<String> filteredWords = new ArrayList<String>();

// we assume each word is relevant to begin with...
boolean relevant;

// ... and then we check by iterating over each word...
for (String w : words){
    
    // ... assuming initially that it is relevant ...
    relevant = true;
    
    // ... and iterating over each irrelevant word ...
    for (String irrelevant : irrelevantWords){
        
        // ... and if a word is the same as an irrelevant word
        if (w.equals(irrelevant)){ 
            
            // ... we know that it is not relevant.
            relevant = false; 
        }
    }
    // If, having compared the word to all the irrelevant words,
    // it is still found to be relevant, we add it to our ArrayList.
    if (relevant) { filteredWords.add(w); }
}
// NB: This is not the most efficient method of filtering words,
// but it is the most simple to understand and implement.

System.out.println(filteredWords);

While this is simple to understand and implement for someone new to Java (it relies only on nested for loops, although we do have to import ArrayList), it is inefficient.

What are the best ways (either simpler or more efficient) of doing this?

The simplest way to filter a preset list of irrelevant words from a string is a regex replace. The following code removes any occurrences of the words bad and words, but not of badass or nicewords:

String tweet = ...;
String filteredTweet = tweet.replaceAll("(?<=( |^))(bad|words)(?=( |$))", "");

You can add more words and even regexes to this list, separated by a |.
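
If the word list lives in a collection rather than in a hand-written pattern, the alternation can be assembled in code. The following is only a sketch: the class and method names (RegexWordFilter, buildPattern, filter) are made up for illustration, and collapsing the leftover spaces is an extra step that the one-liner above does not do. Pattern.quote is used on the assumption that the entries are literal words, not regexes.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class RegexWordFilter {

    // Builds one pattern that matches any of the given words as a whole,
    // space-delimited token. Pattern.quote stops characters such as '.' or '+'
    // inside a word from being read as regex syntax.
    static Pattern buildPattern(List<String> words) {
        String alternation = words.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        return Pattern.compile("(?<=( |^))(" + alternation + ")(?=( |$))");
    }

    // Removes the matched words, then collapses the spaces they leave behind.
    static String filter(String tweet, Pattern pattern) {
        return pattern.matcher(tweet)
                .replaceAll("")
                .replaceAll(" +", " ")
                .trim();
    }

    public static void main(String[] args) {
        Pattern pattern = buildPattern(Arrays.asList("bad", "words"));
        // prints: badass nicewords
        System.out.println(filter("bad badass words nicewords", pattern));
    }
}
```

Compiling the pattern once and reusing it for every tweet avoids recompiling the regex on each call, which String.replaceAll does internally.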

Here is one way. I added a word (last) to the list.

// Example Tweet
String tweetText = "Awful glad vaccine is coming at last! #COVID19";

// We define an array of irrelevant words to be filtered out
String[] irrelevantWords = {"is", "at", "http", "https", "last", "football"};
for (String irr : irrelevantWords) {
    tweetText = tweetText.replaceAll("\\s+\\b"+irr+"\\b","");
}
System.out.println(tweetText);

Prints

Awful glad vaccine coming! #COVID19

Definitely simpler, but not very efficient: the loop rescans the whole tweet once per irrelevant word. Regular expressions in general are not necessarily efficient either. They are a general-purpose mechanism, so they carry extra overhead. Writing a custom parser is usually more efficient, but certainly not simpler.
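
To give an idea of what a "custom parser" could mean here, the following is a rough sketch of a hand-written scanner that walks the tweet once, without any regex. Everything in it (SinglePassFilter, relevantWords, flush) is a hypothetical name, and it makes its own assumptions: words are separated by whitespace, letters are lowercased, and any character that is not a letter or digit is dropped. Whether it is actually faster for your data would need measuring.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SinglePassFilter {

    // Scans the text once, cutting it into words at whitespace and keeping
    // only the words that are not in the irrelevant set.
    static List<String> relevantWords(String text, Set<String> irrelevant) {
        List<String> kept = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                flush(current, irrelevant, kept);
            } else if (Character.isLetterOrDigit(c)) {
                current.append(Character.toLowerCase(c));
            }
            // any other character (punctuation, symbols) is skipped
        }
        flush(current, irrelevant, kept); // the last word has no trailing whitespace
        return kept;
    }

    // Emits the word collected so far, if there is one and it is relevant.
    private static void flush(StringBuilder current, Set<String> irrelevant, List<String> kept) {
        if (current.length() > 0) {
            String word = current.toString();
            if (!irrelevant.contains(word)) {
                kept.add(word);
            }
            current.setLength(0);
        }
    }

    public static void main(String[] args) {
        Set<String> irrelevant =
                new HashSet<>(Arrays.asList("is", "at", "http", "https", "football"));
        // prints: [awful, glad, vaccine, coming, last, covid19]
        System.out.println(relevantWords("Awful glad vaccine is coming at last! #COVID19", irrelevant));
    }
}
```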

Use a HashSet to store the irrelevant words.

Set<String> irrelevantWords = new HashSet<String>();

Add the words to this set and use irrelevantWords.contains(word) to check if the word is irrelevant.

A HashSet lookup is O(1) on average, against O(n) for a list or array. Since you are doing the lookup inside a loop, this can greatly improve performance as the list of irrelevant words grows.
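
Put together with the preprocessing from the question, this could look roughly like the sketch below. The class name HashSetFilter and the method name filter are placeholders for illustration. The inner loop over irrelevant words disappears and is replaced by a single contains call.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class HashSetFilter {

    // Built once and reused for every tweet.
    static final Set<String> IRRELEVANT_WORDS =
            new HashSet<>(Arrays.asList("is", "at", "http", "https", "football"));

    static List<String> filter(String tweetText) {
        // Same normalisation as in the question: strip punctuation and
        // line breaks, then lowercase.
        String text = tweetText
                .replaceAll("\\p{Punct}", "")
                .replaceAll("\\r|\\n", "")
                .toLowerCase();

        List<String> filteredWords = new ArrayList<>();
        for (String w : text.split(" ")) {
            // One hashed lookup per word instead of a loop over all irrelevant words.
            if (!IRRELEVANT_WORDS.contains(w)) {
                filteredWords.add(w);
            }
        }
        return filteredWords;
    }

    public static void main(String[] args) {
        // prints: [awful, glad, vaccine, coming, last, covid19]
        System.out.println(filter("Awful glad vaccine is coming at last! #COVID19"));
    }
}
```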

If you work with collections, life is easier:

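import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

Set<String> irrelevantWords = Set.of("is", "at", "http", "https", "football"); // an immutable Set (Java 9+), not a HashSet

List<String> filteredWords = Arrays.stream(text.split(" +"))
  .filter(word -> !irrelevantWords.contains(word))
  .collect(Collectors.toList());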
