
Most simple and most efficient ways of filtering a list in Java?

I want to filter irrelevant words from each phrase in an incoming stream of tweets.

I can do so using an ArrayList like this:

import java.util.ArrayList;

// Example Tweet
String tweetText = "Awful glad vaccine is coming at last! #COVID19";

// First convert tweet text to array of words
String text = tweetText
                .replaceAll("\\p{Punct}", "")
                .replaceAll("\\r|\\n", "")
                .toLowerCase();

String[] words = text.split(" ");

// We define an array of irrelevant words to be filtered out
String[] irrelevantWords = {"is", "at", "http", "https", "football"};

// first we create an extensible ArrayList to add filtered words to
ArrayList<String> filteredWords = new ArrayList<String>();

// we assume each word is relevant to begin with...
boolean relevant;

// ... and then we check by iterating over each word...
for (String w : words){
    
    // ... assuming initially that it is relevant ...
    relevant = true;
    
    // ... and iterating over each irrelevant word ...
    for (String irrelevant : irrelevantWords){
        
        // ... and if a word is the same as an irrelevant word
        if (w.equals(irrelevant)){ 
            
            // ... we know that it is not relevant.
            relevant = false; 
        }
    }
    // If, having compared the word to all the irrelevant words,
    // it is still found to be relevant, we add it to our ArrayList.
    if (relevant == true){filteredWords.add(w);}
}
// NB: This is not the most efficient method of filtering words,
// but it is the most simple to understand and implement.

System.out.println(filteredWords);

But while this is simple to understand and implement for someone new to Java (basically it just depends on nested for loops, although we do have to import ArrayList), it is inefficient.

What are the best ways (either the simplest or the most efficient) of doing this?

The simplest way to filter a preset list of irrelevant words from a string is a regex replace. The following code removes any occurrences of the words bad and words, but not of badass and nicewords:

String tweet = ...;
String filteredTweet = tweet.replaceAll("(?<=( |^))(bad|words)(?=( |$))", "");

You can add more words, and even regexes, to this list, separated by a |.
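As a rough sketch of how that alternation could be built from a word list instead of typed by hand: the class and method names below (RegexWordFilter, buildPattern, filter) are made up for illustration and are not part of the answer above. Each word is escaped with Pattern.quote so that it is matched literally, and the extra step that collapses leftover spaces is an assumption about the desired output, not something the original one-liner does.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class RegexWordFilter {

    // Hypothetical helper: one pattern that matches any listed word
    // when it stands alone between spaces or at the string boundaries.
    static Pattern buildPattern(List<String> words) {
        String alternation = words.stream()
                .map(Pattern::quote)              // treat each word literally
                .collect(Collectors.joining("|"));
        return Pattern.compile("(?<=^| )(" + alternation + ")(?= |$)");
    }

    // Removes the matched words, then tidies the spaces they leave behind
    // (the tidy-up is an assumption about what the output should look like).
    static String filter(String tweet, Pattern pattern) {
        return pattern.matcher(tweet).replaceAll("")
                .replaceAll(" {2,}", " ")
                .trim();
    }

    public static void main(String[] args) {
        Pattern p = buildPattern(List.of("bad", "words"));
        // prints: badass nicewords are not
        System.out.println(filter("bad badass nicewords are not words", p));
    }
}
```

Compiling the pattern once and reusing it for every tweet avoids recompiling the regex on each call, which is what String.replaceAll does internally.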

Here is one way. I added a word to the list.

// Example Tweet
String tweetText = "Awful glad vaccine is coming at last! #COVID19";

// We define an array of irrelevant words to be filtered out
String[] irrelevantWords = {"is", "at", "http", "https", "last", "football"};
for (String irr : irrelevantWords) {
    tweetText = tweetText.replaceAll("\\s+\\b"+irr+"\\b","");
}
System.out.println(tweetText);

Prints

Awful glad vaccine coming! #COVID19

Definitely simpler, but not very efficient. Regular expressions are not necessarily efficient either: they are a general-purpose mechanism for performing a task simply, so there is extra overhead. Writing a custom parser is usually more efficient, but certainly not simpler.
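To give an idea of what a hand-written alternative might look like, here is a minimal single-pass sketch. It is an illustration only: SinglePassWordFilter and its filter method are hypothetical names, and it compares whole whitespace-delimited tokens, so it behaves differently from the regex loop above (for example, "last!" keeps its punctuation and therefore does not match "last"). Whether it is actually faster would need to be measured.

```java
import java.util.Set;

class SinglePassWordFilter {

    // Hypothetical helper: scans the tweet once, copying every
    // whitespace-delimited token that is not in the irrelevant set.
    static String filter(String tweet, Set<String> irrelevant) {
        StringBuilder out = new StringBuilder(tweet.length());
        int i = 0;
        int n = tweet.length();
        while (i < n) {
            // Skip any run of whitespace before the next token.
            while (i < n && Character.isWhitespace(tweet.charAt(i))) {
                i++;
            }
            int start = i;
            // Advance to the end of the token.
            while (i < n && !Character.isWhitespace(tweet.charAt(i))) {
                i++;
            }
            if (start == i) {
                break; // only trailing whitespace was left
            }
            String token = tweet.substring(start, i);
            if (!irrelevant.contains(token)) {
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(token);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Set<String> irrelevant = Set.of("is", "at", "http", "https", "last", "football");
        // prints: Awful glad vaccine coming last! #COVID19
        System.out.println(filter("Awful glad vaccine is coming at last! #COVID19", irrelevant));
    }
}
```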

Use a HashSet to store the irrelevant words.

Set<String> irrelevantWords = new HashSet<String>();

Add the words to this set and use irrelevantWords.contains(word) to check if the word is irrelevant.

A lookup in a HashSet is O(1) on average, versus O(n) for a list or array. Since you are doing the lookup inside a loop, this can noticeably improve performance as the list of irrelevant words grows.
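Put together with the normalisation from the question, the idea could look roughly like this. It is a sketch rather than a definitive version: the class name HashSetWordFilter and the filter method are invented for the example, while the tweet, the word list, and the cleaning steps are taken from the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class HashSetWordFilter {

    // Hypothetical helper: same cleaning steps as in the question, but the
    // inner loop over irrelevant words is replaced by a single set lookup.
    static List<String> filter(String tweetText, Set<String> irrelevantWords) {
        String text = tweetText
                .replaceAll("\\p{Punct}", "")
                .replaceAll("\\r|\\n", "")
                .toLowerCase();

        List<String> filteredWords = new ArrayList<>();
        for (String w : text.split(" ")) {
            if (!irrelevantWords.contains(w)) {
                filteredWords.add(w);
            }
        }
        return filteredWords;
    }

    public static void main(String[] args) {
        Set<String> irrelevantWords =
                new HashSet<>(Arrays.asList("is", "at", "http", "https", "football"));
        // prints: [awful, glad, vaccine, coming, last, covid19]
        System.out.println(filter("Awful glad vaccine is coming at last! #COVID19", irrelevantWords));
    }
}
```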

If you work with collections, life is easier:

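import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Set.of (Java 9+) returns an immutable Set, not a java.util.HashSet
Set<String> irrelevantWords = Set.of("is", "at", "http", "https", "football");

// text is the cleaned, lower-cased tweet from the question
List<String> filteredWords = Arrays.stream(text.split(" +"))
  .filter(word -> !irrelevantWords.contains(word))
  .collect(Collectors.toList());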


 