The most time-efficient way to remove stop words from a string array in Java

Question

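How can I remove these stop words in the most efficient way? The method below does not remove the stop words. What am I missing?

Is there any other way to do this?

I want to accomplish this task in Java in the most time-efficient way.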

public static HashSet<String> hs = new HashSet<String>();


public static String[] stopwords = {"a", "able", "about",
        "across", "after", "all", "almost", "also", "am", "among", "an",
        "and", "any", "are", "as", "at", "b", "be", "because", "been",
        "but", "by", "c", "can", "cannot", "could", "d", "dear", "did",
        "do", "does", "e", "either", "else", "ever", "every", "f", "for",
        "from", "g", "get", "got", "h", "had", "has", "have", "he", "her",
        "hers", "him", "his", "how", "however", "i", "if", "in", "into",
        "is", "it", "its", "j", "just", "k", "l", "least", "let", "like",
        "likely", "m", "may", "me", "might", "most", "must", "my",
        "neither", "n", "no", "nor", "not", "o", "of", "off", "often",
        "on", "only", "or", "other", "our", "own", "p", "q", "r", "rather",
        "s", "said", "say", "says", "she", "should", "since", "so", "some",
        "t", "than", "that", "the", "their", "them", "then", "there",
        "these", "they", "this", "tis", "to", "too", "twas", "u", "us",
        "v", "w", "wants", "was", "we", "were", "what", "when", "where",
        "which", "while", "who", "whom", "why", "will", "with", "would",
        "x", "y", "yet", "you", "your", "z"};
public StopWords()
{
    int len= stopwords.length;
    for(int i=0;i<len;i++)
    {
        hs.add(stopwords[i]);
    }
    System.out.println(hs);
}

public List<String> removedText(List<String> S)
{
    Iterator<String> text = S.iterator();

    while(text.hasNext())
    {
        String token = text.next();
        if(hs.contains(token))
        {

                S.remove(text.next());
        }
        text = S.iterator();
    }
    return S;
}

Answer 1

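You should not modify a list through the list itself while you are iterating over it. Also, you call next() twice inside a loop iteration that is guarded by a single hasNext() check. Instead, you should use the iterator to remove the item: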

public static List<String> removedText(List<String> s) {
    Iterator<String> text = s.iterator();

    while (text.hasNext()) {
        String token = text.next();
        if (hs.contains(token)) {
            text.remove();
        }
    }
    return s;
}

But this is somewhat "reinventing the wheel"; you can just use the removeAll(Collection) method:

s.removeAll(hs);
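
For reference, here is a minimal self-contained sketch of the removeAll approach. The class name StopWordFilter, the method name removeStopWords, and the shortened stop-word set are assumptions made for illustration; the full list from the question would go in their place.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopWordFilter {
    // Shortened, illustrative stop-word set (assumption); use the full list in practice.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "the", "of", "is"));

    // Removes stop words in place and returns the same list instance.
    // The list must support removal (e.g. an ArrayList, not a bare Arrays.asList view).
    public static List<String> removeStopWords(List<String> tokens) {
        tokens.removeAll(STOP_WORDS);
        return tokens;
    }

    public static void main(String[] args) {
        List<String> tokens =
                new ArrayList<>(Arrays.asList("the", "cat", "and", "a", "dog"));
        System.out.println(removeStopWords(tokens)); // prints [cat, dog]
    }
}
```

Matching is exact and case-sensitive here, so a token such as "The" would survive unless the tokens are lowercased first.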

Answer 2

Perhaps you can use org.apache.commons.lang.ArrayUtils inside a loop.

// In commons-lang 2.6 removeElement returns Object[], so a cast is needed
stopwords = (String[]) ArrayUtils.removeElement(stopwords, element);

https://commons.apache.org/proper/commons-lang/javadocs/api-2.6/org/apache/commons/lang/ArrayUtils.html
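
If the input really is a String[] and pulling in an extra library is not desired, a JDK-only alternative is to filter the array through a stream. This is a sketch of a different technique from the ArrayUtils call above, not a drop-in equivalent; the class name ArrayStopWordFilter and the shortened stop-word set are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class ArrayStopWordFilter {
    // Shortened, illustrative stop-word set (assumption).
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "the", "of", "is"));

    // Returns a new array containing only the tokens that are not stop words.
    // The input array is left unchanged.
    public static String[] removeStopWords(String[] tokens) {
        return Arrays.stream(tokens)
                .filter(token -> !STOP_WORDS.contains(token))
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        String[] tokens = {"the", "cat", "and", "a", "dog"};
        System.out.println(Arrays.toString(removeStopWords(tokens))); // prints [cat, dog]
    }
}
```

This makes a single pass over the array, whereas removing elements one at a time from an array copies the array on every removal.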

Answer 3

I think the most efficient approach is to use the binarySearch method with a sorted list of terms:

int indexStop = Collections.binarySearch(EncyclopediaHelper.getStopWords(), string, String::compareToIgnoreCase);

// binarySearch returns a negative value when the key is absent; index 0 is a valid hit
boolean stop = indexStop >= 0;

More information here: How does Collections.binarySearch perform compared to manually searching a list?
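
A runnable sketch of this idea follows. It assumes the list is sorted with the same comparator that is later passed to binarySearch, which Collections.binarySearch requires for a defined result. The class name SortedStopWords and the shortened word list are hypothetical stand-ins for EncyclopediaHelper.getStopWords().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class SortedStopWords {
    // Shortened, illustrative word list (assumption), sorted once at class load.
    private static final List<String> SORTED_STOP_WORDS = buildSortedList();

    private static List<String> buildSortedList() {
        List<String> words =
                new ArrayList<>(Arrays.asList("the", "a", "and", "of", "is", "an"));
        // Sort with the same comparator that binarySearch will use.
        words.sort(String::compareToIgnoreCase);
        return Collections.unmodifiableList(words);
    }

    // Case-insensitive membership test via binary search.
    public static boolean isStopWord(String token) {
        int index = Collections.binarySearch(
                SORTED_STOP_WORDS, token, String::compareToIgnoreCase);
        return index >= 0; // a negative result means "not found"
    }

    public static void main(String[] args) {
        System.out.println(isStopWord("The")); // prints true
        System.out.println(isStopWord("cat")); // prints false
    }
}
```

Each lookup is O(log n), while a HashSet lookup is O(1) on average, so whether this beats the set-based approach is worth measuring. What it does give you is case-insensitive matching without lowercasing every token.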

Answer 4

Please try the following suggested changes:

public static List<String> removedText(List<String> S)
{
    Iterator<String> text = S.iterator();

    while(text.hasNext())
    {
        String token = text.next();
        if(hs.contains(token))
        {

                // Remove through the iterator; calling S.remove(token) during
                // iteration can throw ConcurrentModificationException.
                text.remove();
        }
        // No need to re-assign text = S.iterator() on every pass.
    }
    return S;
}

4 solutions

Solution 1
1 Accepted 2016-01-20 08:06:18

Solution 2
0 2016-01-20 08:19:37

Solution 3
0 2022-07-21 08:45:47

Solution 4
-1 2016-01-20 06:17:37
