Removing stopwords from a String in Java
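I have a String containing a large number of words, and a text file containing stopwords that need to be removed from that String. Suppose I have the String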
s="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs."
After removing the stopwords, the String should be:
"love phone, super fast much cool jelly bean....but recently bugs."
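I have been able to achieve this, but the problem I am facing is that whenever there are adjacent stopwords in the String, only the first one is removed, and I get this result: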
"love phone, super fast there's much and cool with jelly bean....but recently seen bugs"
Here is my stopwordslist.txt file: Stopwords
How can I solve this problem? Here is what I have done so far:
int k=0,i,j;
ArrayList<String> wordsList = new ArrayList<String>();
String sCurrentLine;
String[] stopwords = new String[2000];
try{
    FileReader fr=new FileReader("F:\\stopwordslist.txt");
    BufferedReader br= new BufferedReader(fr);
    while ((sCurrentLine = br.readLine()) != null){
        stopwords[k]=sCurrentLine;
        k++;
    }
    String s="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.";
    StringBuilder builder = new StringBuilder(s);
    String[] words = builder.toString().split("\\s");
    for (String word : words){
        wordsList.add(word);
    }
    for(int ii = 0; ii < wordsList.size(); ii++){
        for(int jj = 0; jj < k; jj++){
            if(stopwords[jj].contains(wordsList.get(ii).toLowerCase())){
                wordsList.remove(ii);
                break;
            }
        }
    }
    for (String str : wordsList){
        System.out.print(str+" ");
    }
}catch(Exception ex){
    System.out.println(ex);
}
Here is a more elegant solution (IMHO), using only regular expressions:
// instead of the ".....", add all your stopwords, separated by "|"
// "\\b" is to account for word boundaries, i.e. not replace "his" in "this"
// the "\\s?" is to suppress optional trailing white space
Pattern p = Pattern.compile("\\b(I|this|its.....)\\b\\s?");
Matcher m = p.matcher("I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.");
String s = m.replaceAll("");
System.out.println(s);
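Since the stopwords live in a file, the alternation does not have to be typed by hand. The following is a minimal sketch, assuming the stopwords have already been loaded into a list; the class name `StopwordRegexDemo`, the helper names, and the sample stopwords are made up for illustration. `Pattern.quote` and the `CASE_INSENSITIVE` flag are additions that the snippet above does not use.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class StopwordRegexDemo {

    // Builds one alternation pattern from the stopwords. Pattern.quote keeps
    // characters such as '.' from being read as regex syntax.
    static Pattern buildPattern(List<String> stopwords) {
        String alternation = stopwords.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        // \b guards word boundaries; \s? swallows one trailing whitespace character.
        return Pattern.compile("\\b(?:" + alternation + ")\\b\\s?",
                Pattern.CASE_INSENSITIVE);
    }

    static String removeStopwords(String text, List<String> stopwords) {
        return buildPattern(stopwords).matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // Hypothetical stopword list, standing in for the contents of the file.
        List<String> stopwords = Arrays.asList("I", "this", "its", "and");
        String text = "I love this phone, its super fast and cool";
        System.out.println(removeStopwords(text, stopwords));
        // prints: love phone, super fast cool
    }
}
```

Note that an apostrophe counts as a word boundary, so a stopword such as "I" also matches the first letter of "I've".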
Try the program below.
String s="I love this phone, its super fast and there's so" +
        " much new and cool things with jelly bean....but of recently I've seen some bugs.";
String[] words = s.split(" ");
ArrayList<String> wordsList = new ArrayList<String>();
Set<String> stopWordsSet = new HashSet<String>();
stopWordsSet.add("I");
stopWordsSet.add("THIS");
stopWordsSet.add("AND");
stopWordsSet.add("THERE'S");
for(String word : words)
{
    String wordCompare = word.toUpperCase();
    if(!stopWordsSet.contains(wordCompare))
    {
        wordsList.add(word);
    }
}
for (String str : wordsList){
    System.out.print(str+" ");
}
Output: love phone, its super fast so much new cool things with jelly bean....but of recently I've seen some bugs.
You can use the replaceAll function like this:
String yourString = "I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.";
yourString = yourString.replaceAll("stop", ""); // "stop" stands for the stopword to remove
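The error occurs because you remove elements from the list while you are iterating over it by index. Say wordsList contains |word0|word1|word2|. If ii equals 1 and the if test is true, you call wordsList.remove(1);. After that your list is |word0|word2|. ii is then incremented to 2, which is no longer less than the size of the list, so word2 is never tested.
There are several ways to fix this. For example, you can set the value to "" instead of removing it, or collect the words you want to keep in a separate "result" list.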
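The "result" list idea can be sketched roughly as follows. This is an illustration under assumptions, not the asker's code: the class name `ResultListDemo`, the method name `filter`, and the sample stopwords are invented, and the stopwords are assumed to be stored in lower case in a `Set`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class ResultListDemo {

    // Copies every word that is not a stopword into a new list, so the list
    // being read is never modified and no index is skipped.
    static List<String> filter(List<String> words, Set<String> stopwords) {
        List<String> result = new ArrayList<>();
        for (String word : words) {
            if (!stopwords.contains(word.toLowerCase())) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical lower-case stopwords, standing in for stopwordslist.txt.
        Set<String> stopwords = new HashSet<>(Arrays.asList("so", "much", "and"));
        List<String> words = Arrays.asList("fast", "and", "so", "much", "new");
        System.out.println(filter(words, stopwords)); // prints [fast, new]
    }
}
```

Because nothing is removed from the list being read, the three adjacent stopwords in the sample are all dropped.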
Here is one way to try it:
String s="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.";
String stopWords[]={"love","this","cool"};
for(int i=0;i<stopWords.length;i++){
    if(s.contains(stopWords[i])){
        s=s.replaceAll(stopWords[i]+"\\s+", ""); //note this will remove spaces at the end
    }
}
System.out.println(s);
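This way your final output will not contain the words you do not want. Just put the list of stopwords in an array and replace each one with the required string.
Output with my stopwords: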
I phone, its super fast and there's so much new and things with jelly bean....but of recently I've seen some bugs.
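Instead, why not use the approach below? It is easier to read and understand:
// words holds the stopwords
for (String word : words) {
    // replaceAll treats its first argument as a regex; String.replace would search for the literal text "\s*"
    s = s.replaceAll(word + "\\s*", "");
}
System.out.println(s); // prints the string with the words removed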
Try using String's replaceAll API, like:
String myString = "I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.";
String stopWords = "I|its|with|but";
String afterStopWords = myString.replaceAll("(" + stopWords + ")\\s*", "");
System.out.println(afterStopWords);
OUTPUT:
love this phone, super fast and there's so much new and cool things jelly bean....of recently 've seen some bugs.
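Try storing the stopwords in a Set, then tokenize your String into a List. After that you can simply use 'removeAll' to get the result.
Set<String> stopwords = new HashSet<>(); // Set is an interface, so instantiate HashSet
// fill in the set with your file
String s = "I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.";
// Arrays.asList returns a fixed-size list, so copy it into an ArrayList before removing
List<String> listOfStrings = new ArrayList<>(Arrays.asList(s.split(" ")));
listOfStrings.removeAll(stopwords);
String result = StringUtils.join(listOfStrings, " "); // StringUtils is from Apache Commons Lang
No loops needed; they usually mean trouble.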
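For the "fill in the set with your file" step, something along these lines might work. It is a sketch under assumptions: the file is taken to hold one stopword per line, the class and method names are hypothetical, and a `StringReader` stands in for the real file so the example runs on its own. `String.join` (Java 8+) is swapped in for `StringUtils.join` to avoid the extra dependency.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

class StopwordSetDemo {

    // Reads one stopword per line, trimming whitespace and skipping blank lines.
    static Set<String> loadStopwords(BufferedReader reader) {
        return reader.lines()
                .map(String::trim)
                .filter(line -> !line.isEmpty())
                .collect(Collectors.toSet());
    }

    static String removeStopwords(String text, Set<String> stopwords) {
        // The ArrayList copy is needed because Arrays.asList is fixed-size.
        List<String> words = new ArrayList<>(Arrays.asList(text.split("\\s+")));
        words.removeAll(stopwords);
        return String.join(" ", words);
    }

    public static void main(String[] args) {
        // A StringReader stands in for the real file; for stopwordslist.txt one
        // could pass Files.newBufferedReader(Paths.get("stopwordslist.txt")).
        BufferedReader reader = new BufferedReader(new StringReader("this\nits\n\nand\n"));
        Set<String> stopwords = loadStopwords(reader);
        String text = "love this phone, its super fast and cool";
        System.out.println(removeStopwords(text, stopwords));
        // prints: love phone, super fast cool
    }
}
```

`removeAll` compares whole tokens with `equals`, so it is case-sensitive and will not match a stopword that has punctuation attached, such as "phone,".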
It seems that you stop as soon as one stopword has been removed from the sentence and then move on: you need to remove all the stopwords in each sentence.
You should try changing your code from:
for(int ii = 0; ii < wordsList.size(); ii++){
    for(int jj = 0; jj < k; jj++){
        if(stopwords[jj].contains(wordsList.get(ii).toLowerCase())){
            wordsList.remove(ii);
            break;
        }
    }
}
to:
for(int ii = 0; ii < wordsList.size(); ii++)
{
    for(int jj = 0; jj < k; jj++)
    {
        if(wordsList.get(ii).toLowerCase().contains(stopwords[jj]))
        {
            wordsList.remove(ii);
        }
    }
}
Note that break is removed, and stopword.contains(word) is changed to word.contains(stopword).
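One caveat with the change above: once wordsList.remove(ii) runs without a break, the inner loop keeps calling wordsList.get(ii), which now returns the next word, or throws IndexOutOfBoundsException if the removed word was the last one. Also, contains is a substring test, so a stopword such as "his" would remove "this". The variant below is only a sketch of an alternative, not this answer's code: it walks the list backwards and compares with equals. The class name and sample data are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class BackwardRemovalDemo {

    // Walks the list from the end, so removing index ii never shifts an
    // element that has not been visited yet.
    static void removeStopwords(List<String> wordsList, String[] stopwords) {
        for (int ii = wordsList.size() - 1; ii >= 0; ii--) {
            String word = wordsList.get(ii).toLowerCase();
            for (String stopword : stopwords) {
                // equals, not contains: "this" is kept when the stopword is "his".
                if (word.equals(stopword)) {
                    wordsList.remove(ii);
                    break; // the word is gone, so stop checking it
                }
            }
        }
    }

    public static void main(String[] args) {
        List<String> wordsList =
                new ArrayList<>(Arrays.asList("this", "and", "so", "much", "cool"));
        removeStopwords(wordsList, new String[] {"his", "and", "so", "much"});
        System.out.println(wordsList); // prints [this, cool]
    }
}
```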
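Recently, after going through several blogs and articles, one of my projects needed the ability to filter stop words, stemming words, and swear words from a given text or file. I created a simple library that filters data/files and made it available in Maven. Hope this helps someone.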
https://github.com/uttesh/exude
<dependency>
    <groupId>com.uttesh</groupId>
    <artifactId>exude</artifactId>
    <version>0.0.2</version>
</dependency>