
How do I use custom stopwords and stemmer file in WEKA (Java)?

So far I have:

NGramTokenizer tokenizer = new NGramTokenizer();
tokenizer.setNGramMinSize(2);
tokenizer.setNGramMaxSize(2); 
// Delimiters are a list of single characters, not a regular expression
tokenizer.setDelimiters(" \r\n\t.,;:'\"()?!");

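StringToWordVector filter = new StringToWordVector();
filter.setTokenizer(tokenizer); // attach the bigram tokenizer configured above
// customize filter here
filter.setInputFormat(input); // must be called before Filter.useFilter
Instances data = Filter.useFilter(input, filter);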

The API has these two methods for StringToWordVector:

setStemmer(Stemmer value);
setStopwordsHandler(StopwordsHandler value);

I have a text file containing the stopwords and another class that stems words. How do I use a custom stemmer and stopwords filter? Note that I'm taking phrases of size 2, so I can't preprocess the text and remove all stopwords beforehand.

Update: This worked for me (using Weka developer version 3.7.12)

To use a custom stopwords handler:

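import java.util.HashSet;
import weka.core.stopwords.StopwordsHandler;

public class MyStopwordsHandler implements StopwordsHandler {

    // Start with an empty set so isStopword never dereferences null
    private HashSet<String> myStopwords = new HashSet<String>();

    public MyStopwordsHandler() {
        // Load your own stopwords into myStopwords here
    }

    // Must implement this method from the StopwordsHandler interface.
    // The interface declares a primitive boolean return type, not Boolean.
    @Override
    public boolean isStopword(String word) {
        return myStopwords.contains(word);
    }

}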
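
Since the question mentions a plain text file of stopwords, the constructor above might fill the set along these lines. This is only a minimal sketch, assuming one stopword per line in a UTF-8 file; StopwordFileLoader and its methods are hypothetical names, not part of Weka, and the trimming and lowercasing are illustrative choices.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper (not part of Weka): builds a stopword set from a text file.
final class StopwordFileLoader {

    private StopwordFileLoader() { }

    // Assumption: one stopword per line. Each line is trimmed and lowercased;
    // blank lines are skipped.
    static Set<String> fromLines(Iterable<String> lines) {
        Set<String> words = new HashSet<>();
        for (String line : lines) {
            String word = line.trim().toLowerCase(Locale.ROOT);
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    // Reads the whole file as UTF-8 and delegates to fromLines.
    static Set<String> load(Path file) throws IOException {
        return fromLines(Files.readAllLines(file, StandardCharsets.UTF_8));
    }
}
```

The constructor could then assign myStopwords = new HashSet<String>(StopwordFileLoader.load(Paths.get("stopwords.txt"))), where the file name is a placeholder and the IOException has to be handled in whatever way suits your application. Because the entries are lowercased here, the lookup only matches if the tokens are lowercased as well, for example with filter.setLowerCaseTokens(true).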

To use a custom stemmer, create a class that implements the Stemmer interface and write the implementations for these methods:

public String stem(String word) { ... }
public String getRevision() { ... } 
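
With the two-word phrases from the question, one detail is worth checking: as far as I can tell, StringToWordVector passes each token produced by the tokenizer to stem(), so with an NGramTokenizer the argument would be a whole phrase such as "running dogs" rather than a single word. Below is a sketch of a helper that stems each word of a phrase separately, assuming the words are separated by whitespace. PhraseStemming and the wordStemmer parameter are hypothetical names; wordStemmer stands in for your existing stemming class.

```java
import java.util.function.UnaryOperator;

// Hypothetical helper (not part of Weka): applies a single-word stemmer
// to every word of a phrase.
final class PhraseStemming {

    private PhraseStemming() { }

    // Splits the phrase on whitespace, stems each word with wordStemmer,
    // and joins the results with single spaces.
    static String stemPhrase(String phrase, UnaryOperator<String> wordStemmer) {
        StringBuilder out = new StringBuilder();
        for (String word : phrase.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // happens only when the phrase itself is blank
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(wordStemmer.apply(word));
        }
        return out.toString();
    }
}
```

Inside stem(String word) this could be used as return PhraseStemming.stemPhrase(word, w -> myExistingStemmer.stem(w)); where myExistingStemmer is a placeholder for your own class. For single-word tokens the helper simply stems that one word.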

Then to use your custom stopwords handler and stemmer:

StringToWordVector filter = new StringToWordVector();
filter.setStemmer(new MyStemmer());
filter.setStopwordsHandler(new MyStopwordsHandler());
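
The snippet above configures only the stemmer and the stopwords handler; for the two-word phrases from the question, the tokenizer presumably also has to be attached with filter.setTokenizer(tokenizer). In that case isStopword() would likely receive whole phrases, and a plain set lookup on a token like "of the" would never match single-word entries. Here is a sketch of one possible policy, which drops a phrase only when every word in it is a stopword. PhraseStopwords is a hypothetical name and the policy itself is an assumption, so adjust it to your needs.

```java
import java.util.Set;

// Hypothetical helper (not part of Weka): stopword test for multi-word tokens.
final class PhraseStopwords {

    private PhraseStopwords() { }

    // Returns true only if the phrase is non-blank and every
    // whitespace-separated word in it is contained in the stopword set.
    static boolean allStopwords(String phrase, Set<String> stopwords) {
        String trimmed = phrase.trim();
        if (trimmed.isEmpty()) {
            return false;
        }
        for (String word : trimmed.split("\\s+")) {
            if (!stopwords.contains(word)) {
                return false;
            }
        }
        return true;
    }
}
```

A handler could then implement isStopword as return PhraseStopwords.allStopwords(word, myStopwords);. With this policy "of the" is dropped while "of dogs" is kept; checking for any stopword instead of all would be the stricter alternative.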

Note: The answer below by Thusitha works for the stable 3.6 version and is much simpler than the one described above, but I could not get it to work with version 3.7.12.

In the latest stable Weka library (3.6.x) you can use:

StringToWordVector filter = new StringToWordVector();
filter.setStopwords(new File("filename"));

I'm using the following dependency:

<dependency>
   <groupId>nz.ac.waikato.cms.weka</groupId>
   <artifactId>weka-stable</artifactId>
   <version>3.6.12</version>
</dependency>

From the API docs:

public void setStopwords(java.io.File value)

Sets the file containing the stopwords; null or a directory unsets the stopwords. If the file exists, it automatically turns on the flag to use the stoplist.

Parameters: value - the file containing the stopwords
