
StandardAnalyzer - Apache Lucene

I'm developing a system that feeds text files to a StandardAnalyzer and then replaces each file's contents with the analyzer's output (the text tokenized, with all stop words removed). The code I've written so far is:

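    File f = new File(path);

    TokenStream stream = analyzer.tokenStream("contents",
            new StringReader(readFileToString(f)));

    CharTermAttribute charTermAttribute = stream.addAttribute(CharTermAttribute.class);

    stream.reset(); // the TokenStream contract expects reset() before the first incrementToken()
    while (stream.incrementToken()) {
        String term = charTermAttribute.toString();
        System.out.print(term); // terms are printed back to back, with no separator
    }
    stream.end();
    stream.close();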

    // readFileToString(File f): reads the whole file into a String
    private static String readFileToString(File f) throws FileNotFoundException {
        StringBuilder textBuilder = new StringBuilder();
        String ls = System.getProperty("line.separator");
        Scanner scanner = new Scanner(new FileInputStream(f));

        while (scanner.hasNextLine()) {
            textBuilder.append(scanner.nextLine()).append(ls);
        }
        scanner.close();
        return textBuilder.toString();
    }

readFileToString(f) is a simple function that reads the file contents into a string. The output I'm getting is the sequence of words with the spaces and newlines between them removed. Is there a way to preserve the original spaces and newline characters in the analyzer output, so that I can replace the original file contents with the StandardAnalyzer's filtered output and still present it in a readable form?

Tokenizers save each term's position (its start and end character offsets), so in theory you could look at the offsets to determine how many characters lie between consecutive tokens. They don't save the text that was between the tokens, though. So you could get back spaces, but not newlines.
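
If the original string is still in memory, as it is in the question's code, the text skipped between two tokens can be read back from it using those offsets. Below is a minimal sketch of that idea, written in plain Java so it runs without Lucene. The `Token` class and the `rebuild` method are hypothetical names made up for illustration; in a Lucene pipeline the start and end values would presumably come from `OffsetAttribute.startOffset()` and `OffsetAttribute.endOffset()`.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch: rebuild readable text from kept tokens plus their character offsets. */
final class OffsetRebuilder {

    /** Hypothetical holder for what an analyzer reports about one token. */
    static final class Token {
        final String term;
        final int start; // offset of the token's first character in the original text
        final int end;   // offset one past the token's last character

        Token(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Joins the terms with a single space, or with a newline when the
     * original text between two kept tokens contained a line break.
     */
    static String rebuild(String original, List<Token> tokens) {
        StringBuilder out = new StringBuilder();
        int previousEnd = 0;
        for (Token t : tokens) {
            // Everything the analyzer skipped since the last kept token,
            // including whitespace, punctuation and removed stop words.
            String gap = original.substring(previousEnd, t.start);
            if (out.length() > 0) {
                out.append(gap.indexOf('\n') >= 0 ? "\n" : " ");
            }
            out.append(t.term);
            previousEnd = t.end;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String original = "The quick fox\njumps over the dog";
        List<Token> tokens = Arrays.asList(
                new Token("quick", 4, 9),
                new Token("fox", 10, 13),
                new Token("jumps", 14, 19),
                new Token("over", 20, 24),
                new Token("dog", 29, 32));
        System.out.println(rebuild(original, tokens));
    }
}
```

This sketch collapses any run of blank lines into one newline and any other gap into one space; keeping the gap's whitespace verbatim would be a small variation.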

If you're comfortable with JFlex, you could modify the tokenizer to treat newlines as tokens. That's probably more work than the gain would justify, though.
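
A different technique that leaves the tokenizer alone is to analyze the file one line at a time and write a newline after each line's terms. The sketch below is not the JFlex approach described above, only a possible alternative. `filterLines` and `standInAnalyzer` are hypothetical names, and `standInAnalyzer` is a deliberately crude placeholder (lowercase, split on whitespace, drop "the") standing in for a real call that runs the analyzer over a single line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

/** Sketch: keep line structure by analyzing each line separately. */
final class LineByLineFilter {

    /**
     * Applies analyzeLine to every line of text, joins each line's terms
     * with spaces, and joins the lines with newlines.
     */
    static String filterLines(String text, Function<String, List<String>> analyzeLine) {
        List<String> filtered = new ArrayList<>();
        // "\\R" matches any line break; the -1 limit keeps trailing empty lines.
        for (String line : text.split("\\R", -1)) {
            filtered.add(String.join(" ", analyzeLine.apply(line)));
        }
        return String.join("\n", filtered);
    }

    /** Placeholder for a real analyzer: lowercases and drops the word "the". */
    static List<String> standInAnalyzer(String line) {
        List<String> terms = new ArrayList<>();
        for (String word : line.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!word.isEmpty() && !word.equals("the")) {
                terms.add(word);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        String text = "The quick fox\njumps over the dog";
        System.out.println(filterLines(text, LineByLineFilter::standInAnalyzer));
    }
}
```

The trade-off is that the analyzer never sees text spanning a line break, which should not matter for a tokenizer that already splits on whitespace.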
