Java: defining a word for a full-text inverted index

Question

I am working on a simple full-text inverted index, building an index of words that I extract from PDF files. I am using the PDFBox library to extract the text.

However, I would like to know how to define what counts as a word to index. Currently, my indexing treats every space-separated chunk of text as a word token. For example:

This string, is a code.

In this case, the index table would contain:

This
string,
is
a
code.

The flaw here is that a token like string, comes with a comma attached, whereas I think string alone would be sufficient, because nobody searches for string, or code.

Back to my question: is there a specific rule I could use to define my word tokens so that this kind of issue does not occur?

Code:

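import java.io.File;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// The class name is illustrative; the original snippet had no enclosing class.
public class PdfWordExtractor {
    public static void main(String[] args) {
        File folder = new File("D:\\PDF1");
        File[] listOfFiles = folder.listFiles();
        if (listOfFiles == null) {
            return; // not a directory, or an I/O error occurred
        }

        for (File file : listOfFiles) {
            if (!file.isFile()) {
                continue;
            }
            Set<String> uniqueWords = new HashSet<>();
            // PDDocument.load(File) is the PDFBox 2.x API.
            try (PDDocument document = PDDocument.load(file)) {
                if (!document.isEncrypted()) {
                    PDFTextStripper tStripper = new PDFTextStripper();
                    String pdfFileInText = tStripper.getText(document);
                    String[] lines = pdfFileInText.split("\\r?\\n");
                    for (String line : lines) {
                        // Splitting on a single space keeps punctuation attached ("string,", "code.").
                        String[] words = line.split(" ");
                        for (String word : words) {
                            uniqueWords.add(word);
                        }
                    }
                }
            } catch (IOException e) {
                System.err.println("Exception while trying to read pdf document - " + e);
            }
        }
    }
}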

Answer 1

If you want to remove common punctuation, you could do:

for(String word : words) {
    uniqueWords.add(word.replaceAll("[.,!?]", ""));
}

This removes all periods, commas, exclamation marks, and question marks.

If you also want to get rid of double quotes, you can do:

uniqueWords.add(word.replaceAll("[.,?!\"]", ""));
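
Listing punctuation marks one by one tends to miss some (semicolons, brackets, and so on). As a rough sketch rather than a definitive fix, the same idea can be written with the predefined `\p{Punct}` class, which in Java covers the ASCII punctuation characters. The class and method names below (`PunctuationStripper`, `strip`, `uniqueWords`) are made up for illustration and are not part of the original code:

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, shown only to illustrate the technique.
class PunctuationStripper {

    // Removes every ASCII punctuation character from a token,
    // wherever it occurs in the token.
    static String strip(String token) {
        return token.replaceAll("\\p{Punct}", "");
    }

    // Splits a line on whitespace and collects the cleaned, non-empty tokens
    // in the order they first appear.
    static Set<String> uniqueWords(String line) {
        Set<String> words = new LinkedHashSet<>();
        for (String token : line.split("\\s+")) {
            String cleaned = strip(token);
            if (!cleaned.isEmpty()) {
                words.add(cleaned);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // prints [This, string, is, a, code]
        System.out.println(uniqueWords("This string, is a code."));
    }
}
```

Note that this removes punctuation inside a token as well, so "don't" becomes "dont" and "e.g." becomes "eg". Whether that is acceptable depends on what users are expected to search for.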

Answer 2

Yes. You can use the replaceAll method to strip leading and trailing non-word characters, like this:

uniqueWords.add(word.replaceAll("([\\W]+$)|(^[\\W]+)", ""));
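
One caveat with `\W`: by default Java treats only `[a-zA-Z_0-9]` as word characters, so the pattern above would also trim an accented letter at the edge of a token (for example, the final letter of "café"). A possible variant, sketched here under the assumption that tokens should keep all Unicode letters and digits and be matched case-insensitively, trims the edges with `\p{L}` and `\p{N}` and lowercases the result. `TokenNormalizer` and its methods are hypothetical names, not part of the original answer:

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical helper, shown only to illustrate the technique.
class TokenNormalizer {

    // A run of characters at the start or end of the token that are
    // neither Unicode letters (\p{L}) nor Unicode digits (\p{N}).
    private static final Pattern EDGES =
            Pattern.compile("^[^\\p{L}\\p{N}]+|[^\\p{L}\\p{N}]+$");

    // Trims edge punctuation and lowercases; inner characters such as the
    // apostrophe in "don't" are left alone.
    static String normalize(String token) {
        return EDGES.matcher(token).replaceAll("").toLowerCase(Locale.ROOT);
    }

    // Splits a line on whitespace and collects normalized, non-empty tokens.
    static Set<String> tokens(String line) {
        Set<String> result = new LinkedHashSet<>();
        for (String raw : line.split("\\s+")) {
            String token = normalize(raw);
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [this, string, is, a, code]
        System.out.println(tokens("This string, is a code."));
    }
}
```

Skipping empty results matters here because a token made up entirely of punctuation (such as "--") normalizes to the empty string, which is rarely worth indexing. Lowercasing is a design choice: it makes "This" and "this" share one index entry, at the cost of losing case distinctions.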


Answer 1: 2 · 2018-11-14 01:54:31
Answer 2: 1 · accepted · 2018-11-14 02:14:07
