Java: full-text inverted index, defining a word
I am working on a simple full-text inverted index, building an index of words that I extract from PDF files. I am using the PDFBox library to achieve this.
However, I would like to know how one should define what counts as a word to index. My indexing currently treats every space-separated string as a word token. For example,
This string, is a code.
In this case, the index table would contain:
This
string,
is
a
code.
The flaw is that a token such as "string," keeps its comma, whereas "string" alone would be sufficient, because nobody searches for "string," or "code." with the punctuation attached.
Back to my question: is there a specific rule I could use to define a word token so that this kind of issue does not occur?
Code:
File folder = new File("D:\\PDF1");
File[] listOfFiles = folder.listFiles();
for (File file : listOfFiles) {
    if (file.isFile()) {
        HashSet<String> uniqueWords = new HashSet<>();
        String path = "D:\\PDF1\\" + file.getName();
        try (PDDocument document = PDDocument.load(new File(path))) {
            if (!document.isEncrypted()) {
                PDFTextStripper tStripper = new PDFTextStripper();
                String pdfFileInText = tStripper.getText(document);
                String lines[] = pdfFileInText.split("\\r?\\n");
                for (String line : lines) {
                    String[] words = line.split(" ");
                    for (String word : words) {
                        uniqueWords.add(word);
                    }
                }
            }
        } catch (IOException e) {
            System.err.println("Exception while trying to read pdf document - " + e);
        }
    }
}
If you want to remove the most common punctuation marks, you could do:
for (String word : words) {
    uniqueWords.add(word.replaceAll("[.,!?]", ""));
}
This removes all periods, commas, exclamation marks, and question marks from each word.
If you also want to get rid of double quotes, you can do:
uniqueWords.add(word.replaceAll("[.,?!\"]", ""));
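Listing punctuation marks one by one gets tedious as more of them turn up in the extracted text. As a rough sketch, the predefined `\p{Punct}` character class could be used instead; it covers the ASCII punctuation characters. The class name `PunctuationStripper` and its methods are made up for this illustration and are not part of PDFBox or of the code in the question. The sketch also splits on runs of whitespace and skips tokens that become empty after cleaning, which is an assumption about the desired behaviour.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper for illustration only.
class PunctuationStripper {

    // Removes every ASCII punctuation character from a token.
    static String strip(String word) {
        return word.replaceAll("\\p{Punct}", "");
    }

    // Splits a line on runs of whitespace, strips punctuation from each
    // piece, and keeps only the non-empty results in insertion order.
    static Set<String> tokens(String line) {
        Set<String> result = new LinkedHashSet<>();
        for (String word : line.trim().split("\\s+")) {
            String cleaned = strip(word);
            if (!cleaned.isEmpty()) {
                result.add(cleaned);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [This, string, is, a, code]
        System.out.println(tokens("This string, is a code."));
    }
}
```

Note that this removes punctuation inside a word as well, so "don't" becomes "dont" and "e-mail" becomes "email". Whether that is acceptable depends on what users are expected to search for.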
Yes. You can use the replaceAll method to strip non-word characters from the start and end of each word, like this:
uniqueWords.add(word.replaceAll("([\\W]+$)|(^[\\W]+)", ""));
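To see what this pattern does and does not touch, here is a small sketch that applies it to a few tokens. The class `EdgeTrimmer` and its method names are hypothetical, and the lower-casing step is an added assumption (so that "This" and "this" land on the same index entry), not something the answer above prescribes.

```java
import java.util.Locale;

// Hypothetical helper for illustration only.
class EdgeTrimmer {

    // Strips non-word characters from the start and end of a token,
    // leaving characters in the middle of the token untouched.
    static String trimEdges(String word) {
        return word.replaceAll("(\\W+$)|(^\\W+)", "");
    }

    // Assumed extra step: lower-case the trimmed token.
    static String normalize(String word) {
        return trimEdges(word).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String[] samples = {"string,", "code.", "\"Quoted\"", "don't", "--"};
        for (String sample : samples) {
            // Brackets make empty results visible.
            System.out.println("[" + normalize(sample) + "]");
        }
    }
}
```

Leading and trailing punctuation is removed, "don't" keeps its apostrophe, and a token made up only of punctuation such as "--" becomes an empty string, so it is worth skipping empty results before adding them to the index. By default `\W` treats only ASCII letters, digits, and the underscore as word characters, so text with accented or non-Latin letters would likely need a Unicode-aware pattern, for example by prefixing the expression with the `(?U)` flag.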