Also, I would like to know how to add metadata at indexing time so that I can boost certain parameters.
There are several frameworks for extracting text suitable for Lucene indexing from rich-text files (PDF, PPT, etc.).
You can use Apache Tika. Tika is a toolkit for detecting and extracting metadata and structured text content from various document types using existing parser libraries.
See Tika's list of supported document formats.
The code will look like this:
Reader reader = new Tika().parse(stream);
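To connect that one-liner to the original question, here is a minimal sketch of feeding Tika's output into a Lucene index and boosting a metadata field. It assumes Tika and a recent Lucene (7 or later) are on the classpath. The class name, the field names "path", "content" and "title", and the boost factor 2.0 are illustrative choices, not anything prescribed by either library. Lucene 7 removed index-time boosts, so the sketch stores the metadata as its own field and applies the boost at query time.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.BoostQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.tika.Tika;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaCoreProperties;

// Hypothetical helper class; names are for illustration only.
public class TikaLuceneSketch {

    /** Extracts text and metadata from one file with Tika and adds it to a Lucene index. */
    public static void indexFile(Path file, Path indexDir) throws IOException {
        Tika tika = new Tika();
        Metadata metadata = new Metadata();

        try (Directory dir = FSDirectory.open(indexDir);
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));
             InputStream stream = Files.newInputStream(file);
             Reader reader = tika.parse(stream, metadata)) {

            Document doc = new Document();
            // Exact, stored value so a hit can be mapped back to its file.
            doc.add(new StringField("path", file.toString(), Field.Store.YES));
            // Tokenized body text, streamed from Tika's reader (not stored).
            doc.add(new TextField("content", reader));

            // Metadata becomes a separate field; it may be absent for some formats.
            String title = metadata.get(TikaCoreProperties.TITLE);
            if (title != null) {
                doc.add(new TextField("title", title, Field.Store.YES));
            }
            writer.addDocument(doc);
        }
    }

    /** Matches a term in either field, weighting title matches twice as much (arbitrary factor). */
    public static Query boostedQuery(String lowercaseTerm) {
        Query inTitle = new BoostQuery(new TermQuery(new Term("title", lowercaseTerm)), 2.0f);
        Query inContent = new TermQuery(new Term("content", lowercaseTerm));
        return new BooleanQuery.Builder()
                .add(inTitle, BooleanClause.Occur.SHOULD)
                .add(inContent, BooleanClause.Occur.SHOULD)
                .build();
    }
}
```

The term passed to boostedQuery should already be lowercase, because StandardAnalyzer lowercases tokens at indexing time and TermQuery does no analysis of its own. If you need the complete metadata before building the document, Tika's parseToString(stream, metadata) is an alternative, at the cost of holding the whole text in memory.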
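For a Java solution that uses PDFBox and Apache Lucene to split PDF files into text page by page, see https://github.com/WolfgangFahl/pdfindexer . It indexes these text pages and creates a result HTML index file whose links open the matching pages in the PDF source via the corresponding open parameters.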
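The page-linking idea can be sketched without any PDF or Lucene dependency. This is not pdfindexer's actual code. It only illustrates the `#page=N` open parameter, which Adobe Reader and most browser PDF viewers understand. The class and method names are hypothetical, and the keyword-to-pages map stands in for whatever a real search over the per-page index would return.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; names are for illustration only.
public class PdfPageLinks {

    /** Builds a link that opens the PDF at a given 1-based page via the "page" open parameter. */
    public static String pageLink(String pdfUrl, int pageNumber) {
        return pdfUrl + "#page=" + pageNumber;
    }

    /** Renders a minimal HTML index: one list item per keyword, one link per matching page. */
    public static String renderIndex(String pdfUrl, Map<String, List<Integer>> hits) {
        StringBuilder html = new StringBuilder("<ul>\n");
        for (Map.Entry<String, List<Integer>> entry : hits.entrySet()) {
            html.append("  <li>").append(escape(entry.getKey())).append(":");
            for (int page : entry.getValue()) {
                html.append(" <a href=\"").append(escape(pageLink(pdfUrl, page))).append("\">")
                    .append(page).append("</a>");
            }
            html.append("</li>\n");
        }
        return html.append("</ul>\n").toString();
    }

    /** Escapes the characters that are unsafe in HTML text and attribute values. */
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> hits = new LinkedHashMap<>();
        hits.put("lucene", Arrays.asList(1, 3)); // made-up sample hits
        System.out.print(renderIndex("manual.pdf", hits));
    }
}
```

Indexing one Lucene document per page, with the page number stored as a field, is what makes such links possible: each hit then carries the page to jump to.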
Lucene indexes text, not files. You need some other process to extract the text from the files and then run Lucene on it.