How do I index PDF, PPT, and XLS files in Lucene (Java, Python, or PHP; any of these is fine)?

Also, I would like to know how to add metadata at index time so that I can boost certain parameters.

Several frameworks can extract text suitable for Lucene indexing from rich document formats (PDF, PPT, etc.):

  • One of them is Apache Tika, which started as a sub-project of Lucene and is now a top-level Apache project.
  • Apache POI is an Apache project for reading and writing Microsoft Office formats (Word, Excel, PowerPoint).
  • There are also some commercial alternatives.

You can use Apache Tika. Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.

Supported Document Formats

  • HyperText Markup Language
  • XML and derived formats
  • Microsoft Office document formats
  • OpenDocument Format
  • Portable Document Format
  • Electronic Publication Format
  • Rich Text Format
  • Compression and packaging formats
  • Text formats
  • Audio formats
  • Image formats
  • Video formats
  • Java class files and archives
  • The mbox format

The code will look like this:

    Reader reader = new Tika().parse(stream);
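To show how that one-liner might fit into a full indexing step, here is a minimal sketch that extracts text and metadata with Tika and writes them to a Lucene index. It assumes tika-core, tika-parsers, and a Lucene 8.x or 9.x release are on the classpath. The class name `TikaLuceneIndexer` and the field names `path`, `content`, and `title` are illustrative choices, not anything prescribed by either library. It uses `parseToString` rather than the `Reader` variant so that the metadata is populated before the fields are built.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.tika.Tika;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaCoreProperties;

public class TikaLuceneIndexer {

    /** Extracts text and metadata with Tika and maps them onto Lucene fields. */
    public static Document toDocument(Tika tika, Path file) throws Exception {
        Metadata metadata = new Metadata();
        String text;
        try (InputStream in = Files.newInputStream(file)) {
            // parseToString detects the format and fills `metadata` before returning.
            text = tika.parseToString(in, metadata);
        }

        Document doc = new Document();
        // Exact-match, stored field so search hits can be traced back to the file.
        doc.add(new StringField("path", file.toString(), Field.Store.YES));
        // Analyzed full text; not stored, to keep the index small.
        doc.add(new TextField("content", text, Field.Store.NO));

        // Metadata goes into its own field so it can be weighted separately at query time.
        String title = metadata.get(TikaCoreProperties.TITLE);
        if (title != null) {
            doc.add(new TextField("title", title, Field.Store.YES));
        }
        return doc;
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: TikaLuceneIndexer <indexDir> <file>...");
            return;
        }
        Tika tika = new Tika();
        // Assumption: whole documents should be indexed; the default cap is 100,000 characters.
        tika.setMaxStringLength(-1);

        try (Directory dir = FSDirectory.open(Paths.get(args[0]));
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            for (int i = 1; i < args.length; i++) {
                writer.addDocument(toDocument(tika, Paths.get(args[i])));
            }
        }
    }
}
```

Keeping metadata such as the title in a separate field is what later makes it possible to weight it differently from the body text.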

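For a Java solution that uses PDFBox and Apache Lucene to split PDF files into text page by page, see https://github.com/WolfgangFahl/pdfindexer. It indexes these text pages and creates a result HTML index file that links to the pages in the PDF source using the corresponding open parameters.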

Lucene indexes text, not files. You need some other process to extract the text from the files and then run Lucene on it.
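On the metadata part of the question: once extracted metadata sits in its own field, one way to favor it is to boost that field at query time. Index-time boosts were removed in Lucene 7, so this is a swapped-in technique rather than the index-time approach the question describes. The sketch below assumes Lucene 8.x or 9.x and an index with `title` and `content` fields; the class name `BoostedTitleSearch`, the sample documents, and the boost factor of 2 are made up for illustration.

```java
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.BoostQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class BoostedTitleSearch {

    /** Matches the term in either field, weighting title matches by titleBoost. */
    static Query titleBoostedQuery(String term, float titleBoost) {
        Query inTitle = new BoostQuery(new TermQuery(new Term("title", term)), titleBoost);
        Query inContent = new TermQuery(new Term("content", term));
        return new BooleanQuery.Builder()
                .add(inTitle, BooleanClause.Occur.SHOULD)
                .add(inContent, BooleanClause.Occur.SHOULD)
                .build();
    }

    static void addDoc(IndexWriter writer, String title, String content) throws IOException {
        Document doc = new Document();
        doc.add(new TextField("title", title, Field.Store.YES));
        doc.add(new TextField("content", content, Field.Store.NO));
        writer.addDocument(doc);
    }

    /** Builds a tiny in-memory index and returns the title of the best hit, or null. */
    public static String topTitle(String term) throws IOException {
        try (Directory dir = new ByteBuffersDirectory()) {
            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
                // Hypothetical sample documents standing in for extracted files.
                addDoc(writer, "Misc notes", "lucene stuff here");
                addDoc(writer, "Lucene guide", "indexing rich documents");
            }
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                // TermQuery is not analyzed, so the term must already be lowercase.
                ScoreDoc[] hits = searcher.search(titleBoostedQuery(term, 2.0f), 10).scoreDocs;
                return hits.length == 0 ? null : searcher.doc(hits[0].doc).get("title");
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(topTitle("lucene"));
    }
}
```

Because the boost lives in the query, it can be tuned without reindexing. The same effect is available through the classic query parser syntax, for example `title:lucene^2 content:lucene`.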
