
How to index pdf, ppt, xl files in lucene (java based or python or php any of these is fine)?

Original 2010-04-06 06:03:10 4 4 java/ indexing/ lucene

Also, I would like to know how to add metadata while indexing, so that I can boost some parameters.

4 answers

Several frameworks can extract text suitable for Lucene indexing from rich document formats (PDF, PPT, etc.).

One of them is Apache Tika, which began as a sub-project of Lucene.
Apache POI is another Apache project, focused on reading and writing Microsoft Office formats.
There are also some commercial alternatives.

You can use Apache Tika. Tika is a toolkit for detecting and extracting metadata and structured text content from various document types using existing parser libraries.

Supported Document Formats

HyperText Markup Language
XML and derived formats
Microsoft Office document formats
OpenDocument Format
Portable Document Format
Electronic Publication Format
Rich Text Format
Compression and packaging formats
Text formats
Audio formats
Image formats
Video formats
Java class files and archives
The mbox format

The code will look like this:

    Reader reader = new Tika().parse(stream);

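For a Java solution that uses PDFBox and Apache Lucene to split PDF files into text page by page, see https://github.com/WolfgangFahl/pdfindexer . It indexes those text pages and creates an HTML index file whose results link to the matching pages in the source PDF using the corresponding open parameters.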
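The page-level idea can be sketched without any PDF library. What follows is a minimal illustration and not the pdfindexer code: it assumes the page texts have already been extracted (for example by PDFBox), uses a naive substring match as a stand-in for a Lucene query, and builds links with the PDF `#page=N` open parameter. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Sketch: keep one index entry per PDF page and link hits back to that page. */
final class PdfPageIndexSketch {

    /** One searchable unit: the text of a single page of a single PDF. */
    static final class PageEntry {
        final String pdfPath;
        final int pageNumber; // 1-based, as used by the "page" open parameter
        final String text;

        PageEntry(String pdfPath, int pageNumber, String text) {
            this.pdfPath = pdfPath;
            this.pageNumber = pageNumber;
            this.text = text;
        }
    }

    private final List<PageEntry> entries = new ArrayList<>();

    /** Adds one entry per page; pageTexts.get(0) is treated as page 1. */
    void addDocument(String pdfPath, List<String> pageTexts) {
        for (int i = 0; i < pageTexts.size(); i++) {
            entries.add(new PageEntry(pdfPath, i + 1, pageTexts.get(i)));
        }
    }

    /** Case-insensitive substring match; a real setup would query Lucene here. */
    List<PageEntry> search(String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        List<PageEntry> hits = new ArrayList<>();
        for (PageEntry entry : entries) {
            if (entry.text.toLowerCase(Locale.ROOT).contains(needle)) {
                hits.add(entry);
            }
        }
        return hits;
    }

    /** Builds a link such as manual.pdf#page=2. */
    static String pageLink(String pdfPath, int pageNumber) {
        return pdfPath + "#page=" + pageNumber;
    }

    /** Renders hits as an HTML list (HTML escaping is omitted for brevity). */
    static String toHtml(List<PageEntry> hits) {
        StringBuilder html = new StringBuilder("<ul>\n");
        for (PageEntry hit : hits) {
            html.append("  <li><a href=\"")
                .append(pageLink(hit.pdfPath, hit.pageNumber))
                .append("\">")
                .append(hit.pdfPath).append(", page ").append(hit.pageNumber)
                .append("</a></li>\n");
        }
        return html.append("</ul>\n").toString();
    }

    public static void main(String[] args) {
        PdfPageIndexSketch index = new PdfPageIndexSketch();
        index.addDocument("manual.pdf",
                Arrays.asList("Installation notes", "Lucene index setup", "Troubleshooting"));
        System.out.print(toHtml(index.search("lucene")));
    }
}
```

Indexing one entry per page, rather than one per file, is what lets each hit point at a specific page instead of the start of the document.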

Lucene indexes text, not files - you need some other process to extract the text from the files and then run Lucene on that text.
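To make that two-stage split concrete, here is a dependency-free sketch. It is not Lucene or Tika code: `TextExtractor` and `ToyIndex` are hypothetical stand-ins (in practice a library such as Tika would do the extraction and Lucene would do the indexing), and the "files" are plain UTF-8 bytes. It also illustrates one way to think about the boosting part of the question: keep metadata such as a title in its own field, then weight that field more heavily when scoring.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Sketch of the extract-then-index split, with per-field weights at search time. */
final class ExtractThenIndexSketch {

    /** Stage 1 (hypothetical interface): turns raw file bytes into plain text. */
    interface TextExtractor {
        String extract(byte[] fileBytes);
    }

    /** Stage 2: a toy index that stores named text fields per document. */
    static final class ToyIndex {
        private final List<Map<String, String>> docs = new ArrayList<>();

        /** Stores one document as field name -> field text and returns its id. */
        int add(Map<String, String> fields) {
            docs.add(new HashMap<>(fields));
            return docs.size() - 1;
        }

        /** Score = sum over fields of (field weight * occurrences of the term). */
        double score(int docId, String term, Map<String, Double> fieldWeights) {
            double total = 0.0;
            for (Map.Entry<String, String> field : docs.get(docId).entrySet()) {
                double weight = fieldWeights.getOrDefault(field.getKey(), 1.0);
                total += weight * count(field.getValue(), term);
            }
            return total;
        }

        /** Counts whole-word, case-insensitive occurrences of term in text. */
        private static int count(String text, String term) {
            String needle = term.toLowerCase(Locale.ROOT);
            int hits = 0;
            for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
                if (token.equals(needle)) {
                    hits++;
                }
            }
            return hits;
        }
    }

    /** Extracts text from two pretend files and indexes it with a title field. */
    static ToyIndex buildDemoIndex() {
        // Assumption: the files are already plain UTF-8 text.
        TextExtractor extractor = bytes -> new String(bytes, StandardCharsets.UTF_8);
        ToyIndex index = new ToyIndex();

        Map<String, String> first = new HashMap<>();
        first.put("title", "Lucene basics");
        first.put("content", extractor.extract(
                "An introduction to search.".getBytes(StandardCharsets.UTF_8)));
        index.add(first);

        Map<String, String> second = new HashMap<>();
        second.put("title", "Quarterly report");
        second.put("content", extractor.extract(
                "We evaluated Lucene and Lucene-based tools.".getBytes(StandardCharsets.UTF_8)));
        index.add(second);
        return index;
    }

    public static void main(String[] args) {
        ToyIndex index = buildDemoIndex();
        Map<String, Double> weights = new HashMap<>();
        weights.put("title", 3.0); // title matches count three times as much
        System.out.println("doc 0: " + index.score(0, "lucene", weights));
        System.out.println("doc 1: " + index.score(1, "lucene", weights));
    }
}
```

With the title weighted at 3.0, the document that mentions the term once in its title outranks the one that mentions it twice in its body; with no weights the order flips.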



solution1 4 2010-04-06 07:56:58
solution2 2 2010-04-16 14:04:38
solution3 1 2013-05-12 07:44:16
solution4 1 ACCEPTED 2010-04-06 06:11:35
