
How to extract data from a PDF file using Tika or any other library and store it in CSV/Excel format

Original post: 2016-03-26 18:22:33 · Tags: java, excel, pdf, apache-tika

I want to extract the data inside a PDF file and present it as a CSV file or an Excel sheet. I learned that this can be done with the Tika library in Java. I found out how to extract the data as plain text, but I want to know how to store it in an Excel sheet.

If anyone has done this kind of work before, please help.

1 answer

The first part (and the hard one) is parsing the original data and interpreting it as a table. Apache Tika will give you an XHTML representation (or call your own handler with SAX events), but for a PDF file it usually won't construct a table for you, since PDF isn't a tabular format by itself.
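To see what Tika actually emits for a given PDF, a minimal sketch along these lines may help. It assumes Tika (tika-core plus the parser modules) is on the classpath; the class name `PdfToXhtml` and the command-line argument are illustrative, not part of the original answer. For most PDFs the output is a series of page `div` and `p` elements rather than `table` markup.

```java
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.ToXMLContentHandler;
import org.xml.sax.SAXException;

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

public class PdfToXhtml {

    // Parses the stream with Tika's auto-detecting parser and returns
    // the XHTML that Tika generates for the document.
    static String toXhtml(InputStream in) throws IOException, SAXException, TikaException {
        ToXMLContentHandler handler = new ToXMLContentHandler();
        new AutoDetectParser().parse(in, handler, new Metadata(), new ParseContext());
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        // args[0] is the path to a PDF file (hypothetical input).
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println(toXhtml(in));
        }
    }
}
```

Inspecting this output first shows whether the table rows survive as one line of text each, which decides whether simple splitting is viable.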

So you'll have to take the paragraphs Tika produces, split them, and pass the resulting cells to a CSV/XLS/XLSX writer. This might work if your PDF contains a regular table (one line per table row, cells cleanly separated, etc.), but it amounts to parsing plain text.
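As a rough illustration of the splitting step, here is a sketch that assumes each table row arrives as one line of text and that cells are separated by a tab or by two or more spaces. That separator rule is an assumption about the extracted text, not something Tika guarantees, and the class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TextTableSplitter {

    // Splits extracted text into rows and cells.
    // Assumption: one table row per line; cells are separated by a tab
    // or by a run of two or more spaces/tabs. Single spaces stay inside a cell.
    static List<List<String>> splitRows(String text) {
        List<List<String>> rows = new ArrayList<>();
        for (String line : text.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines between rows
            }
            rows.add(Arrays.asList(trimmed.split("[ \\t]{2,}|\\t")));
        }
        return rows;
    }

    public static void main(String[] args) {
        String sample = "Name   Qty   Price\nWidget A   2   9.50\n";
        splitRows(sample).forEach(System.out::println);
    }
}
```

A rule like this cannot detect empty cells or cells that wrap onto a second line, which is where the plain-text approach tends to break down.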

If that doesn't work, you'll have to use a PDF parser (such as Apache PDFBox) and try to interpret its output.

The second part (output) is simple. If CSV/SSV/TSV suits you, use your preferred library to produce it (I can recommend Apache commons-csv). Keep in mind that MS Excel needs a BOM at the start of UTF-8 and UTF-16 CSV files to recognize that the file isn't in a single-byte encoding (such as CP-1252).
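The BOM point can be sketched with the standard library alone. Note that this example deliberately does not use commons-csv: it hand-rolls minimal quoting so that it has no dependencies, and a real project would probably let a CSV library handle the escaping. The class name and the output file name `out.csv` are hypothetical.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class BomCsvWriter {

    // Quotes a cell if it contains a comma, a quote, or a line break.
    static String escape(String cell) {
        if (cell.contains(",") || cell.contains("\"") || cell.contains("\n") || cell.contains("\r")) {
            return "\"" + cell.replace("\"", "\"\"") + "\"";
        }
        return cell;
    }

    // Writes rows as UTF-8 CSV, starting with a BOM so that Excel
    // detects the encoding instead of assuming a single-byte code page.
    static void write(List<List<String>> rows, OutputStream out) throws IOException {
        Writer w = new OutputStreamWriter(out, StandardCharsets.UTF_8);
        w.write('\uFEFF'); // encoded as the bytes EF BB BF
        for (List<String> row : rows) {
            StringBuilder line = new StringBuilder();
            for (int i = 0; i < row.size(); i++) {
                if (i > 0) {
                    line.append(',');
                }
                line.append(escape(row.get(i)));
            }
            w.write(line.append("\r\n").toString());
        }
        w.flush();
    }

    public static void main(String[] args) throws IOException {
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("Name", "Qty"),
                Arrays.asList("Widget, large", "2"));
        try (OutputStream out = Files.newOutputStream(Paths.get("out.csv"))) {
            write(rows, out);
        }
    }
}
```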

If you want the Excel XLS or XLSX format, just use Apache POI to write it.
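For the XLSX case, a minimal sketch with Apache POI might look like the following. It assumes the poi-ooxml artifact is on the classpath and writes every cell as a string. The class name, the sheet name `data`, and the file name `table.xlsx` are placeholders chosen for the example.

```java
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class XlsxTableWriter {

    // Writes each inner list as one spreadsheet row of string cells.
    static void writeXlsx(List<List<String>> rows, OutputStream out) throws IOException {
        try (Workbook wb = new XSSFWorkbook()) {
            Sheet sheet = wb.createSheet("data"); // sheet name is arbitrary
            for (int r = 0; r < rows.size(); r++) {
                Row row = sheet.createRow(r);
                List<String> cells = rows.get(r);
                for (int c = 0; c < cells.size(); c++) {
                    row.createCell(c).setCellValue(cells.get(c));
                }
            }
            wb.write(out);
        }
    }

    public static void main(String[] args) throws IOException {
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("Name", "Qty"),
                Arrays.asList("Widget A", "2"));
        try (OutputStream out = Files.newOutputStream(Paths.get("table.xlsx"))) {
            writeXlsx(rows, out);
        }
    }
}
```

For the older XLS format, `HSSFWorkbook` can be swapped in for `XSSFWorkbook`; the rest of the code goes through the shared `Workbook` interface.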

Extract text from a pdf file using Apache Tika in java

Extract text from image in java using tika library

How to convert the excel file data into JSON format using GSON library?

How to extract data from a pdf file using JPedal?

Extract text from a large pdf with Tika

How to extract the data from a pdf File using iText

How to get Header and Footer from PDF file using apache tika in java

How to use Apache Tika to extract text from a .wps file?

How to read first few pages of a PDF file using TIKA

How to convert csv to nested beans using OpenCSV or any other library?


The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please cite the site URL or the original address. For any questions, please contact: yoyou2525@163.com.


solution1 1 ACCEPTED 2016-03-28 17:51:42
