Java PDF to Excel Conversion
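I am extracting data from a PDF into Excel. The PDF also contains tables. I use iText to convert the PDF to text and Apache POI to write that text to Excel, but I cannot retrieve the data in a form I can store in a database. I also tried PDFBox and Aspose and got the same result. If anyone knows how to solve this, please help.
Here is my code: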
// Convert the PDF to text using iText
PdfReader reader = new PdfReader(
        "C:\\Users\\mohmeds\\Desktop\\BOI_SCFS banking.pdf_page_1.pdf");
PdfReaderContentParser parser = new PdfReaderContentParser(
        reader);
// PrintWriter out = new PrintWriter(new FileOutputStream(txt));
TextExtractionStrategy strategy;
StringBuilder text = new StringBuilder();
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
    strategy = parser.processContent(i,
            new SimpleTextExtractionStrategy());
    // Append every page; assigning the result directly keeps only the last page
    text.append(strategy.getResultantText()).append('\n');
}
String line = text.toString();
reader.close();
// Convert the text to Excel using Apache POI
org.apache.poi.ss.usermodel.Workbook wb = new HSSFWorkbook();
CreationHelper helper = wb.getCreationHelper();
Sheet sheet = wb.createSheet("new sheet");
System.out.println("link------->" + line);
List<String> lines = IOUtils.readLines(new StringReader(line));
for (int i = 0; i < lines.size(); i++) {
    // One spreadsheet row per text line, one cell per comma-separated piece
    String[] str = lines.get(i).split(",");
    Row row = sheet.createRow(i);
    for (int j = 0; j < str.length; j++) {
        row.createCell(j).setCellValue(
                helper.createRichTextString(str[j]));
    }
}
FileOutputStream fileOut = new FileOutputStream(
        "C:\\Users\\mohmeds\\Desktop\\someName1.xls");
wb.write(fileOut);
fileOut.close();
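Your question is a little vague, but if you want to store the data from your PDF in a database, you probably want to extract it as CSV rather than Excel. The code below uploads the PDF to the PDFTables web API, which also skips the intermediate steps of converting PDF to text and then text to Excel. Choose "csv" when specifying the format: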
package com.pdftables.examples;

import java.io.File;
import java.util.Arrays;
import java.util.List;

import org.apache.commons.io.FileUtils;
import org.apache.http.HttpEntity;
import org.apache.http.client.config.CookieSpecs;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.entity.mime.MultipartEntityBuilder;
import org.apache.http.entity.mime.content.FileBody;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class ConvertToFile {
    private static List<String> formats = Arrays.asList(new String[] { "csv", "xml", "xlsx-single", "xlsx-multiple" });

    public static void main(String[] args) throws Exception {
        if (args.length != 3) {
            System.out.println("Command line: <API_KEY> <FORMAT> <PDF filename>");
            System.exit(1);
        }

        final String apiKey = args[0];
        final String format = args[1].toLowerCase();
        final String pdfFilename = args[2];

        if (!formats.contains(format)) {
            System.out.println("Invalid output format: \"" + format + "\"");
            System.exit(1);
        }

        // Avoid cookie warning with default cookie configuration
        RequestConfig globalConfig = RequestConfig.custom().setCookieSpec(CookieSpecs.STANDARD).build();

        File inputFile = new File(pdfFilename);

        if (!inputFile.canRead()) {
            System.out.println("Can't read input PDF file: \"" + pdfFilename + "\"");
            System.exit(1);
        }

        try (CloseableHttpClient httpclient = HttpClients.custom().setDefaultRequestConfig(globalConfig).build()) {
            HttpPost httppost = new HttpPost("https://pdftables.com/api?format=" + format + "&key=" + apiKey);
            FileBody fileBody = new FileBody(inputFile);

            HttpEntity requestBody = MultipartEntityBuilder.create().addPart("f", fileBody).build();
            httppost.setEntity(requestBody);

            System.out.println("Sending request");

            try (CloseableHttpResponse response = httpclient.execute(httppost)) {
                if (response.getStatusLine().getStatusCode() != 200) {
                    System.out.println(response.getStatusLine());
                    System.exit(1);
                }
                HttpEntity resEntity = response.getEntity();
                if (resEntity != null) {
                    final String outputFilename = getOutputFilename(pdfFilename, format.replaceFirst("-.*$", ""));
                    System.out.println("Writing output to " + outputFilename);

                    final File outputFile = new File(outputFilename);
                    FileUtils.copyToFile(resEntity.getContent(), outputFile);
                } else {
                    System.out.println("Error: file missing from response");
                    System.exit(1);
                }
            }
        }
    }

    private static String getOutputFilename(String pdfFilename, String suffix) {
        if (pdfFilename.length() >= 5 && pdfFilename.toLowerCase().endsWith(".pdf")) {
            return pdfFilename.substring(0, pdfFilename.length() - 4) + "." + suffix;
        } else {
            return pdfFilename + "." + suffix;
        }
    }
}
https://github.com/pdftables/java-pdftables-api/blob/master/pdftables.java
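Once you have the CSV output, the remaining step is turning each line into a row of fields you can bind to an INSERT statement. Here is a minimal sketch of that step using only the standard library. The class CsvRows, its two methods, and the sample data are hypothetical and not part of the PDFTables example. It assumes ordinary comma-separated output with double-quoted fields and does not handle fields that contain line breaks, so a dedicated CSV library may be a better fit for real data.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the PDFTables example.
class CsvRows {

    // Splits one CSV line into fields, honouring double-quoted fields and
    // doubled quotes ("") inside them. Fields with line breaks are not handled.
    static List<String> parseLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"'); // escaped quote inside a quoted field
                        i++;
                    } else {
                        inQuotes = false;    // closing quote
                    }
                } else {
                    current.append(c);       // commas are literal while quoted
                }
            } else if (c == '"') {
                inQuotes = true;
            } else if (c == ',') {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }

    // Parses every non-empty line of a CSV document into a row of fields.
    static List<List<String>> parse(String csv) {
        List<List<String>> rows = new ArrayList<>();
        for (String line : csv.split("\\r?\\n")) {
            if (!line.isEmpty()) {
                rows.add(parseLine(line));
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for the converted file's contents
        String csv = "Date,Description,Amount\n"
                + "2015-01-02,\"Transfer, incoming\",\"1,250.00\"\n";
        for (List<String> row : parse(csv)) {
            System.out.println(row.size() + " fields: " + row);
        }
    }
}
```

Each inner list then maps onto one database row, for example by binding the fields to a JDBC PreparedStatement in order. Note that splitting on "," alone, as in the question's code, would break quoted values such as "1,250.00" into two cells.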