How to extract text from PDFs with Pentaho?
How can I read text from PDF files using Pentaho?
Is there a solution that uses only built-in Java libraries?
Just add a Modified Java Script Value step with the following code:
// The path is hard-coded here; the name of a column in the flow can be used instead.
var reader = new com.lowagie.text.pdf.PdfReader("c:\\temp\\mypdf.pdf");
var pdfTE = new com.lowagie.text.pdf.parser.PdfTextExtractor(reader);
var noOfPages = reader.getNumberOfPages();
var textPDF = "";
// Page numbers start at 1.
for (var i = 1; i <= noOfPages; i++) {
    textPDF += pdfTE.getTextFromPage(i);
}
reader.close();
I followed these steps:
2.a. Class code
import java.io.IOException;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

//String firstnameField;
//String lastnameField;
String nameField;

//https://www.w3schools.blog/itext-read-pdf-file-in-java
public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException
{
    // First, get a row from the default input hop
    Object[] r = getRow();

    // If the row object is null, we are done processing.
    if (r == null) {
        setOutputDone();
        return false;
    }

    if (first) {
        //firstnameField = getParameter("FIRSTNAME_FIELD");
        //lastnameField = getParameter("LASTNAME_FIELD");
        nameField = getParameter("NAME_FIELD");
        first = false;
    }

    // It is always safest to call createOutputRow() to ensure that your output row's Object[] is large
    // enough to handle any new fields you are creating in this step.
    Object[] outputRow = createOutputRow(r, data.outputRowMeta.size());
    //String firstname = get(Fields.In, firstnameField).getString(r);
    //String lastname = get(Fields.In, lastnameField).getString(r);

    // Accumulate the text of every page; assigning inside the loop would keep only the last page.
    StringBuilder allPages = new StringBuilder();
    try {
        // Create the PdfReader instance. The path is hard-coded here;
        // it could instead be read from a column in the flow.
        String path = "C:\\Users\\myusername\\Downloads\\myPDF.pdf";
        path = path.replace("\\", "/");
        PdfReader pdfReader = new PdfReader(path);
        try {
            // Get the number of pages in the PDF.
            int pages = pdfReader.getNumberOfPages();
            // Iterate through the pages (page numbers start at 1).
            for (int i = 1; i <= pages; i++) {
                // Extract the page content using PdfTextExtractor.
                String pageText = PdfTextExtractor.getTextFromPage(pdfReader, i);
                // Print the page content on the console.
                System.out.println("Content on Page " + i + ": " + pageText);
                allPages.append(pageText);
            }
        } finally {
            // Close the PdfReader even if extraction fails.
            pdfReader.close();
        }
    } catch (Exception e) {
        // Fail the step instead of silently emitting an empty value.
        throw new KettleException("Unable to read text from the PDF", e);
    }
    String pageContent = allPages.toString();

    // Set the value in the output field
    get(Fields.Out, nameField).setValue(outputRow, pageContent);

    // putRow will send the row on to the default output hop.
    putRow(data.outputRowMeta, outputRow);
    return true;
}
The PDF content will be in the result field pageContent.
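To get the text of the whole document into one field, the per-page strings have to be accumulated rather than overwritten on each iteration. The following is a minimal standalone sketch of that accumulation, not the step code itself. The class name PdfTextJoiner, the method joinPages, and the newline separator are assumptions made for illustration. The pageText function stands in for a real extractor call such as PdfTextExtractor.getTextFromPage(pdfReader, i), so the sketch runs without iText on the classpath.

```java
import java.util.function.IntFunction;

public class PdfTextJoiner {

    /**
     * Concatenates the text of pages 1..pageCount, separated by newlines.
     * pageText is a hypothetical stand-in for the real per-page extractor.
     */
    public static String joinPages(int pageCount, IntFunction<String> pageText) {
        StringBuilder all = new StringBuilder();
        // PDF page numbers are 1-based, so the loop runs from 1 to pageCount inclusive.
        for (int i = 1; i <= pageCount; i++) {
            if (i > 1) {
                all.append('\n');
            }
            all.append(pageText.apply(i));
        }
        return all.toString();
    }

    public static void main(String[] args) {
        // Fake three-page document; a real caller would pass the extractor here.
        String text = joinPages(3, i -> "page " + i);
        System.out.println(text);
    }
}
```

Inside a User Defined Java Class step a plain loop is probably the safer form, since the step's embedded compiler may not accept lambda expressions.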
My environment: