Reading table or cell values from a PDF file using Java?

Question

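I have searched the Java and PDF forums for a way to extract text values from a table in a PDF file, but I could not find any solution except JPedal (which is not open source and requires a license).

So I would like to know whether any open-source API, such as pdfbox or itext, can achieve the same result as JPedal.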

Reference example:

[image: sample table]

Answer 1

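In the comments, the OP clarified that he locates the text values he wants to extract from the table in the PDF file

"By providing X and Y co-ordinates"

So while the question at first sounded like it was about generic extraction of tabular data from PDFs (which can be difficult, to say the least), it is really about extracting the text from a rectangular region on a page, given by coordinates.

This is possible with either of the libraries you mentioned (and surely with others, too).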

iText

To restrict the region from which text is extracted, you can use a RegionTextRenderFilter in a FilteredTextRenderListener, e.g.:

/**
 * Parses a specific area of a PDF to a plain text file.
 * @param pdf the original PDF
 * @param txt the resulting text
 * @throws IOException
 */
public void parsePdf(String pdf, String txt) throws IOException {
    PdfReader reader = new PdfReader(pdf);
    PrintWriter out = new PrintWriter(new FileOutputStream(txt));
    Rectangle rect = new Rectangle(70, 80, 490, 580);
    RenderFilter filter = new RegionTextRenderFilter(rect);
    TextExtractionStrategy strategy;
    for (int i = 1; i <= reader.getNumberOfPages(); i++) {
        strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
        out.println(PdfTextExtractor.getTextFromPage(reader, i, strategy));
    }
    out.flush();
    out.close();
    reader.close();
}

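(From the ExtractPageContentArea sample of iText in Action, 2nd edition)

Beware, though: iText extracts text based on the basic text chunks in the content stream, not based on each individual glyph in such a chunk. Thus, the whole chunk is processed even if only the tiniest part of it lies inside the region.

This may or may not suit you.

If you run into the problem that more is extracted than you wanted, you should split the chunks into their constituent glyphs beforehand. This stackoverflow answer explains how to do that.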
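
To illustrate the chunk-versus-glyph caveat, here is a minimal, library-free sketch. It is a simplified model and not iText code: the Chunk class, its fixed glyph width, and the rule "a run is kept when its baseline touches the region" are assumptions made for illustration only.

```java
import java.awt.geom.Line2D;
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

/** Simplified model of region-based text filtering; hypothetical, not iText code. */
class ChunkFilterSketch {

    /** A run of text on a horizontal baseline starting at (x, y); every glyph has the same width. */
    static final class Chunk {
        final String text;
        final double x, y, glyphWidth;

        Chunk(String text, double x, double y, double glyphWidth) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.glyphWidth = glyphWidth;
        }

        /** Baseline segment covered by the whole run. */
        Line2D baseline() {
            return new Line2D.Double(x, y, x + text.length() * glyphWidth, y);
        }

        /** Splits the run into one single-character chunk per glyph. */
        List<Chunk> splitIntoGlyphs() {
            List<Chunk> glyphs = new ArrayList<>();
            for (int i = 0; i < text.length(); i++) {
                glyphs.add(new Chunk(String.valueOf(text.charAt(i)), x + i * glyphWidth, y, glyphWidth));
            }
            return glyphs;
        }
    }

    /** Keeps every chunk whose baseline touches the region and concatenates the kept text. */
    static String extract(List<Chunk> chunks, Rectangle2D region) {
        StringBuilder sb = new StringBuilder();
        for (Chunk c : chunks) {
            if (region.intersectsLine(c.baseline())) {
                sb.append(c.text);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // One chunk spanning two table columns; the region covers only the second column.
        Chunk row = new Chunk("Name  Price", 0, 100, 10);
        Rectangle2D region = new Rectangle2D.Double(65, 90, 50, 20);

        List<Chunk> whole = new ArrayList<>();
        whole.add(row);

        System.out.println("chunk-level: " + extract(whole, region));
        System.out.println("glyph-level: " + extract(row.splitIntoGlyphs(), region));
    }
}
```

In this model, filtering the unsplit chunk returns the text of both columns, while filtering the individual glyphs returns only the part that lies in the region.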

PDFBox

To restrict the region from which text is extracted, you can use the PDFTextStripperByArea, e.g.:

PDDocument document = PDDocument.load( args[0] );
if( document.isEncrypted() )
{
    document.decrypt( "" );
}
PDFTextStripperByArea stripper = new PDFTextStripperByArea();
stripper.setSortByPosition( true );
Rectangle rect = new Rectangle( 10, 280, 275, 60 );
stripper.addRegion( "class1", rect );
List allPages = document.getDocumentCatalog().getAllPages();
PDPage firstPage = (PDPage)allPages.get( 0 );
stripper.extractRegions( firstPage );
System.out.println( "Text in the area:" + rect );
System.out.println( stripper.getTextForRegion( "class1" ) );
document.close();

(ExtractTextByArea from the PDFBox 1.8.8 examples)
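
Note that the two examples use different Rectangle classes. The iText Rectangle is built from two corners (lower-left x, lower-left y, upper-right x, upper-right y) in PDF coordinates, whose origin is at the bottom left of the page. The PDFBox example uses a java.awt.Rectangle (x, y, width, height), and as far as I know PDFTextStripperByArea measures y downward from the top of the page. The following is a rough sketch of a conversion between the two. The helper name toTopLeftRegion and the page height of 842 are assumptions for illustration, and the sketch only holds for an unrotated page whose lower-left corner is at (0, 0).

```java
import java.awt.Rectangle;

/** Hypothetical helper; not part of iText or PDFBox. */
class RegionConversionSketch {

    /**
     * Converts a region given by its lower-left and upper-right corners in a
     * bottom-left-origin coordinate system into an (x, y, width, height)
     * rectangle in a top-left-origin coordinate system.
     */
    static Rectangle toTopLeftRegion(float llx, float lly, float urx, float ury, float pageHeight) {
        int x = Math.round(llx);
        int y = Math.round(pageHeight - ury); // distance from the top edge of the page to the top of the region
        int width = Math.round(urx - llx);
        int height = Math.round(ury - lly);
        return new Rectangle(x, y, width, height);
    }

    public static void main(String[] args) {
        // The region from the iText example above, on a page assumed to be 842 units high.
        Rectangle region = toTopLeftRegion(70, 80, 490, 580, 842);
        System.out.println(region);
    }
}
```

For the region (70, 80, 490, 580) and a page height of 842, this gives x = 70, y = 262, width = 420, height = 500.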

Answer 2

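Try PDFTextStream. At least I was able to identify the column values with it. Before that I was using iText and got stuck defining a strategy. It's hard.

This API separates column cells by putting more spaces between them. The spacing is fixed, so you can build your own logic on top of it. (This was missing in iText.)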

import com.snowtide.PDF;
import com.snowtide.pdf.Document;
import com.snowtide.pdf.OutputTarget;

public class PDFText {
    public static void main(String[] args) throws java.io.IOException {
        String pdfFilePath = "xyz.pdf";

        Document pdf = PDF.open(pdfFilePath);
        StringBuilder text = new StringBuilder(1024);
        pdf.pipe(new OutputTarget(text));
        pdf.close();
        System.out.println(text);
   }
}
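
The "logic" mentioned above could be sketched roughly as follows. This is an illustration and not part of PDFTextStream: it assumes that column cells are separated by runs of at least two spaces and that a single space only occurs inside a cell. The real spacing may differ, so the pattern would need adjusting to the actual output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical post-processing of extracted text; not part of PDFTextStream. */
class ColumnSplitSketch {

    // Assumption: two or more consecutive spaces mark a column boundary.
    private static final Pattern COLUMN_GAP = Pattern.compile(" {2,}");

    /** Splits one line of extracted text into its cells. */
    static List<String> splitCells(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return Arrays.asList(COLUMN_GAP.split(trimmed));
    }

    /** Splits multi-line extracted text into rows of cells, skipping blank lines. */
    static List<List<String>> parseTable(String text) {
        List<List<String>> rows = new ArrayList<>();
        for (String line : text.split("\\r?\\n")) {
            List<String> cells = splitCells(line);
            if (!cells.isEmpty()) {
                rows.add(cells);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Made-up sample text standing in for the extracted output.
        String text = "Item      Qty    Price\nBlue pen    3    1.20\n";
        for (List<String> row : parseTable(text)) {
            System.out.println(row);
        }
    }
}
```

The text collected in the StringBuilder of the example above could be passed to parseTable as text.toString().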

A question related to this has already been asked on stackoverflow!

2 answers

Answer 1 (5, 2015-02-03 09:19:53): iText, PDFBox

Answer 2 (0, 2016-10-06 16:45:29)
