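How to split a PDF file by result in Java PDFBox
I have a PDF file that contains 60 pages. Each page carries an invoice number; some numbers are unique and some repeat across pages. I am using Apache PDFBox.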
import java.io.*;
import org.apache.pdfbox.pdmodel.*;
import org.apache.pdfbox.util.*;
import java.util.regex.*;

public class PDFtest1 {
    public static void main(String[] args) {
        PDDocument pd;
        try {
            File input = new File("G:\\Sales.pdf");
            // StringBuilder to store the extracted text
            StringBuilder sb = new StringBuilder();
            pd = PDDocument.load(input);
            PDFTextStripper stripper = new PDFTextStripper();
            // Add the text of the whole PDF to the StringBuilder
            sb.append(stripper.getText(pd));
            // The dot is escaped so it only matches a literal "."
            Pattern p = Pattern.compile("Invoice No\\.\\s\\w\\d\\d\\d\\d\\d\\d\\d\\d\\d\\d");
            // The Matcher scans the extracted text for the pattern
            Matcher m = p.matcher(sb);
            while (m.find()) {
                // group() returns the text matched by the most recent find()
                System.out.println(m.group());
            }
            if (pd != null) {
                pd.close();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
I am able to read all the invoice numbers using Java regex. The final result is as follows:
run:
Invoice No. D0000003010
Invoice No. D0000003011
Invoice No. D0000003011
Invoice No. D0000003011
Invoice No. D0000003011
Invoice No. D0000003012
Invoice No. D0000003012
Invoice No. D0000003012
Invoice No. D0000003013
Invoice No. D0000003013
Invoice No. D0000003014
Invoice No. D0000003014
Invoice No. D0000003015
Invoice No. D0000003016
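To work with the numbers themselves rather than the whole matched line, a capture group can isolate the part after the label. The following is a minimal sketch using only the JDK, assuming the invoice number is always one word character followed by ten digits, as in the output above. The class name `InvoiceNumberExtractor` and the method `extractAll` are hypothetical and not part of the original post.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InvoiceNumberExtractor {
    // Literal "Invoice No." (dot escaped), whitespace, then the number in group 1.
    // Assumption: one word character followed by exactly ten digits.
    private static final Pattern INVOICE_NO =
            Pattern.compile("Invoice No\\.\\s+(\\w\\d{10})");

    /** Returns every invoice number found in the text, in order of appearance. */
    static List<String> extractAll(String text) {
        List<String> numbers = new ArrayList<>();
        Matcher m = INVOICE_NO.matcher(text);
        while (m.find()) {
            numbers.add(m.group(1)); // group 1 excludes the label and the whitespace
        }
        return numbers;
    }

    public static void main(String[] args) {
        String sample = "Invoice No. D0000003010\nTotal 12.50\nInvoice No. D0000003011\n";
        System.out.println(extractAll(sample)); // prints [D0000003010, D0000003011]
    }
}
```

Because the group excludes the whitespace, the returned strings can be compared or used as file names without trimming.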
I need to split the PDF by invoice number. For example, all pages with invoice number D0000003011 should be merged into a single PDF, and so on. How can I achieve that?
// Uses the PDFBox 1.8.x API. In addition to the imports above, this needs:
// import org.apache.pdfbox.exceptions.COSVisitorException;
public static void main(String[] args) throws IOException, COSVisitorException
{
    File input = new File("G:\\Sales.pdf");
    PDDocument outputDocument = null;
    PDDocument inputDocument = PDDocument.loadNonSeq(input, null);
    PDFTextStripper stripper = new PDFTextStripper();
    // The whitespace stays outside the group so the file name has no leading blank
    Pattern p = Pattern.compile("Invoice No\\.\\s(\\w\\d\\d\\d\\d\\d\\d\\d\\d\\d\\d)");
    String currentNo = null;
    for (int page = 1; page <= inputDocument.getNumberOfPages(); ++page)
    {
        // extract the text of this page only
        stripper.setStartPage(page);
        stripper.setEndPage(page);
        String text = stripper.getText(inputDocument);
        Matcher m = p.matcher(text);
        String no = null;
        if (m.find())
        {
            no = m.group(1);
        }
        System.out.println("page: " + page + ", value: " + no);
        PDPage pdPage = (PDPage) inputDocument.getDocumentCatalog().getAllPages().get(page - 1);
        if (no != null && !no.equals(currentNo))
        {
            // the invoice number changed: save the previous document
            saveCloseCurrent(currentNo, outputDocument);
            // create new document
            outputDocument = new PDDocument();
            currentNo = no;
        }
        if (no == null && currentNo == null)
        {
            System.out.println("header page ??? " + page + " skipped");
            continue;
        }
        // append page to current document
        outputDocument.importPage(pdPage);
    }
    saveCloseCurrent(currentNo, outputDocument);
    inputDocument.close();
}

private static void saveCloseCurrent(String currentNo, PDDocument outputDocument)
        throws IOException, COSVisitorException
{
    // save to new output file
    if (currentNo != null)
    {
        // save document into file
        File f = new File(currentNo + ".pdf");
        if (f.exists())
        {
            System.err.println("File " + f + " exists?!");
            System.exit(-1);
        }
        outputDocument.save(f);
        outputDocument.close();
    }
}
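The code above starts a new output document whenever the invoice number changes and aborts if the target file already exists, so it relies on pages of the same invoice being adjacent. If they might not be, one option is to first collect the page numbers per invoice and only then build the output documents. The following is a minimal sketch of that grouping step using only the JDK; the class `InvoicePageGrouper`, the method `group`, and the idea of passing one extracted number (or null) per page are assumptions for illustration, not part of the original answer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class InvoicePageGrouper {
    /**
     * invoicePerPage.get(i) is the invoice number found on page i + 1, or null if none.
     * Returns invoice number -> 1-based page numbers, in first-seen order.
     * A page without a number is attached to the most recent invoice;
     * leading pages without any number are skipped.
     */
    static Map<String, List<Integer>> group(List<String> invoicePerPage) {
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        String current = null;
        for (int i = 0; i < invoicePerPage.size(); i++) {
            String no = invoicePerPage.get(i);
            if (no != null) {
                current = no;
            }
            if (current == null) {
                continue; // header page before the first invoice
            }
            groups.computeIfAbsent(current, k -> new ArrayList<>()).add(i + 1);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> perPage = Arrays.asList(
                null, "D0000003010", "D0000003011", null, "D0000003011", "D0000003010");
        System.out.println(group(perPage));
        // prints {D0000003010=[2, 6], D0000003011=[3, 4, 5]}
    }
}
```

Each map entry could then drive one output document: create a new PDDocument, import the listed pages, and save it under the invoice number.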
Beware:
Update 19.8.2015: