How to get a String between two Strings with different indexes in each pdf file using Java

I have more than 200 PDF report files. I need to get the VIN and the Case Number from each report and then rename the report to VIN + Case#.pdf.

Getting the VIN was easy, since it is always located at the beginning of the page and has a fixed length of 17 characters.

I'm having an issue with the Case Number: I cannot get the exact number because the index of "Case Number" changes from one report to another, depending on the number of words in the cells that come before the "Case Number" cell.

My question is: how can I tell Java to give me the string that sits between two spaces, the one after "Case Number" and the one before the "System Key" cell?

I tried splitting the text on spaces, but I got stuck on the logic for picking out that specific number regardless of its index.

NOTE: The Case Number is always different, and its length also varies.

Here is what I have so far:

    package Read_Pdf_AsA_Text;

    import java.io.File;
    import java.io.IOException;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;

    public class GetVinAndCaseNum {

        public static void main(String args[]) throws IOException {

            File folder = new File("C:\\Users\\" + System.getProperty("user.name") + "\\Desktop\\Tasks\\test\\");
            File[] listOfFiles = folder.listFiles();
            for (int i = 0; i < listOfFiles.length; i++) {

                if (listOfFiles[i].isFile()) {
                    File f = new File("C:\\Users\\" + System.getProperty("user.name") + "\\Desktop\\Tasks\\test\\" + listOfFiles[i].getName());

                    PDDocument document = PDDocument.load(f);
                    PDFTextStripper pdfStripper = new PDFTextStripper(); // Instantiate PDFTextStripper class
                    String text = pdfStripper.getText(document); // Retrieving text from PDF document
                    System.out.println(text);

                    if (text.contains("VIN")) {
                        int vinIndexIs = text.indexOf("VIN");
                        int newVINIndex = vinIndexIs + 3;
                        String vinNum = text.substring(newVINIndex, newVINIndex + 19);
                        System.err.println("New VIN is ===> " + vinNum);
                    }

                    int caseNo = 0;
                    if (text != null) {
                        String[] spcase = text.split(" ");
                        System.out.println("spaces ==> " + spcase);
                        boolean foundCaseNumber = false;
                        for (String stringAfterSpace : spcase) {
                            System.out.println("stringAfterSpace ==>  " + stringAfterSpace);

                            if (foundCaseNumber) {
                                caseNo = Integer.parseInt(stringAfterSpace.trim());
                                System.out.println("caseNo ==> " + caseNo);
                                break;
                            }
                            if ("Case Number".equals(stringAfterSpace)) {
                                System.out.println("Case Number issss ===> " + stringAfterSpace);
                                foundCaseNumber = true;
                            }
                        }
                        if (caseNo == 0) {
                            System.out.println("Case No. not found.");
                        }
                    }

                    document.close();

                    System.out.println("conversion is done");
                }
            }
        }
    }

    /*
     * import java.util.regex.Pattern;
     * import java.util.regex.Matcher;
     */
    String text = pdfStripper.getText(document); // Retrieving text from PDF document
    Pattern pattern = Pattern.compile("Case Number\\s+(\\d+)\\s"); // this is the regex
    Matcher matcher = pattern.matcher(text);
    if (matcher.find()) {
        // group(1) is the text captured by the parentheses, i.e. the digits
        System.err.println("Case Number is ==> " + matcher.group(1));
    }

Parts of the regex from the above code:

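  1. Case Number - a literal, i.e. search for this exact string.
  2. \\s+ - one or more consecutive whitespace characters (spaces, tabs, line breaks)
  3. (\\d+) - one or more digits, captured as group 1 by the parentheses
  4. \\s - a single whitespace character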

So the above code searches the text extracted from your PDF document for the string Case Number followed by one or more whitespace characters, followed by a number.

If the pattern is found, just the number is extracted via matcher.group(1).
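To try the pattern without a PDF at hand, here is a minimal, self-contained sketch that runs the same regex against a plain string. The class name, the helper method `extractCaseNumber` and the sample text are made up for illustration; the text PDFBox actually returns for your reports may be laid out differently, and the pattern assumes the case number consists of digits only.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CaseNumberRegexDemo {

    // Same pattern as above: the literal "Case Number", whitespace,
    // the digits (captured as group 1), then one whitespace character.
    private static final Pattern CASE_NUMBER =
            Pattern.compile("Case Number\\s+(\\d+)\\s");

    /** Hypothetical helper: returns the captured digits, or empty if there is no match. */
    public static Optional<String> extractCaseNumber(String text) {
        Matcher matcher = CASE_NUMBER.matcher(text);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Invented sample text, standing in for pdfStripper.getText(document).
        String sample = "Dealer Some Long Name\nCase Number 4567123 System Key 99";
        System.out.println(extractCaseNumber(sample).orElse("Case No. not found."));
    }
}
```

Because the pattern ends with `\\s`, a case number that is the very last thing in the text (with nothing after it) would not match; dropping the trailing `\\s` is one way to allow for that.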

Refer to this Web page:

https://docs.oracle.com/javase/tutorial/essential/regex/

I was able to find a solution, which is as follows: I replaced "Case Number" with "CaseNumber" to get rid of the space between the words "Case" and "Number", and then split the text on spaces. Then I applied the following logic:

        String caseNum = "";
        if (text != null) {
            String[] spcase = text.replace("Case Number", "CaseNumber").split(" ");
            boolean foundCaseNum = false;
            for (String stringAfterSpace : spcase) {

                if (foundCaseNum) {
                    caseNum = stringAfterSpace.trim();
                    System.err.println("Case Number is ==> " + caseNum);
                    break;
                }
                if (stringAfterSpace.contains("CaseNumber")) {
                    foundCaseNum = true;
                }
            }
            if (caseNum.isEmpty()) {
                System.out.println("Case No. not found.");
            }
        }

        document.close();

        System.out.println("conversion is done");
    }
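
For comparison, the idea stated in the question (take whatever sits between "Case Number" and "System Key") can also be sketched with plain `indexOf` and `substring`, without splitting at all. This is only a sketch: the class name, the `between` helper and the sample string are invented for illustration, and it assumes that "System Key" really does follow the case number in the extracted text, which depends on the order in which PDFBox emits the table cells.

```java
public class BetweenMarkersDemo {

    /**
     * Hypothetical helper: returns the trimmed text between the first
     * occurrence of start and the next occurrence of end after it,
     * or null if either marker is missing.
     */
    public static String between(String text, String start, String end) {
        int from = text.indexOf(start);
        if (from < 0) {
            return null;
        }
        from += start.length();            // skip past the start marker itself
        int to = text.indexOf(end, from);  // look for the end marker after it
        if (to < 0) {
            return null;
        }
        return text.substring(from, to).trim();
    }

    public static void main(String[] args) {
        // Invented sample text, standing in for pdfStripper.getText(document).
        String sample = "Dealer Some Long Name Case Number 4567123 System Key 99";
        System.out.println(between(sample, "Case Number", "System Key"));
    }
}
```

Unlike the split-based loop, this does not care how many words come before "Case Number", and the result stays a String, so a case number containing letters or leading zeros is kept as-is.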


 