
How to extract the x0, y0 coordinates of an input field in a PDF

I want to scrape a PDF document and get the coordinates of its input fields (the bottom-left corner point of each text field). Is there a way to accomplish that using a Python library such as PyPDF2 or PDFMiner? The following images may help illustrate the problem.

[image]

Usually, such fields are rendered as a run of repeated periods or underscores. You can extract the text lines of the PDF with PyMuPDF, use a regular expression (import re) to identify such runs, and save the coordinates to a list whenever a match is found.
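As a minimal illustration of the matching step on its own, the sketch below finds runs of periods or underscores in a single line of text. The pattern, the threshold of three characters, and the name find_field_spans are assumptions made for this example; they are not taken from the answer's code.

```python
import re

# Assumption: a "field" is a run of at least three periods or underscores.
FIELD_PATTERN = re.compile(r"_{3,}|\.{3,}")


def find_field_spans(line):
    """Return (start, end, text) for each run of filler characters in line."""
    return [(m.start(), m.end(), m.group(0)) for m in FIELD_PATTERN.finditer(line)]


# Character offsets within the line, not page coordinates.
print(find_field_spans("Name: ________  Date: ......"))
```

These offsets only locate the field inside the string; the page coordinates still have to come from the PDF library.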

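The code below does this. It stores each field as a rectangle (x0, y0, x1, y1). Note that PyMuPDF places the origin at the top-left of the page with y increasing downward, so (x0, y0) is the top-left corner and (x1, y1) is the bottom-right corner; the bottom-left corner is therefore (x0, y1). You can extract the values you need.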

    import re

    # Returns the regex matches in one text line that look like a fill-in field:
    # a unit ("…", or "." plus one character) repeated at least twice.
    def whichFields(self, txtline):
        reg = re.compile(r"(…|\..)\1+")
        matches = list(reg.finditer(txtline))
        self.matches.extend(matches)
        return matches

    # Uses PyMuPDF to find box coordinates of the fields found by whichFields().
    # Returns a list of the rectangles in the order in which the fields
    # appear in the document.
    def whereFields(self):
        global c
        field_areas = []
        for page_num, page in enumerate(self.doc):
            c = self.newCanvas(page_num)  # helper from the surrounding class (not shown)
            # self.doc was opened with fitz; split the page text into lines.
            # Older PyMuPDF releases spell these calls getText() and searchFor().
            txtlines = page.get_text("text").split("\n")
            for txtline in txtlines:
                for match in self.whichFields(txtline):
                    # Search the page for the whole matched string; search_for()
                    # returns a list of rectangles where that string is found.
                    self.areas = page.search_for(match.group(0))
                    for area in self.areas:
                        field_areas.append(area)
        return field_areas
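A possible alternative, sketched here under stated assumptions rather than taken from the answer, is to read word-level boxes directly instead of searching for each matched string again. PyMuPDF's page.get_text("words") returns tuples of the form (x0, y0, x1, y1, text, block_no, line_no, word_no). The function below (field_corners is a hypothetical name) takes such tuples and returns the bottom-left corner of every word that consists only of periods or underscores, converted to a bottom-left page origin. It assumes an unrotated page whose box starts at (0, 0), and it will miss fields that are glued to a label without a space.

```python
import re

# Assumption: a field is a word made up entirely of 3+ periods or underscores.
FIELD_PATTERN = re.compile(r"[._]{3,}")


def field_corners(words, page_height):
    """Return bottom-left corners (x, y) of field-like words.

    words: tuples shaped like the output of PyMuPDF's page.get_text("words"),
           i.e. (x0, y0, x1, y1, text, block_no, line_no, word_no), with the
           origin at the top-left of the page and y growing downward.
    page_height: page height in points, used to flip y to a bottom-left origin.
    """
    corners = []
    for x0, y0, x1, y1, text, *_ in words:
        if FIELD_PATTERN.fullmatch(text):
            # The lower edge of the box is y1 in top-left coordinates.
            corners.append((x0, page_height - y1))
    return corners


# Hand-written sample data standing in for one line of a US Letter page.
sample_words = [
    (72.0, 100.0, 110.0, 112.0, "Name:", 0, 0, 0),
    (115.0, 100.0, 300.0, 112.0, "__________", 0, 0, 1),
]
print(field_corners(sample_words, 792.0))  # [(115.0, 680.0)]

# Usage with PyMuPDF (not executed here; "form.pdf" is a placeholder name):
#   import fitz
#   doc = fitz.open("form.pdf")
#   for page in doc:
#       print(field_corners(page.get_text("words"), page.rect.height))
```

If you want to stay in PyMuPDF's own coordinate system, skip the flip and use (x0, y1) directly.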
