
PDFQuery: get Page number where element is located

This is the first time I have used PDFQuery to scrape PDFs.

I need to get prices from a multi-page price list: I want to give PDFQuery a product code, have it find that code, and return the price next to it. The very first example on the GitHub page gets the location of the text, and it explicitly says "Note that we don't have to know where the name is on the page, or what page it's on". That is the case with my price list, but all the other examples specify the page number ( LTPage[pageid=1] ), and I don't see where the page number would come from.

If I don't specify the page number, it returns ALL the texts in the same location for ALL the pages.

Also, I added an exactText function because the codes could be, for example, "92005", "92005C", "92005G", so using :contains alone doesn't help much.

I've tried selecting the page where the element is located, and using jQuery's .closest, both with no luck.

I checked the PDFMiner documentation and the PyQuery documentation, but I see nothing that helps me =(

My code looks like this right now:

import pdfquery

pdf = pdfquery.PDFQuery("tests/samples/priceList.pdf")
pdf.load()

code = "92005G"

def exactText():
    element = str(vars(this))
    text = str("u'" + code + "\\n'")
    if text in element:
        return True
    return False

# This should work if I could select the page where the element is located
#page = pdf.pq('LTPage:contains("'+code+'")')
#pageNum = page.attr('pageid')

# Here I would replace the "8" with the page number I get, or remove the LTPage
# selector altogether if I need to find the element first and then the page
label = pdf.pq('LTPage[page_index="8"] LTTextLineHorizontal:contains("'+code+'")').filter(exactText)

# Since we can use jQuery-style selectors I tried ".closest", but it returns nothing
#page = label.closest('LTPage')
#pageNum = page.attr('pageid')

left_corner = float(label.attr('x0'))
bottom_corner = float(label.attr('y0'))

# Here I would replace the "8" with the page number I get
price = pdf.pq('LTPage[page_index="8"] LTTextLineHorizontal:in_bbox("%s, %s, %s, %s")' % (left_corner+110, bottom_corner, left_corner+140, bottom_corner+20)).text()
print price

Any help is very appreciated, guys and girls!!!

There may be a more elegant way, but what I used to find the page an element is on is .iterancestors('LTPage'). The example code below finds every instance of "My Text" and tells you which page each one is on:

for pq in pdf.pq('LTTextLineHorizontal:contains("My Text")'):
    page_pq = pq.iterancestors('LTPage').next()   # Use just the first ancestor
    print 'Found the text "%s" on page %s' % ( pq.layout.get_text(), page_pq.layout.pageid)

I hope that helps! :)
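To see what the ancestor walk does without needing a PDF, here is a stdlib-only sketch on a toy XML tree shaped roughly like pdfquery's output. xml.etree.ElementTree has no iterancestors, so this builds a child-to-parent map instead of using the lxml method; the tree contents and the find_page_index helper are made up for illustration and are not part of pdfquery.

```python
import xml.etree.ElementTree as ET

# Toy tree imitating the shape of a pdfquery document (contents are invented)
TOY_XML = """
<pdfxml>
  <LTPage page_index="0" pageid="1">
    <LTTextLineHorizontal x0="50" y0="700">92005</LTTextLineHorizontal>
  </LTPage>
  <LTPage page_index="1" pageid="2">
    <LTTextBoxHorizontal>
      <LTTextLineHorizontal x0="50" y0="650">92005G</LTTextLineHorizontal>
    </LTTextBoxHorizontal>
  </LTPage>
</pdfxml>
"""

def find_page_index(root, code):
    """Return the page_index of the first line whose text is exactly `code`."""
    # ElementTree keeps no parent pointers, so record each child's parent
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for line in root.iter("LTTextLineHorizontal"):
        if (line.text or "").strip() != code:
            continue  # exact match only, so "92005" does not match "92005G"
        # Walk upwards until an LTPage is reached (or the root is passed)
        node = line
        while node is not None and node.tag != "LTPage":
            node = parent_of.get(node)
        if node is not None:
            return int(node.get("page_index"))
    return None

root = ET.fromstring(TOY_XML)
print(find_page_index(root, "92005G"))  # prints 1
print(find_page_index(root, "92005"))   # prints 0
```

With lxml elements, as in the answer above, the while loop collapses to taking the first item of element.iterancestors('LTPage').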

This should work in Python 3 (note the call to next(iterator) to get the first page ancestor):

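code = "92005G"

# Select the text line, not the page: an LTPage has no LTPage ancestor to walk up to.
# exactText is the filter from the question; its u'...' check assumes Python 2 reprs.
label = pdf.pq('LTTextLineHorizontal:contains("{}")'.format(code)).filter(exactText)

# label is a PyQuery object; label[0] is the underlying lxml element
page_pq = next(label[0].iterancestors('LTPage'))
# Read page_index directly, since that is the attribute the selector below filters on
# (pageid is a separate attribute and need not have the same value)
page_index = page_pq.get('page_index')

left_corner = float(label.attr('x0'))
bottom_corner = float(label.attr('y0'))

price = pdf.pq('LTPage[page_index="{0}"] LTTextLineHorizontal:in_bbox("{1}, {2}, {3}, {4}")'.format(page_index, left_corner+110, bottom_corner, left_corner+140, bottom_corner+20)).text()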
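Two pieces of the code above can be pulled out and checked on their own, without a PDF. The sketch below is only an illustration: is_exact_code and price_selector are hypothetical helper names, the offsets 110, 140 and 20 are the ones from the question, and it assumes the line text (for example from layout.get_text(), as in the first answer) is the product code followed by a newline. Comparing stripped text avoids relying on how Python prints the string, which is what the exactText filter in the question depends on.

```python
def is_exact_code(line_text, code):
    """True only when the whole line is the code, so '92005' != '92005G'."""
    return line_text.strip() == code

def price_selector(page_index, x0, y0, dx_left=110, dx_right=140, height=20):
    """Build the selector for the box to the right of the code's corner.

    The default offsets are the ones used in the question; they depend on
    the layout of that particular price list.
    """
    bbox = "%s, %s, %s, %s" % (x0 + dx_left, y0, x0 + dx_right, y0 + height)
    return 'LTPage[page_index="%s"] LTTextLineHorizontal:in_bbox("%s")' % (
        page_index, bbox)

print(is_exact_code("92005G\n", "92005G"))  # prints True
print(is_exact_code("92005G\n", "92005"))   # prints False
print(price_selector(8, 50.0, 700.0))
```

The string returned by price_selector can be passed to pdf.pq(...) exactly like the hand-built selector above.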
