I'm new to PDF parsing.
I want to detect whether a PDF file contains a graph, so that I can skip it and avoid extracting its text. All I know about the PDF is that it was generated from Word (not scanned).
Input: a PDF with a graph such as this one. Output: true or false.
pdfplumber recognizes tables but doesn't seem to recognize graphs. I tried detecting curves and rectangles, but the results are not consistent.
Maybe there's another way?
Thank you!
(Thanks to @KJ's comment.) I ended up using some rough estimates to decide whether a page contains a graph.
If there are more than MIN_RECTS rectangles on a page, I assume there's a graph there (its columns are detected as rectangles), and likewise if there are more than MIN_CURVES curves (for me that threshold was 0, but it depends on whether you have non-trivial shapes in the header or footer). It's not the best approach, but it works most of the time.
Example code: using both functions and then calling extract_text() leads to pretty good results for me.
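import pdfplumber

MIN_RECTS = 5   # placeholder value, not from my original setup; tune it for your documents
MIN_CURVES = 0  # 0 worked for me; raise it if headers/footers contain shapes

def contains_graphs(page):
    # Heuristic: many rectangles (bars) or any curves suggest a graph.
    return len(page.rects) > MIN_RECTS or len(page.curves) > MIN_CURVES

def only_chars_from_page_filter(page):
    # Keep only character objects, dropping lines, rects, curves and images.
    return page.filter(lambda obj: obj["object_type"] == "char")

page = pdfplumber.open("file.pdf").pages[0]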
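To show how the two helpers and extract_text() might fit together over a whole file, here is a minimal sketch. It assumes that a page flagged as containing a graph should be skipped entirely (returned as None), which is only one reading of the approach above. The names text_without_graph_pages and extract_file are made up for illustration, and the MIN_RECTS value is a placeholder.

```python
MIN_RECTS = 5   # assumed placeholder threshold; tune for your documents
MIN_CURVES = 0  # the value reported above as working for the author


def contains_graphs(page, min_rects=MIN_RECTS, min_curves=MIN_CURVES):
    """Heuristic: too many rectangles or curves means 'probably a graph'."""
    return len(page.rects) > min_rects or len(page.curves) > min_curves


def only_chars_from_page_filter(page):
    """Return a filtered view of the page that keeps only character objects."""
    return page.filter(lambda obj: obj["object_type"] == "char")


def text_without_graph_pages(pages):
    """Return one entry per page: None for graph pages, else the page text."""
    results = []
    for page in pages:
        if contains_graphs(page):
            results.append(None)  # assumption: skip the whole page
        else:
            results.append(only_chars_from_page_filter(page).extract_text())
    return results


def extract_file(path):
    """Hypothetical convenience wrapper around pdfplumber (third-party)."""
    import pdfplumber  # imported lazily so the helpers above work without it

    with pdfplumber.open(path) as pdf:
        return text_without_graph_pages(pdf.pages)
```

Calling extract_file("file.pdf") would then give a list with one item per page. Whether to drop a flagged page completely or still extract its filtered characters is a design choice that depends on how much body text shares a page with the graph.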
Following @G5W's comment, it is also possible to use pywin32 to open the PDF in Word and save it as an MS Word file, then extract only the text with python-docx, for example.
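A rough sketch of that route follows. It assumes Windows with a Word version that can open PDFs, plus the pywin32 and python-docx packages. The function names are hypothetical, and Word may still show a conversion prompt depending on its settings. Note that Document.paragraphs returns only body paragraphs, so text inside tables is also left out.

```python
import os

WD_FORMAT_DOCX = 16  # Word's wdFormatDocumentDefault constant (.docx)


def pdf_to_docx(pdf_path, docx_path):
    """Open a PDF in Word via COM and save it as .docx (Windows only)."""
    import win32com.client  # from pywin32; imported lazily

    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False
    try:
        # Word's COM interface expects absolute paths.
        doc = word.Documents.Open(
            os.path.abspath(pdf_path), ConfirmConversions=False, ReadOnly=True
        )
        doc.SaveAs2(os.path.abspath(docx_path), FileFormat=WD_FORMAT_DOCX)
        doc.Close(False)
    finally:
        word.Quit()


def join_paragraphs(paragraphs):
    """Join the text of non-empty paragraph-like objects, one per line."""
    return "\n".join(p.text for p in paragraphs if p.text.strip())


def docx_text(docx_path):
    """Read body-paragraph text from a .docx file with python-docx."""
    from docx import Document  # from python-docx; imported lazily

    return join_paragraphs(Document(docx_path).paragraphs)
```

A typical call would be pdf_to_docx("file.pdf", "file.docx") followed by docx_text("file.docx"). How cleanly the graph is separated from the text depends on how Word converts that particular PDF.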