
How to recognize a graph in a PDF using Python?

I'm new to PDF parsing.

I want to detect whether a PDF file contains a graph, so that I can skip it and not extract the text that belongs to it. All I know about the PDF is that it was generated from Word (not scanned).

Input: a PDF with a graph such as this one. Output should be true or false.

pdfplumber recognizes tables but doesn't seem to recognize graphs. I tried detecting curves and rectangles, but the results are not consistent.

Maybe there's another way?

Thank you!

Option 1:

(Thanks to @KJ's comment.) I ended up using some rough estimates to decide whether a page contains a graph.

If a page has more than MIN_RECTS rectangles, I assume it contains a graph (the columns of a bar chart are perceived as rectangles). Likewise, if it has more than MIN_CURVES curves, I assume it contains a graph (for me MIN_CURVES was 0, but that depends on whether you have non-trivial shapes in the header or footer). It's not the best approach, but it works most of the time.

Some example code follows. Using both functions, and then calling extract_text(), gives pretty good results for me.

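import pdfplumber

MIN_RECTS = 5   # placeholder value, tune it for your documents
MIN_CURVES = 0  # 0 worked for me

page = pdfplumber.open("file.pdf").pages[0]

def contains_graphs(page):
  # many rectangles (bars) or any curves suggest a graph
  return len(page.rects) > MIN_RECTS or len(page.curves) > MIN_CURVES

def only_chars_from_page_filter(page):
  # keep character objects only, dropping rects, curves and lines
  return page.filter(lambda obj: obj["object_type"] == "char")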
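
One way the heuristic could be applied to a whole file is sketched below. It is only an illustration, not a tested recipe: the helper names (`shapes_in_body`, `page_has_graph`, `extract_text_without_graph_pages`), the default thresholds and the `margin` parameter are all assumptions of mine. The `margin` is meant to address the header/footer caveat above by ignoring shapes that fall in a band at the top or bottom of the page. It relies on pdfplumber exposing `page.height` and the `top`/`bottom` keys on rect and curve objects.

```python
def shapes_in_body(objects, page_height, margin=0.0):
    """Keep shapes lying fully outside the top and bottom margin bands."""
    return [
        o for o in objects
        if o["top"] >= margin and o["bottom"] <= page_height - margin
    ]


def page_has_graph(page, min_rects=5, min_curves=0, margin=0.0):
    """Heuristic: many rectangles or any curves in the body suggest a graph.

    min_rects=5 is a placeholder threshold, not a recommended value.
    """
    rects = shapes_in_body(page.rects, page.height, margin)
    curves = shapes_in_body(page.curves, page.height, margin)
    return len(rects) > min_rects or len(curves) > min_curves


def extract_text_without_graph_pages(pages, **thresholds):
    """Return the text of every page that does not look like it has a graph."""
    texts = []
    for page in pages:
        if page_has_graph(page, **thresholds):
            continue  # skip the whole page, graph labels included
        texts.append(page.extract_text() or "")
    return texts


def main(path):
    import pdfplumber  # imported here so the helpers work without it

    with pdfplumber.open(path) as pdf:
        for text in extract_text_without_graph_pages(pdf.pages, margin=50):
            print(text)
```

Calling `main("file.pdf")` would print the text of the pages that pass the check. Note that this skips a flagged page entirely, so any ordinary text on the same page as a graph is lost too.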

Option 2:

Following @G5W's comment, it is possible to convert the PDF to an MS Word file by using pywin32 to open the PDF in Word, and then extract only the text with python-docx, for example.
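
A rough sketch of what this could look like is given below. It assumes Windows with Microsoft Word installed, plus the pywin32 and python-docx packages. The function names are hypothetical, and I have not verified how Word behaves on every PDF (it may still show a conversion prompt on some setups). `FileFormat=16` is Word's `wdFormatDocumentDefault` constant, which saves as .docx.

```python
import os


def pdf_to_docx(pdf_path, docx_path):
    """Open a PDF in Word via COM and save it as .docx (Windows + Word only)."""
    import win32com.client  # provided by pywin32

    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False
    try:
        # Word's COM interface expects absolute paths.
        doc = word.Documents.Open(os.path.abspath(pdf_path),
                                  ConfirmConversions=False)
        doc.SaveAs2(os.path.abspath(docx_path), FileFormat=16)
        doc.Close(False)  # close without saving further changes
    finally:
        word.Quit()


def non_empty_texts(paragraphs):
    """Collect stripped text from paragraph-like objects, skipping blanks."""
    return [p.text.strip() for p in paragraphs if p.text.strip()]


def docx_text(docx_path):
    """Return the body paragraph text of a .docx file."""
    from docx import Document  # provided by python-docx

    return "\n".join(non_empty_texts(Document(docx_path).paragraphs))
```

Typical use would be `pdf_to_docx("file.pdf", "file.docx")` followed by `docx_text("file.docx")`. `Document.paragraphs` only covers body paragraphs, so text inside tables would need to be read separately.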
