
How to extract mathematical expressions from a PDF using Python?

I have a PDF that contains math equations.

I am trying to extract the objective questions from a PDF file and convert them into a CSV file using Python, so that each row contains a question, its four options in separate columns, and the correct option (six columns in total). The PDF also contains mathematical equations, which I can't write into the CSV file as they appear. Is it possible to write those equations into my CSV file exactly as they appear in the PDF?

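This depends on how the formula is represented in the PDF. It can be an XObject, an inline image, or Unicode text. Only Unicode text can be written into a CSV cell directly; a formula stored as an image has to be saved separately or converted to text first.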

Try pdfreader. It can extract plain text, text containing PDF commands, and images from PDF documents.

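from pdfreader import SimplePDFViewer, PageDoesNotExist

# Placeholder path: replace with your own PDF file
pdf_file_name = "questions.pdf"

plain_text = ""    # decoded text strings from every page
pdf_markdown = ""  # text content including PDF commands
images = []        # inline images and image XObjects

# Keep the file open for as long as the viewer reads from it
with open(pdf_file_name, "rb") as fd:
    viewer = SimplePDFViewer(fd)
    try:
        while True:
            viewer.render()  # parse the current page
            pdf_markdown += viewer.canvas.text_content
            plain_text += "".join(viewer.canvas.strings)
            images.extend(viewer.canvas.inline_images)
            images.extend(viewer.canvas.images.values())
            viewer.next()  # raises PageDoesNotExist after the last page
    except PageDoesNotExist:
        pass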
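
If the formulas turn out to be Unicode text, the extracted strings can go straight into the six-column CSV described in the question. The following is only a minimal sketch using the standard csv module: the function name write_questions_csv, the column names, and the sample row are all hypothetical, and splitting plain_text into questions and options is left out because it depends on the layout of the particular PDF.

```python
import csv

# Hypothetical column names for the six-column layout from the question
HEADER = ["question", "option_a", "option_b", "option_c", "option_d", "correct_option"]


def write_questions_csv(rows, path):
    """Write six-column question rows to a UTF-8 encoded CSV file."""
    # utf-8-sig adds a byte order mark so spreadsheet tools detect the encoding;
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.writer(f)
        writer.writerow(HEADER)
        for row in rows:
            if len(row) != len(HEADER):
                raise ValueError(f"expected {len(HEADER)} columns, got {len(row)}")
            writer.writerow(row)


# Made-up sample data: a question whose formula is plain Unicode text
sample_rows = [
    ["What is the value of ∫₀¹ x² dx?", "1/2", "1/3", "1/4", "1", "B"],
]
write_questions_csv(sample_rows, "questions.csv")
```

Formulas that come out as images cannot be stored this way; one option is to save each image to a file and put its file name in the cell instead.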
