Extract text from PDF to file

Question

This question is probably a duplicate, but none of the answers to similar questions helped me. I'm looking for a simple way to extract text from a PDF file into any other type of file or structure that will let me use it. The text I want to extract appears on pages 78-79. At the end of the process, I want to write each cell of the table on a separate line in a .txt file. For example, I want to turn the first row of the table from this:

to this:

0x00
Channel standby
CH_7
CH_6
CH_5
CH_4
CH_3
CH_2
CH_1
CH_0
0x00
RW

I'm using Visual Studio 2017, but I can also work in PyCharm instead.

I've tried all the options suggested in this question and here, but I'm having problems installing the required libraries on Windows 10. I'm also not sure whether those libraries are still maintained and supported. I'd appreciate it if anyone could point me to up-to-date material on this subject or to a relevant library.

Thank you.

Answer 1

Here's something using PyMuPDF (pip install pymupdf).

In this example, get_document_bytes simply makes a request to the PDF resource at the URL you provided (using the third-party requests module) and returns the PDF bytes. In main, we use those bytes to create a fitz.Document instance via the stream parameter. You could also just download the PDF file manually and provide a filename instead of a stream of bytes, but I didn't feel like doing that. We then grab a specific page from the document and print all the text on that page:

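def get_document_bytes():
    import requests

    url = "https://www.mouser.co.il/datasheet/2/609/AD7768-7768-4-1502035.pdf"

    # Send a browser-like user agent with the request.
    headers = {
        "user-agent": "Mozilla/5.0"
    }

    response = requests.get(url, headers=headers)
    response.raise_for_status()

    # The raw bytes of the PDF file.
    return response.content


def main():

    import fitz  # PyMuPDF

    desired_page = 78

    # Open the PDF from memory rather than from a file on disk.
    doc = fitz.Document(stream=get_document_bytes(), filetype="pdf")
    # Pages are zero-based. Current PyMuPDF releases use snake_case names
    # (load_page, get_text) in place of the older loadPage and getText.
    page = doc.load_page(page_id=desired_page - 1)

    print(page.get_text())

    return 0


if __name__ == "__main__":
    import sys
    sys.exit(main())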

Output:

AD7768/AD7768-4 
Data Sheet
 
Rev. B | Page 78 of 105 
AD7768 REGISTER MAP DETAILS (SPI CONTROL) 
AD7768 REGISTER MAP 
See Table 63 and the AD7768-4 Register Map Details (SPI Control) section for the AD7768-4 register map and register functions. 
Table 37. Detailed AD7768 Register Map  
Reg. 
Name 
Bit 7 
Bit 6 
Bit 5 
Bit 4 
Bit 3 
Bit 2 
Bit 1 
Bit 0 
Reset RW 
0x00 
Channel standby 
CH_7 
CH_6 
CH_5 
CH_4 
CH_3 
CH_2 
CH_1 
CH_0 
0x00 
RW 
...

I realize you want the text from two pages, not just one - and you also don't want all the text from these pages, just the stuff that's in the table. This is just to get you started - I may tinker around with this a bit more, and update my post later.
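In the meantime, here is a rough sketch of how the two-page, one-cell-per-line part could look. It assumes the PDF has been downloaded to disk and that PyMuPDF keeps returning each cell on its own line, as in the output above. The function names and the file names mentioned below are my own placeholders, not anything from the datasheet or the library. Note that it does not isolate the table: page headers, footers and the table caption end up in the file too, and a header cell like "Reset RW" stays on a single line.

```python
from pathlib import Path


def text_to_cells(page_text):
    """Split raw page text into stripped, non-empty lines (one per cell)."""
    return [line.strip() for line in page_text.splitlines() if line.strip()]


def extract_cells(pdf_path, first_page, last_page):
    """Collect lines from a 1-based, inclusive page range of a PDF file."""
    import fitz  # PyMuPDF; imported here so text_to_cells works without it

    cells = []
    with fitz.open(pdf_path) as doc:
        for page_number in range(first_page, last_page + 1):
            page = doc.load_page(page_number - 1)  # PyMuPDF pages are 0-based
            cells.extend(text_to_cells(page.get_text()))
    return cells


def write_cells(cells, out_path):
    """Write one cell per line to a UTF-8 text file."""
    Path(out_path).write_text("\n".join(cells) + "\n", encoding="utf-8")
```

Calling write_cells(extract_cells("AD7768.pdf", 78, 79), "registers.txt") would then produce the text file, where both file names are placeholders for whatever you saved the datasheet as and wherever you want the result.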
