
Using Python PyMuPDF (fitz) to iterate through lines, check each line's length, and add a period if it meets the criteria

I am trying to iterate through each line of a page with the PyMuPDF library and check the length of each line. If a line has fewer than 10 words, I would like to add a full stop. Pseudocode would be:

# loop through the lines of the PDF
# check number of words in line
# if line has fewer than 10 words
# add period

Real code below:

import fitz  # PyMuPDF

myfile = "my.pdf"
doc = fitz.open(myfile)
for page in doc:
    # get_text() is the current name; older PyMuPDF releases called it getText()
    text = page.get_text("text")
    print(text)

When I add another for loop, e.g. for line in page:

I get an error saying page is not iterable. Is there another way to do this?

Thanks

To iterate over the lines of a page, you can go through the page's display list:

# Older PyMuPDF releases spell these getDisplayList() and getTextPage()
page_display = page.get_displaylist()
dictionary_elements = page_display.get_textpage().extractDICT()
for block in dictionary_elements['blocks']:
    for line in block['lines']:
        # Rebuild the line's text from its spans
        line_text = ''
        for span in line['spans']:
            line_text += ' ' + span['text']
        print(line_text)
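
Putting this together with the original goal, here is a minimal sketch of the whole task, not a definitive implementation. It assumes the snake_case PyMuPDF API (page.get_text("dict")), and the helper names line_texts, add_period_if_short and process_pdf are made up for illustration. Two further assumptions go beyond the question: spans are joined without an extra separator because they usually carry their own spaces, and lines that already end in ".", "!" or "?" are left unchanged.

```python
def line_texts(page_dict):
    """Yield the text of each line in a dict shaped like page.get_text("dict")."""
    for block in page_dict.get("blocks", []):
        # Image blocks have no "lines" key, so fall back to an empty list.
        for line in block.get("lines", []):
            # Assumption: spans carry their own spacing, so join them directly.
            yield "".join(span["text"] for span in line["spans"]).strip()


def add_period_if_short(line, max_words=10):
    """Append a period when the line has fewer than max_words words."""
    words = line.split()
    # Assumption: empty lines and lines already ending a sentence stay as they are.
    if words and len(words) < max_words and not line.endswith((".", "!", "?")):
        return line + "."
    return line


def process_pdf(path):
    """Return every line of the PDF, with a period added to the short ones."""
    import fitz  # PyMuPDF; imported here so the helpers above work without it

    result = []
    with fitz.open(path) as doc:
        for page in doc:
            for text in line_texts(page.get_text("dict")):
                result.append(add_period_if_short(text))
    return result
```

Calling process_pdf("my.pdf") returns the adjusted lines as a list of strings. It does not write anything back into the PDF; it only changes the extracted text.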


 