I am trying to iterate through each line of a page using the PyMuPDF library and check the length of each line: if a line has fewer than 10 words, I would like to add a full stop. Pseudocode would be:
#loop through the lines of the PDF
#check number of words in line
#if line has less than 10 words
#add period
Real code below:
import fitz  # PyMuPDF

myfile = "my.pdf"
doc = fitz.open(myfile)
page = doc[0]
for page in doc:
    text = page.getText("text")
    print(text)
When I add another for loop, e.g. for line in page:, I get an error saying page is not iterable. Is there any other way I can do this?
Thanks
To iterate over the lines of a page, you can use getDisplayList:
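# Build a display list for the page, then extract its text as a nested dict.
page_display = page.getDisplayList()
dictionary_elements = page_display.getTextPage().extractDICT()
for block in dictionary_elements['blocks']:
    # Image blocks carry no 'lines' key, so fall back to an empty list.
    for line in block.get('lines', []):
        line_text = ''
        # A line is split into spans wherever the font or style changes.
        for span in line['spans']:
            line_text += ' ' + span['text']
        print(line_text)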
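Building on the loop above, here is a minimal sketch of the original goal (add a full stop to lines with fewer than 10 words). The helper names punctuate_short_lines and lines_from_pdf and the max_words parameter are made up for illustration. It assumes the dict layout shown above (blocks, then lines, then spans), skips lines that already end in sentence punctuation, and joins spans with a space as in the snippet above, which may not suit every PDF. It also assumes a recent PyMuPDF release where the method is spelled get_text; older releases use getText.

```python
def punctuate_short_lines(page_dict, max_words=10):
    """Return the text lines of a page dict, adding '.' to short lines."""
    result = []
    for block in page_dict.get("blocks", []):
        # Image blocks have no "lines" key, so default to an empty list.
        for line in block.get("lines", []):
            # Join the spans with a space, then split on whitespace to get words.
            words = " ".join(span["text"] for span in line["spans"]).split()
            if not words:
                continue  # skip lines that contain only whitespace
            text = " ".join(words)
            # Fewer than max_words and not already terminated: add a full stop.
            if len(words) < max_words and not text.endswith((".", "!", "?")):
                text += "."
            result.append(text)
    return result


def lines_from_pdf(path, max_words=10):
    """Yield the processed lines of every page in the PDF at `path`."""
    import fitz  # PyMuPDF; imported here so the helper above works without it

    doc = fitz.open(path)
    for page in doc:
        # Assumption: recent PyMuPDF; on older releases use page.getText("dict").
        yield from punctuate_short_lines(page.get_text("dict"), max_words)


if __name__ == "__main__":
    import os

    if os.path.exists("my.pdf"):  # file name taken from the question
        for text_line in lines_from_pdf("my.pdf"):
            print(text_line)
```

Keeping the word-count logic separate from the PDF access means it can be tried on a hand-written dict before pointing it at a real file.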