Using Python PyMuPDF (fitz) to iterate through lines, check each line's length, and add a period if it meets the criteria

I am trying to iterate through each line of a page with the PyMuPDF library and check the length of each sentence; if a line has fewer than 10 words, I would like to add a full stop. Pseudocode would be:

#loop through the lines of the PDF
#check number of words in line
#if line has less than 10 words 
#add period 

Real code below:

import fitz  # PyMuPDF

myfile = "my.pdf"
doc = fitz.open(myfile)
for page in doc:
    # recent PyMuPDF releases spell this method page.get_text("text")
    text = page.getText("text")
    print(text)

When I add another for loop, e.g. for line in page:, I get an error saying page is not iterable. Is there any other way I can do this?

Thanks

To iterate over the lines of a page, you can use getDisplayList:

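page_display = page.getDisplayList()
dictionary_elements = page_display.getTextPage().extractDICT()
for block in dictionary_elements['blocks']:
    # image blocks carry no 'lines' key, so fall back to an empty list
    for line in block.get('lines', []):
        # join the text of every span that makes up this line
        line_text = ' '.join(span['text'] for span in line['spans'])
        print(line_text)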
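
The loop above only prints each line; the "fewer than 10 words, then add a period" step from the question could be sketched roughly as follows. The helper names (add_period_if_short, punctuate_text, punctuate_pdf) and the max_words parameter are hypothetical, not part of PyMuPDF. Skipping lines that already end in '.', '!' or '?' is an assumption about the desired behaviour, and page.get_text is the newer spelling of the getText call used in the question.

```python
def add_period_if_short(line, max_words=10):
    """Return the line with a trailing period if it has fewer than max_words words."""
    stripped = line.rstrip()
    words = stripped.split()
    # Assumption: empty lines and lines that already end a sentence stay unchanged.
    if not words or stripped.endswith(('.', '!', '?')):
        return stripped
    if len(words) < max_words:
        return stripped + '.'
    return stripped


def punctuate_text(text, max_words=10):
    """Apply add_period_if_short to every line of a block of extracted text."""
    return '\n'.join(add_period_if_short(line, max_words)
                     for line in text.splitlines())


def punctuate_pdf(path, max_words=10):
    """Yield the punctuated text of each page of the PDF at path."""
    import fitz  # PyMuPDF, imported lazily so the helpers above work without it
    doc = fitz.open(path)
    for page in doc:
        # older PyMuPDF releases spell this page.getText("text")
        yield punctuate_text(page.get_text("text"), max_words)
```

For example, `for page_text in punctuate_pdf("my.pdf"): print(page_text)` would print every page with short lines terminated by a period. Note that this only changes the extracted text; it does not write anything back into the PDF itself.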

