Efficiently extract the highlighted portion from PDFs using PyMuPDF python?
How to extract data from unstructured PDFs using PyMuPDF in Python?
I am following a guide on how to extract data from unstructured PDFs using PyMuPDF.
https://www.analyticsvidhya.com/blog/2021/06/data-extraction-from-unstructured-pdfs/
When I follow the code I get an AttributeError: 'NoneType' object has no attribute 'rect' error. I am new to Python, so I am not sure what is going on.
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-2-7f394b979351> in <module>
1 first_annots=[]
2
----> 3 rec=page1.first_annot.rect
4
5 rec
AttributeError: 'NoneType' object has no attribute 'rect'
Code
import fitz
import pandas as pd
doc = fitz.open('Mansfield--70-21009048 - ConvertToExcel.pdf')
page1 = doc[0]
words = page1.get_text("words")
words[0]
first_annots=[]
rec=page1.first_annot.rect
rec
#Information of words in first object is stored in mywords
mywords = [w for w in words if fitz.Rect(w[:4]) in rec]
ann= make_text(mywords)
first_annots.append(ann)
def make_text(words):
    line_dict = {}
    words.sort(key=lambda w: w[0])
    for w in words:
        y1 = round(w[3], 1)
        word = w[4]
        line = line_dict.get(y1, [])
        line.append(word)
        line_dict[y1] = line
    lines = list(line_dict.items())
    lines.sort()
    return "\n".join([" ".join(line[1]) for line in lines])
print(rec)
print(first_annots)
After this line:
doc = fitz.open('Mansfield--70-21009048 - ConvertToExcel.pdf')
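add the following to check whether the PDF contains any annotations. Your PDF may have no annotations at all, in which case page.first_annot is None, and accessing .rect on it raises the AttributeError.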
if doc.has_annots():
    print("has annots")
else:
    print("no annots")
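The same check can also be made per page, since a document can have annotations on some pages and none on others. Here is a minimal sketch, assuming only that the page object exposes first_annot the way the question's page1 does. The helper name first_annot_rect is made up for illustration and is not part of PyMuPDF, and the stand-in objects below exist only so the sketch runs without a PDF file.

```python
from types import SimpleNamespace


def first_annot_rect(page):
    """Return the rect of the page's first annotation, or None if there is none.

    Hypothetical helper (not a PyMuPDF function). It relies only on
    page.first_annot being None when the page has no annotations.
    """
    annot = page.first_annot
    if annot is None:
        return None
    return annot.rect


# Stand-in objects that mimic a page with and without an annotation.
# With a real file you would pass doc[0] from fitz.open(...) instead.
page_without = SimpleNamespace(first_annot=None)
page_with = SimpleNamespace(first_annot=SimpleNamespace(rect=(10, 20, 110, 40)))

for label, page in (("without", page_without), ("with", page_with)):
    rec = first_annot_rect(page)
    if rec is None:
        print(f"page {label} annotation: nothing to extract")
    else:
        print(f"page {label} annotation: rect = {rec}")
```

In the question's code, this would mean running the word filter and make_text only when the returned rectangle is not None.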