Why is running this Python script taking all my disk space?
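I am running a Python script, shown below for reference. The script uses pytesseract to convert the text in images obtained from a PDF into a JSON file that contains the text as strings, along with the page number and so on. The problem is that after the script has run for a while, the disk space on my computer gets used up, and it is only freed after I restart the computer. For example, my computer has 20 GB free right now, but after the script runs for a while the disk is full, and I don't know why. In case local variables were holding the space, I tried using 'del' to free them, and I also tried gc.collect() to force a collection, but neither had any effect. What am I doing wrong, and how can I improve this?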
```python
import io
from PIL import Image
import pytesseract
from wand.image import Image as wi
import gc
import json
import time  # needed for the timing in the driver loop below
import uuid


def generate_id(code):
    # Build an id from the country code plus 7 digits taken from a random UUID.
    increment_no = str(uuid.uuid4().int)[5:12]
    _id = code + increment_no
    return _id


def pdf_to_json(pdf_path):
    """This function takes in the path of pdf to generate a json object with the following attributes

    Company (Name of company), id (Unique Id), Page_*No. (Example Page_1, Page_2 etc.) with each page containing text in that specific pdf page
    """
    data = {}
    pdf = wi(filename=pdf_path, resolution=300)
    data['company'] = str(pdf_path.split('/')[-1:][0])
    countrycode = str(pdf_path.split('/')[-2:-1][0].split('_')[0:1][0])
    data['id'] = generate_id(countrycode)
    pdfImg = pdf.convert('jpeg')
    del pdf
    gc.collect()
    # Render every page to a JPEG blob held in memory.
    imgBlobs = []
    for img in pdfImg.sequence:
        page = wi(image=img)
        gc.collect()
        imgBlobs.append(page.make_blob('jpeg'))
        del page
        gc.collect()
    del pdfImg
    gc.collect()
    i = 1
    # Run OCR on each page blob and collect the text.
    Pages = []
    for imgBlob in imgBlobs:
        im = Image.open(io.BytesIO(imgBlob))
        text = pytesseract.image_to_string(im, lang='eng')
        Pages.append(text)
        del text
        gc.collect()
        im.close()
        del im
        gc.collect()
    del imgBlobs
    gc.collect()
    data['Pages'] = Pages
    with open('/Users/rishabh/Desktop/CyberBoxer/hawaii_pdf/'+data['id']+'.json', 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=4)
    del data
    gc.collect()
    del Pages
    gc.collect()


from os import listdir

onlyfiles = [f for f in listdir('/Users/rishabh/Desktop/CyberBoxer/iowa_pdf/')]
j = 1
for i in onlyfiles:
    if '.pdf' in i:
        start = time.time()
        pdf_path = '/Users/rishabh/Desktop/CyberBoxer/iowa_pdf/'+i
        pdf_to_json(pdf_path)
        print(j)
        j += 1
        end = time.time()
        print(end-start)
        gc.collect()
```
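I figured out why this was happening: it was caused by the wand Image module. The objects were not being released by 'del' or gc.collect(); I had to destroy them explicitly, because a wand image has its own destroy method. Wand wraps ImageMagick, which can spill large images to temporary files on disk, so images that were never destroyed were probably what kept that space occupied.
Here is the updated version of the same function: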
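The difference between dropping a reference and explicitly destroying a resource can be illustrated with a small stand-in class. This is only a sketch: `ScratchResource` and both helper functions are hypothetical names invented for this example, and the class is not how wand is implemented. It just owns a scratch file that is removed only by `destroy()`, which the `with` statement calls on exit.

```python
import os
import tempfile


class ScratchResource:
    """Hypothetical stand-in for an object that owns a scratch file on disk."""

    def __init__(self):
        # Create a scratch file; nothing in this class deletes it automatically.
        fd, self.path = tempfile.mkstemp(prefix="scratch-")
        os.close(fd)

    def destroy(self):
        # Release the resource explicitly; safe to call more than once.
        if self.path is not None and os.path.exists(self.path):
            os.remove(self.path)
        self.path = None

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.destroy()
        return False  # do not swallow exceptions


def scratch_path_after_del():
    """Create a resource, drop the reference with del, return the leftover path."""
    res = ScratchResource()
    path = res.path
    del res  # the class has no __del__, so the file stays on disk
    return path


def scratch_path_after_with():
    """Create a resource inside a with block, return its path after the block."""
    with ScratchResource() as res:
        path = res.path
    return path
```

In this sketch the file returned by `scratch_path_after_del()` still exists and has to be removed by hand, while the one from `scratch_path_after_with()` is already gone when the function returns.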
```python
# WI is assumed to be wand's Image class imported under this alias.
from wand.image import Image as WI


def pdf_to_json(pdf_path):
    """This function takes in the path of pdf to generate a json object with the following attributes

    Company (Name of company), id (Unique Id), Page_*No. (Example Page_1, Page_2 etc.) with each page containing text in that specific pdf page
    """
    data = {}
    #pdf=wi(filename=pdf_path,resolution=300)
    data['company'] = str(pdf_path.split('/')[-1:][0])
    countrycode = str(pdf_path.split('/')[-2:-1][0].split('_')[0:1][0])
    data['id'] = generate_id(countrycode)
    #pdfImg=pdf.convert('jpeg')
    #del pdf
    #gc.collect()
    #imgBlobs=[]
    #for img in pdfImg.sequence:
    #    page=wi(image=img)
    #    gc.collect()
    #    imgBlobs.append(page.make_blob('jpeg'))
    #    del page
    #    gc.collect()
    req_image = []
    # The with blocks release the source image and each page image on exit.
    with WI(filename=pdf_path, resolution=150) as image_jpeg:
        image_jpeg.compression_quality = 99
        image_jpeg = image_jpeg.convert('jpeg')
        for img in image_jpeg.sequence:
            with WI(image=img) as img_page:
                req_image.append(img_page.make_blob('jpeg'))
    # The name was rebound to the converted copy, so destroy that one explicitly.
    image_jpeg.destroy()
    i = 1
    Pages = []
    for imgBlob in req_image:
        im = Image.open(io.BytesIO(imgBlob))
        text = pytesseract.image_to_string(im, lang='eng')
        Pages.append(text)
        im.close()
        del im
    data['Pages'] = Pages
    with open('/Users/rishabh/Desktop/CyberBoxer/iowa_pdf/'+data['id']+'.json', 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=4)
```
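To check whether a given version of `pdf_to_json` still leaves files behind, one option is to measure free disk space and the size of leftover temporary files around each call. The following is a minimal sketch using only the standard library. The helper names (`free_bytes`, `temp_files_size`, `measure`) are made up for this example, and the `magick-*` file name pattern is an assumption about how ImageMagick names its scratch files, so adjust it to whatever actually shows up in your temp directory.

```python
import glob
import os
import shutil
import tempfile


def free_bytes(path="/"):
    """Return the free bytes on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free


def temp_files_size(pattern="magick-*", directory=None):
    """Sum the sizes of files in `directory` whose names match `pattern`.

    The "magick-*" default is an assumption, not a documented guarantee.
    """
    directory = directory or tempfile.gettempdir()
    total = 0
    for name in glob.glob(os.path.join(directory, pattern)):
        try:
            if os.path.isfile(name):
                total += os.path.getsize(name)
        except OSError:
            # The file vanished between listing and stat; ignore it.
            pass
    return total


def measure(func, *args, **kwargs):
    """Call func and report how free space and temp-file size changed."""
    tmp = tempfile.gettempdir()
    free_before = free_bytes(tmp)
    temp_before = temp_files_size()
    result = func(*args, **kwargs)
    report = {
        "free_delta": free_bytes(tmp) - free_before,
        "temp_delta": temp_files_size() - temp_before,
    }
    return result, report
```

For example, `result, report = measure(pdf_to_json, pdf_path)` followed by `print(report)` inside the loop over the PDF files would show whether `temp_delta` keeps growing from one file to the next.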