Python memory leak (causing MemoryError): memory keeps increasing even after calling the garbage collector and deleting large variables
import numpy as np
import pandas as pd
import pdfplumber
import os
import psutil
import gc

file = 'path.pdf'
pdf = pdfplumber.open(file)
pages = pdf.pages
print('Total pages in pdf = '+str(len(pages)))

startPage = 3
chunkSize = 50
while(startPage < 250):
    print('Iteration')
    print('Memory at the start : ',end='')
    print((psutil.Process(os.getpid()).memory_info().rss)/(1024 * 1024),' MBs')
    endPage = startPage + chunkSize
    extract_pages = pages[startPage: endPage]
    print(str(extract_pages[0])," to ",str(extract_pages[-1]))
    # Build one DataFrame per chunk of pages
    df = pd.DataFrame()
    for page in extract_pages:
        df = pd.concat([df,pd.DataFrame(np.array(page.extract_table()))], axis = 0)
        del page
    # df.to_csv()
    # Drop the chunk and force a collection; memory still grows
    del df, extract_pages
    gc.collect()
    print('Memory at the end : ',end='')
    print((psutil.Process(os.getpid()).memory_info().rss)/(1024 * 1024),' MBs',end='\n\n')
    startPage = startPage + chunkSize
print('finished')
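OUTPUT:
Total pages in pdf = 17225
Iteration
Memory at the start : 818.91015625 MBs
Page 4 to Page 53
Memory at the end : 819.61328125 MBs

Iteration
Memory at the start : 819.61328125 MBs
Page 54 to Page 103
Memory at the end : 963.703125 MBs

Iteration
Memory at the start : 963.703125 MBs
Page 104 to Page 153
Memory at the end : 1324.65625 MBs

Iteration
Memory at the start : 1324.65625 MBs
Page 154 to Page 203
Memory at the end : 1686.01171875 MBs

Iteration
Memory at the start : 1686.01171875 MBs
Page 204 to Page 253
Memory at the end : 2047.60546875 MBs

finished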
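To put numbers on the growth, the readings in the output above can be differenced. This is only a quick sketch; the variable names are illustrative and not part of the script.

```python
# RSS readings in MB copied from the output above: the first "start" value
# followed by the "end" value of each iteration.
readings_mb = [818.91015625, 819.61328125, 963.703125,
               1324.65625, 1686.01171875, 2047.60546875]
chunk_size = 50  # pages per iteration, as in the script

# Growth of each iteration = reading after the chunk minus reading before it.
growth_per_chunk = [after - before
                    for before, after in zip(readings_mb, readings_mb[1:])]

for number, growth in enumerate(growth_per_chunk, start=1):
    print(f"iteration {number}: +{growth:.1f} MB "
          f"({growth / chunk_size:.2f} MB per page)")
```

The last three iterations each add roughly 361 MB, about 7 MB per page, and none of it is released by del or gc.collect().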
(Extracting text from a PDF)
Check out this GitHub issue.
I ran your code as-is on a 170-page, 3.1 MB PDF. It ended with:
Memory at the end : 1321.90625 MBs
Finished duration 55.18302297592163 secs.
Opening the PDF with a context manager and calling page.flush_cache() after each page, I get:
Memory at the end : 90.8125 MBs
Finished duration 68.00025987625122 secs.
Yes, it is slower, as mentioned in the GitHub issue, but at least the memory leak is much smaller.
Working snippet (I removed the chunking for this example):
import numpy as np
import pdfplumber
import os
import psutil
import gc
import pandas as pd
from time import time

start = time()

with pdfplumber.open("file.pdf") as pdf:
    pages_len = len(pdf.pages)
    print(f'Total pages in pdf = {pages_len}')

# The context manager closes the PDF when the block exits
with pdfplumber.open("file.pdf") as pdf:
    df = pd.DataFrame()
    print('Memory at the start : ',end='')
    for index, page in enumerate(pdf.pages):
        # Report memory every 10 pages
        if not index % 10:
            print(f'=== Page index {index} === ')
            print((psutil.Process(os.getpid()).memory_info().rss)/(1024 * 1024),' MBs')
        table = page.extract_table()
        df = pd.concat([df,pd.DataFrame(np.array(table))], axis = 0)
        # Release what pdfplumber cached for this page before moving on
        page.flush_cache()
        gc.collect()
    print('Memory at the end : ',end='')
    print((psutil.Process(os.getpid()).memory_info().rss)/(1024 * 1024),' MBs',end='\n\n')
print(f'Finished duration {time() - start} secs.')
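If you still want the chunked processing from the question, the same idea can be combined with writing each chunk to disk so that no large DataFrame accumulates. The following is only a sketch under assumptions: extract_in_chunks, out_path and chunk_size are names I made up, it uses the standard csv module instead of pandas, and it relies only on the page objects exposing extract_table() and flush_cache() as pdfplumber pages do. I have not timed it against the snippet above.

```python
import csv
import gc


def extract_in_chunks(pages, out_path, chunk_size=50):
    """Append table rows from `pages` to a CSV file, one chunk at a time.

    `pages` is any sequence of objects with extract_table() and
    flush_cache() methods (hypothetical helper, not part of pdfplumber).
    Returns the number of rows written.
    """
    rows_written = 0
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        for start in range(0, len(pages), chunk_size):
            for page in pages[start:start + chunk_size]:
                # extract_table() returns None when no table is found
                table = page.extract_table() or []
                writer.writerows(table)
                rows_written += len(table)
                # Drop the objects cached for this page before the next one
                page.flush_cache()
            # Collect once per chunk rather than once per page
            gc.collect()
    return rows_written


# Intended use with pdfplumber (file names are placeholders):
#
#     import pdfplumber
#     with pdfplumber.open("file.pdf") as pdf:
#         extract_in_chunks(pdf.pages, "tables.csv", chunk_size=50)
```

Because rows are written as soon as they are extracted, memory use should depend on the size of one page's table rather than on the whole document; whether that holds for a given PDF is worth checking with the psutil measurement used above.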