
Python memory leak (causing MemoryError): memory keeps growing even after calling the garbage collector and deleting large variables


import numpy as np
import pandas as pd
import pdfplumber
import os
import psutil
import gc

file = 'path.pdf'
pdf = pdfplumber.open(file)
pages = pdf.pages
print('Total pages in pdf = '+str(len(pages)))

startPage = 3
chunkSize = 50

while(startPage < 250):
    print('Iteration')
    print('Memory at the start : ',end='')
    print((psutil.Process(os.getpid()).memory_info().rss)/(1024 * 1024),' MBs')

    endPage = startPage + chunkSize
    extract_pages = pages[startPage: endPage] 
    print(str(extract_pages[0])," to ",str(extract_pages[-1]))
    
    df = pd.DataFrame()
    for page in extract_pages:
        df = pd.concat([df,pd.DataFrame(np.array(page.extract_table()))], axis = 0)
        del page
        
#   df.to_csv()
    del df, extract_pages
    gc.collect()
    print('Memory at the end : ',end='')
    print((psutil.Process(os.getpid()).memory_info().rss)/(1024 * 1024),' MBs',end='\n\n')
    
    startPage = startPage + chunkSize 

print('finished')

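OUTPUT:
Total pages in pdf = 17225
Iteration
Memory at the start : 818.91015625  MBs
<Page:4>  to  <Page:53>
Memory at the end : 819.61328125  MBs

Iteration
Memory at the start : 819.61328125  MBs
<Page:54>  to  <Page:103>
Memory at the end : 963.703125  MBs

Iteration
Memory at the start : 963.703125  MBs
<Page:104>  to  <Page:153>
Memory at the end : 1324.65625  MBs

Iteration
Memory at the start : 1324.65625  MBs
<Page:154>  to  <Page:203>
Memory at the end : 1686.01171875  MBs

Iteration
Memory at the start : 1686.01171875  MBs
<Page:204>  to  <Page:253>
Memory at the end : 2047.60546875  MBs

finished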
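
To put the output above in perspective, here is a rough back-of-the-envelope calculation. It only uses the memory readings printed by the script; the assumption that growth stays linear over the whole document is mine and is not something the output proves.

```python
# RSS readings (MB) copied from the output above: the first value is the
# start of iteration 1, the rest are the end of each 50-page iteration.
readings_mb = [818.91015625, 819.61328125, 963.703125,
               1324.65625, 1686.01171875, 2047.60546875]
chunk_size = 50  # pages per iteration, as in the script

growth_per_chunk = [later - earlier
                    for earlier, later in zip(readings_mb, readings_mb[1:])]

# The last three iterations grow by a nearly constant amount, so use them
# to estimate the steady-state memory retained per page.
steady = growth_per_chunk[-3:]
mb_per_page = sum(steady) / len(steady) / chunk_size

# Assumption: growth stays linear for the whole 17225-page document.
projected_gb = mb_per_page * 17225 / 1024

print([round(g, 1) for g in growth_per_chunk])
print(f"about {mb_per_page:.1f} MB retained per page")
print(f"about {projected_gb:.0f} GB if all 17225 pages were processed")
```

After the first two iterations, each 50-page chunk adds roughly the same amount of memory, which suggests that something is being retained per page rather than per DataFrame.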

(Extracting text from a PDF)

Check out this issue.

I ran your code as-is on a 170-page, 3.1 MB PDF. It ended with:

Memory at the end : 1321.90625  MBs

Finished duration 55.18302297592163 secs.

Opening the PDF with a context manager and calling page.flush_cache() after each page, I got:

Memory at the end : 90.8125  MBs

Finished duration 68.00025987625122 secs.

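Yes, it is slower (about 68 s versus 55 s here), as noted in the GitHub issue, but at least the memory leak is much smaller (about 91 MB versus 1322 MB at the end).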

Working snippet (I removed the chunking for this example):

import numpy as np
import pdfplumber
import os
import psutil
import gc
import pandas as pd
from time import time

start = time()

with pdfplumber.open("file.pdf") as pdf:
    pages_len = len(pdf.pages)

print(f'Total pages in pdf = {pages_len}')

with pdfplumber.open("file.pdf") as pdf:
    df = pd.DataFrame()
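    print('Memory at the start : ',end='')
    print((psutil.Process(os.getpid()).memory_info().rss)/(1024 * 1024),' MBs')
    for index, page in enumerate(pdf.pages):
        if not index % 10:
            print(f'=== Page index {index} === ')

        # Report memory use before processing each page
        print((psutil.Process(os.getpid()).memory_info().rss)/(1024 * 1024),' MBs')

        table = page.extract_table()
        df = pd.concat([df,pd.DataFrame(np.array(table))], axis = 0)
        # Release the objects pdfplumber cached for this page
        page.flush_cache()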

    gc.collect()
    print('Memory at the end : ',end='')
    print((psutil.Process(os.getpid()).memory_info().rss)/(1024 * 1024),' MBs',end='\n\n')


print(f'Finished duration {time() - start} secs.')
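
If you still want to work through the PDF in page ranges, as in the question, one possible variation is sketched below. It is not part of the code I timed above: the helper name extract_tables_to_csv and the CSV file name are hypothetical, and it streams rows to disk with the standard csv module instead of growing a DataFrame. It only assumes that each page offers extract_table() and flush_cache(), as pdfplumber pages do.

```python
import csv


def extract_tables_to_csv(pages, csv_path, start=0, stop=None):
    """Append every table row from pages[start:stop] to csv_path.

    `pages` is any sequence whose items offer extract_table() and
    flush_cache(), as pdfplumber pages do. Returns the number of rows written.
    """
    rows_written = 0
    with open(csv_path, "a", newline="") as handle:
        writer = csv.writer(handle)
        for page in pages[start:stop]:
            table = page.extract_table()  # None when no table is found
            if table:
                writer.writerows(table)
                rows_written += len(table)
            # Release cached page objects before moving to the next page
            page.flush_cache()
    return rows_written


# Hypothetical usage with pdfplumber (path and page range are placeholders):
# import pdfplumber
# with pdfplumber.open("file.pdf") as pdf:
#     extract_tables_to_csv(pdf.pages, "tables.csv", start=3, stop=253)
```

Because rows go straight to the file, nothing accumulates in Python between pages, so the only per-page state left is whatever pdfplumber itself keeps, and flush_cache() is meant to release that. I have not benchmarked this variant.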

