Python Memory Leakage (Causing Memory Error): Memory is increasing incrementally even after calling garbage collector and deleting large variable

Question

import numpy as np
import pandas as pd
import pdfplumber
import os
import psutil
import gc

file = 'path.pdf'
pdf = pdfplumber.open(file)
pages = pdf.pages
print('Total pages in pdf = '+str(len(pages)))

startPage = 3
chunkSize = 50

while(startPage < 250):
    print('Iteration')
    print('Memory at the start : ',end='')
    print((psutil.Process(os.getpid()).memory_info().rss)/(1024 * 1024),' MBs')

    endPage = startPage + chunkSize
    extract_pages = pages[startPage: endPage] 
    print(str(extract_pages[0])," to ",str(extract_pages[-1]))
    
    df = pd.DataFrame()
    for page in extract_pages:
        df = pd.concat([df,pd.DataFrame(np.array(page.extract_table()))], axis = 0)
        del page
        
#   df.to_csv()
    del df, extract_pages
    gc.collect()
    print('Memory at the end : ',end='')
    print((psutil.Process(os.getpid()).memory_info().rss)/(1024 * 1024),' MBs',end='\n\n')
    
    startPage = startPage + chunkSize 

print('finished')

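OUTPUT:
Total pages in pdf = 17225
Iteration
Memory at the start : 818.91015625  MBs
<Page:4>  to  <Page:53>
Memory at the end : 819.61328125  MBs

Iteration
Memory at the start : 819.61328125  MBs
<Page:54>  to  <Page:103>
Memory at the end : 963.703125  MBs

Iteration
Memory at the start : 963.703125  MBs
<Page:104>  to  <Page:153>
Memory at the end : 1324.65625  MBs

Iteration
Memory at the start : 1324.65625  MBs
<Page:154>  to  <Page:203>
Memory at the end : 1686.01171875  MBs

Iteration
Memory at the start : 1686.01171875  MBs
<Page:204>  to  <Page:253>
Memory at the end : 2047.60546875  MBs

finished

(Extracting text from a PDF)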

Answer 1

Check out this issue.

I ran your code as is on a 170-page, 3.1 MB PDF. It ended with:

Memory at the end : 1321.90625  MBs

Finished duration 55.18302297592163 secs.

Opening the PDF with a context manager and calling page.flush_cache() on each page, I got:

Memory at the end : 90.8125  MBs

Finished duration 68.00025987625122 secs.

Yes, it is slower, as the GitHub issue notes, but at least the memory leak is much smaller.

Working snippet (I removed the chunking for this example):

import numpy as np
import pdfplumber
import os
import psutil
import gc
import pandas as pd
from time import time

start = time()

with pdfplumber.open("file.pdf") as pdf:
    pages_len = len(pdf.pages)

print(f'Total pages in pdf = {pages_len}')

with pdfplumber.open("file.pdf") as pdf:
    df = pd.DataFrame()
    print('Memory at the start : ',end='')
    for index, page in enumerate(pdf.pages):
        if not index % 10:
            print(f'=== Page index {index} === ')

        print((psutil.Process(os.getpid()).memory_info().rss)/(1024 * 1024),' MBs')
        
        table = page.extract_table()
        df = pd.concat([df,pd.DataFrame(np.array(table))], axis = 0)
        page.flush_cache()

    gc.collect()
    print('Memory at the end : ',end='')
    print((psutil.Process(os.getpid()).memory_info().rss)/(1024 * 1024),' MBs',end='\n\n')


print(f'Finished duration {time() - start} secs.')
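
If the goal is to write the tables to disk (the question has a commented-out df.to_csv()), one option is to stream rows straight into a CSV file instead of growing a DataFrame. The following is only a sketch under that assumption: extract_tables_to_csv, csv_path, start and stop are hypothetical names, and it relies only on the extract_table() and flush_cache() page methods already used above.

```python
import csv
from itertools import islice


def extract_tables_to_csv(pages, csv_path, start=0, stop=None):
    """Stream table rows from page-like objects into one CSV file.

    `pages` is any iterable of objects exposing extract_table() and
    flush_cache(), as pdfplumber pages do. Rows are written as they are
    read, so no large DataFrame accumulates in memory.
    """
    rows_written = 0
    with open(csv_path, "w", newline="") as fh:
        writer = csv.writer(fh)
        for page in islice(pages, start, stop):
            table = page.extract_table()  # may be None when no table is found
            if table:
                writer.writerows(table)
                rows_written += len(table)
            page.flush_cache()  # release the objects cached for this page
    return rows_written


# Hypothetical usage with a real file (paths are placeholders):
# import pdfplumber
# with pdfplumber.open("file.pdf") as pdf:
#     extract_tables_to_csv(pdf.pages, "tables.csv", start=3, stop=253)
```

The start and stop arguments play the role of the chunk bounds in the question; whether this lowers peak memory further than the snippet above has not been measured here.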

Solution 1
0 2022-08-01 14:31:38