Memory leak when using elasticsearch parallel_bulk in Python

Question

I have a number of small jsonl files that are read in a loop and ingested into Elasticsearch. The Python process's memory usage appears to keep growing. The code below runs inside a class:

    def load_files_to_es(self, files_to_load):
        count = 0
        for file in files_to_load:
            if file.endswith(".jsonl"):
                with open(os.path.join("../pdl_out", file)) as clean_data_file:
                    try:
                        clean_data = json.load(clean_data_file)
                        count += 1
                    except Exception as e:
                        logging.error(f"{e} error processing {file}")
                    else:
                        logging.info("loading data to ES")
                        les.load_pdl_to_es(clean_data=clean_data, filename=file)
                    finally:
                        print(f"Prev File: {file}")
                        if count % 10 == 0:
                            gc.collect()

The code that uploads to Elasticsearch:

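import logging

import pandas as pd
from elasticsearch import Elasticsearch
from elasticsearch.helpers import parallel_bulk

es = Elasticsearch(['localhost:9200'], http_auth=None, scheme="http", port=9200)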
def insert_data(data_to_insert):
    data = pd.read_json(data_to_insert, orient='records', lines=True)
    for index, row in data.iterrows():
        data_as_json = row.to_json()
        yield {
            "_index": indexname,
            "_id": row['id'],
            "_source": data_as_json
        }


def load_pdl_to_es(clean_data=None, filename=''):
    try:
        for success, info in parallel_bulk(es, insert_data(data_to_insert=clean_data), request_timeout=30,
                                           queue_size=8, thread_count=8):
            if not success:
                logging.debug(info)
                logging.error(f"Insert records to elastic search failed for {filename}")
    except ConnectionError:
        logging.error("Connection error")
    except TimeoutError:
        logging.error("Connection Timed out")
    except Exception as e:
        logging.error(e)

Answer 1

Found the bug. The pandas DataFrame was causing the memory leak: even when the object was no longer referenced, it stayed in memory. See https://github.com/pandas-dev/pandas/issues/2659

Solution: drop pandas entirely and work with the JSON objects directly.
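
As a rough illustration of what "JSON objects directly" could look like, here is a minimal sketch of a generator that replaces `insert_data`. It assumes each file holds one JSON object per line and that every record has an `id` field, as the question's code implies. The names `jsonl_actions` and `index_name` are made up for this example and are not from the original code.

```python
import json


def jsonl_actions(path, index_name):
    """Yield one bulk action per non-empty line of a JSON Lines file.

    Assumes every record carries an "id" key (hypothetical, mirrors row['id']).
    """
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # skip blank lines between records
            doc = json.loads(line)  # plain dict, no DataFrame involved
            yield {"_index": index_name, "_id": doc["id"], "_source": doc}
```

The generator can be passed to `parallel_bulk` in place of `insert_data(...)`. Because it reads one line at a time, only the records currently queued for indexing need to be held in memory.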

Another option is to call the garbage collector manually with gc.collect() to remove old unreferenced objects (particularly pandas DataFrames).
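
A small sketch of the manual-collection approach, assuming the per-file work is wrapped in a function so that any DataFrame it creates goes out of scope before the collector runs. The helper name `process_with_gc` and its parameters are hypothetical. Note that `gc.collect()` only reclaims unreachable Python objects, so it may not shrink the memory the operating system reports for the process.

```python
import gc


def process_with_gc(paths, handle_file, collect_every=10):
    """Call handle_file(path) for each path, forcing a GC pass every few files.

    Returns the total number of unreachable objects found by gc.collect().
    """
    found = 0
    for n, path in enumerate(paths, start=1):
        # Any DataFrame should be created and dropped inside handle_file,
        # so no reference to it survives this call.
        handle_file(path)
        if n % collect_every == 0:
            found += gc.collect()
    found += gc.collect()  # final pass after the last file
    return found
```

Keeping the DataFrame local to `handle_file` matters more than the collection interval: an object that is still referenced from the loop cannot be collected, however often `gc.collect()` is called.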

Solution 1
0 2020-09-16 15:26:45
