Memory leak while using elasticsearch parallel_bulk in Python
I have a number of small jsonl files that are read in a loop and ingested into Elasticsearch. The Python process seems to keep growing in memory usage. The code below runs inside a class:
```python
def load_files_to_es(self, files_to_load):
    count = 0
    for file in files_to_load:
        if file.endswith(".jsonl"):
            with open(os.path.join("../pdl_out", file)) as clean_data_file:
                try:
                    clean_data = json.load(clean_data_file)
                    count += 1
                except Exception as e:
                    logging.error(f"{e} error processing {file}")
                else:
                    logging.info("loading data to ES")
                    # `les` is the module containing load_pdl_to_es (below)
                    les.load_pdl_to_es(clean_data=clean_data, filename=file)
                finally:
                    print(f"Prev File: {file}")
        if count % 10 == 0:
            gc.collect()
```
The code that uploads to Elasticsearch:
```python
es = Elasticsearch(['localhost:9200'], http_auth=None, scheme="http", port=9200)

def insert_data(data_to_insert):
    data = pd.read_json(data_to_insert, orient='records', lines=True)
    for index, row in data.iterrows():
        data_as_json = row.to_json()
        yield {
            "_index": indexname,
            "_id": row['id'],
            "_source": data_as_json
        }

def load_pdl_to_es(clean_data=None, filename=''):
    try:
        for success, info in parallel_bulk(es, insert_data(data_to_insert=clean_data),
                                           request_timeout=30, queue_size=8, thread_count=8):
            if not success:
                logging.debug(info)
                logging.error(f"Insert records to elastic search failed for {filename}")
    except ConnectionError:
        logging.error("Connection error")
    except TimeoutError:
        logging.error("Connection Timed out")
    except Exception as e:
        logging.error(e)
```
Found the problem: the pandas DataFrame was causing the memory leak. Even after the object is no longer referenced, it stays in memory. See https://github.com/pandas-dev/pandas/issues/2659
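One way to check this locally is to watch traced allocations across iterations; a minimal diagnostic sketch using the standard-library tracemalloc (the inline JSON string is just a placeholder for real file contents):

```python
import io
import tracemalloc

import pandas as pd

tracemalloc.start()
for i in range(5):
    # Build and immediately drop a DataFrame; if memory were fully
    # released, the "current" figure would stay flat across passes.
    df = pd.read_json(io.StringIO('{"id": 1}\n{"id": 2}\n'),
                      orient="records", lines=True)
    del df
    current, peak = tracemalloc.get_traced_memory()
    print(f"iteration {i}: current={current} B, peak={peak} B")
```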
The fix: remove pandas entirely and work with the JSON objects directly.
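A minimal sketch of the pandas-free version, assuming each jsonl file holds one JSON object per line and every record has an `id` field, as in the original generator (`read_jsonl` is a hypothetical helper, not part of the original code):

```python
import json

def read_jsonl(path):
    # Stream one parsed object per line instead of loading the
    # whole file up front, so at most one record is held in memory.
    with open(path) as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

def insert_data(records, indexname):
    # Yield bulk actions straight from plain dicts; no DataFrame
    # is created, so there is nothing for pandas to hold on to.
    for record in records:
        yield {
            "_index": indexname,
            "_id": record["id"],
            "_source": record,
        }
```

The two generators compose, e.g. `parallel_bulk(es, insert_data(read_jsonl(path), indexname), ...)`, so records stream from disk to Elasticsearch without an intermediate DataFrame.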
An alternative is to call the garbage collector manually with gc.collect() to clean up old unreferenced objects (particularly the pandas DataFrames).
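If pandas has to stay, a sketch of that variant (`load_one_file` is a hypothetical wrapper around the original upload logic):

```python
import gc

import pandas as pd

def load_one_file(path):
    # Hypothetical per-file ingest: build the DataFrame, use it,
    # then drop the reference and force a collection pass before
    # the next file is read.
    data = pd.read_json(path, orient="records", lines=True)
    try:
        return len(data)  # placeholder for the real bulk upload
    finally:
        del data
        gc.collect()
```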