
Memory Leak while using elasticsearch parallel_bulk in python

I have small JSONL files that are read in a loop and ingested into Elasticsearch. The Python process seems to keep increasing its memory usage. The code below runs inside a class:

    # module-level imports assumed: os, gc, logging, and `les`, the module containing the loader below
    def load_files_to_es(self, files_to_load):
        count = 0
        for file in files_to_load:
            if file.endswith(".jsonl"):
                with open(os.path.join("../pdl_out", file)) as clean_data_file:
                    try:
                        # read the raw JSON-lines text; json.load() would fail on a
                        # multi-line .jsonl file, and pandas expects a string here
                        clean_data = clean_data_file.read()
                        count += 1
                    except Exception as e:
                        logging.error(f"{e} error processing {file}")
                    else:
                        logging.info("loading data to ES")
                        les.load_pdl_to_es(clean_data=clean_data, filename=file)
                    finally:
                        print(f"Prev File: {file}")
                        if count % 10 == 0:
                            gc.collect()

The code that uploads to Elasticsearch:

    import json
    import logging

    import pandas as pd
    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import parallel_bulk

    es = Elasticsearch(['localhost:9200'], http_auth=None, scheme="http", port=9200)
    indexname = "pdl"  # placeholder; the real index name is set elsewhere

    def insert_data(data_to_insert):
        # parse the JSON-lines text into a DataFrame and yield one bulk action per row
        data = pd.read_json(data_to_insert, orient='records', lines=True)
        for index, row in data.iterrows():
            yield {
                "_index": indexname,
                "_id": row['id'],
                # round-trip through to_json() so numpy types become plain JSON types
                "_source": json.loads(row.to_json()),
            }


    def load_pdl_to_es(clean_data=None, filename=''):
        try:
            for success, info in parallel_bulk(es, insert_data(data_to_insert=clean_data),
                                               request_timeout=30, queue_size=8, thread_count=8):
                if not success:
                    logging.debug(info)
                    logging.error(f"Insert records to elastic search failed for {filename}")
        except ConnectionError:
            logging.error("Connection error")
        except TimeoutError:
            logging.error("Connection timed out")
        except Exception as e:
            logging.error(e)

Figured out the error: the pandas DataFrame causes the memory leak. The object remains in memory even after it is no longer referenced. See https://github.com/pandas-dev/pandas/issues/2659

Solution: remove pandas completely and use JSON objects directly; a sketch of what that generator could look like follows.
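As an illustration (not necessarily the exact code used), a pandas-free insert_data might look like this, assuming clean_data is the raw JSON-lines text read from the file, every document carries an id field, and indexname is defined as above:

    import json

    def insert_data(data_to_insert):
        # parse each JSON line directly; no DataFrame is ever created,
        # so nothing heavyweight survives past the generator
        for line in data_to_insert.splitlines():
            if not line.strip():
                continue
            doc = json.loads(line)
            yield {
                "_index": indexname,
                "_id": doc["id"],
                "_source": doc,
            }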

Another option would be to call gc.collect() to invoke the garbage collector manually and remove the old unreferenced objects (specifically the pandas DataFrames).
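A minimal sketch of that approach, using a hypothetical wrapper around the existing loader so that a collection runs once per file:

    import gc

    def load_pdl_to_es_and_reclaim(clean_data=None, filename=''):
        # hypothetical wrapper: run the bulk load, then force a collection
        # so the DataFrame created inside insert_data() is reclaimed promptly
        load_pdl_to_es(clean_data=clean_data, filename=filename)
        gc.collect()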
