
Neo4j/Py2Neo timeout issue when importing large CSV files

When importing data from large CSV files (>200MB) into Neo4j, the response ends up hanging. The query does complete, and all records are imported; however, there seems to be some sort of response timeout which results in no indication that the import query has completed. This is an issue because we cannot automate importing multiple files into Neo4j: the script continues waiting for the query to finish, even though it already has.

Importing 1 file takes around 10-15 minutes.

No errors are thrown anywhere in the pipeline; everything simply hangs. I can only tell that the process has completed when the VM's CPU activity dies down.

This process does work on smaller files: an acknowledgement is sent back when the previous file has finished importing, and the script moves on to the next one.

I have tried running the script both from a Jupyter notebook and as a Python script directly on the console. I have even tried running the query directly against Neo4j through the browser console. Each approach results in a hanging query, so I am not sure whether the issue is coming from Neo4j or Py2Neo.

Example query:

USING PERIODIC COMMIT 1000
LOAD CSV FROM {csvfile}  AS line
MERGE (:Author { authorid: line[0], name: line[1] } )

Modified Python script using Py2Neo:

from azure.storage.blob import BlockBlobService
from py2neo import Graph

# py2neo connection; URI and credentials are placeholders
mygraph = Graph("bolt://<neo4j_host>:7687", auth=("<user>", "<password>"))
query = "..."  # the USING PERIODIC COMMIT / LOAD CSV statement shown above

blob_service = BlockBlobService(account_name="<name>", account_key="<key>")
generator = blob_service.list_blobs("parsed-csv-files")

for blob in generator:
    print(blob.name)
    csv_file_base = "http://<base_uri>/parsed-csv-files/"
    csvfile = csv_file_base + blob.name
    params = {"csvfile": csvfile}
    # Hangs here on large files, even though the import completes server-side
    mygraph.run(query, parameters=params)
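
For reference, a quick way to confirm the records were actually written is to count the :Author nodes after a run. A minimal check with py2neo, assuming the :Author label from the query above (connection details below are placeholders):

from py2neo import Graph

mygraph = Graph("bolt://<neo4j_host>:7687", auth=("<user>", "<password>"))

# Count :Author nodes to confirm the MERGE ran to completion
author_count = mygraph.evaluate("MATCH (a:Author) RETURN count(a)")
print("Authors imported:", author_count)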

Neo4j debug.log does not seem to be recording any errors.

Sample debug.log:

2019-05-30 05:44:32.022+0000 INFO [o.n.k.i.i.s.GenericNativeIndexProvider] Schema index cleanup job finished: descriptor=IndexRule[id=16, descriptor=Index( UNIQUE, :label[5](property[5]) ), provider={key=native-btree, version=1.0}, owner=42], indexFile=/data/databases/graph.db/schema/index/native-btree-1.0/16/index-16 Number of pages visited: 598507, Number of cleaned crashed pointers: 0, Time spent: 2m 25s 235ms
2019-05-30 05:44:32.071+0000 INFO [o.n.k.i.i.s.GenericNativeIndexProvider] Schema index cleanup job closed: descriptor=IndexRule[id=16, descriptor=Index( UNIQUE, :label[5](property[5]) ), provider={key=native-btree, version=1.0}, owner=42], indexFile=/data/databases/graph.db/schema/index/native-btree-1.0/16/index-16
2019-05-30 05:44:32.071+0000 INFO [o.n.k.i.i.s.GenericNativeIndexProvider] Schema index cleanup job started: descriptor=IndexRule[id=19, descriptor=Index( UNIQUE, :label[6](property[6]) ), provider={key=native-btree, version=1.0}, owner=46], indexFile=/data/databases/graph.db/schema/index/native-btree-1.0/19/index-19
2019-05-30 05:44:57.126+0000 INFO [o.n.k.i.i.s.GenericNativeIndexProvider] Schema index cleanup job finished: descriptor=IndexRule[id=19, descriptor=Index( UNIQUE, :label[6](property[6]) ), provider={key=native-btree, version=1.0}, owner=46], indexFile=/data/databases/graph.db/schema/index/native-btree-1.0/19/index-19 Number of pages visited: 96042, Number of cleaned crashed pointers: 0, Time spent: 25s 55ms
2019-05-30 05:44:57.127+0000 INFO [o.n.k.i.i.s.GenericNativeIndexProvider] Schema index cleanup job closed: descriptor=IndexRule[id=19, descriptor=Index( UNIQUE, :label[6](property[6]) ), provider={key=native-btree, version=1.0}, owner=46], indexFile=/data/databases/graph.db/schema/index/native-btree-1.0/19/index-19

EDIT: used a simpler query, which still gives the same issue

Since the query takes a long time to complete on the database side, py2neo may be having issues while waiting for the response.

There should not be any issues with USING PERIODIC COMMIT itself; it simply commits every 1000 rows instead of holding the whole import in a single transaction.

Have you tried the Python neo4j driver, reading the CSV from Python and executing the query that way?

Here's sample code using the neo4j driver; it reads the CSV in chunks with pandas and merges each batch via UNWIND.

import pandas as pd
from neo4j import GraphDatabase

driver = GraphDatabase.driver(serveruri, auth=(user, pwd))
with driver.session() as session:
    file = config['spins_file']
    # Read the CSV in chunks so each transaction stays small
    row_chunks = pd.read_csv(file, sep=',', error_bad_lines=False,
                             index_col=False,
                             low_memory=False,
                             chunksize=config['chunk_size'])
    for i, rows in enumerate(row_chunks):
        print("Chunk {}".format(i))
        # Pass the chunk as a list of dicts under the 'rows' parameter
        rows_dict = {'rows': rows.fillna(value="").to_dict('records')}
        # Column names 'authorid' and 'name' are assumed to match the CSV header
        session.run("""
                    UNWIND $rows AS row
                    MERGE (:Author { authorid: row.authorid, name: row.name })
                    """,
                    rows_dict)
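
Because each chunk is its own session.run call, the driver returns after every batch instead of waiting minutes on a single huge statement, so the client gets visible progress. For completeness, the placeholders above could be filled in along these lines (illustrative values only):

serveruri = "bolt://localhost:7687"   # example Bolt URI
user, pwd = "neo4j", "password"       # example credentials
config = {
    "spins_file": "authors.csv",      # example CSV with 'authorid' and 'name' columns
    "chunk_size": 1000,               # rows per batch, analogous to USING PERIODIC COMMIT 1000
}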
