
Reading lots of CSV data into Neo4j using execute_query, Ruby and Neography

I wrote a quick Ruby routine to load some very large CSV data. I got frustrated with various out-of-memory issues trying to use LOAD CSV, so I reverted to Ruby. I'm relatively new to Neo4j, so I'm trying Neography to just call a Cypher query that I build as a string.

The Cypher code uses MERGE to add a relationship between two existing nodes:

cmdstr = "MATCH (a:Provider {npi: xxx}), (b:Provider {npi: yyy}) MERGE (a)-[:REFERS_TO {qty: 1}]->(b)"

@neo.execute_query(cmdstr)
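
The loop itself is essentially the following (a simplified sketch; the file name and the npi_a/npi_b column names are stand-ins, not the real ones):

require 'csv'
require 'neography'

@neo = Neography::Rest.new("http://localhost:7474")

# One MERGE round-trip per CSV row; npi is assumed numeric, as in the query above
CSV.foreach("referrals.csv", headers: true) do |row|
  cmdstr = "MATCH (a:Provider {npi: #{row['npi_a']}}), " \
           "(b:Provider {npi: #{row['npi_b']}}) " \
           "MERGE (a)-[:REFERS_TO {qty: 1}]->(b)"
  @neo.execute_query(cmdstr)
end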

I'm just looping through the rows in a file and running these. It fails after about 30,000 rows with the socket error "cannot assign requested address". I believe GC is somehow causing issues, but the logs don't tell me anything. I've tried tuning GC differently and trying different amounts of heap; it fails in the same place every time. Any help appreciated.

[edit] More info - Running netstat --inet shows thousands of connections to localhost:7474, which suggests each request opens a fresh TCP connection and the client eventually exhausts its ephemeral ports (which would explain "cannot assign requested address"). Does execute_query not reuse connections by design, or is this an issue?

I've now tried parameters and the behavior is the same. How would you code this kind of query using batches, and how do I make sure the index on npi is used?
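
For reference, the parameterized version looks like this (Neography's execute_query takes a params hash as its second argument; {npi_a} is Cypher 2.x parameter syntax):

cypher = "MATCH (a:Provider {npi: {npi_a}}), (b:Provider {npi: {npi_b}}) " \
         "MERGE (a)-[:REFERS_TO {qty: 1}]->(b)"

# With an index on :Provider(npi), these label+property lookups should be index hits
@neo.execute_query(cypher, { npi_a: row['npi_a'], npi_b: row['npi_b'] })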

I was finally able to get this to work by changing the MERGE to a CREATE (after deleting all existing relationships first). It still took a long time, but it stayed linear relative to the number of relationships.
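
Concretely, the change was just the relationship clause: CREATE skips MERGE's per-row check for an existing relationship (which gets slower as a node's degree grows), so the cost per row stays flat. A sketch, assuming the relationships were wiped first:

# One-time cleanup before the reload (a big graph would need this batched)
@neo.execute_query("MATCH (:Provider)-[r:REFERS_TO]->(:Provider) DELETE r")

# Then, per row: CREATE instead of MERGE
cypher = "MATCH (a:Provider {npi: {npi_a}}), (b:Provider {npi: {npi_b}}) " \
         "CREATE (a)-[:REFERS_TO {qty: 1}]->(b)"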

I also changed garbage collection from Concurrent Mark/Sweep to the parallel collector; the concurrent sweep would just fail and fall back to a full GC anyway.

#wrapper.java.additional=-XX:+UseConcMarkSweepGC
wrapper.java.additional=-XX:+UseParallelGC
wrapper.java.additional=-XX:+UseNUMA
wrapper.java.additional=-XX:+CMSClassUnloadingEnabled
wrapper.java.additional=-Xmn630m

With Neo4j 2.1.3, the LOAD CSV issue is resolved:

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "http://npi_data.csv" as line
MATCH (a:Provider {npi: line.xxx})
MATCH (b:Provider {npi: line.yyy}) 
MERGE (a)-[:REFERS_TO {qty: line.qty}]->(b);

In your Ruby code you should use Cypher parameters and probably the transactional API. Do you limit the concurrency of your requests somehow (e.g. a single client)?
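
For the batching part of the question, one approach that also sidesteps the connection churn is Neo4j 2.x's transactional HTTP endpoint: POST many parameterized statements per request over a single keep-alive connection. A sketch with plain Net::HTTP (the batch size, file name and column names are arbitrary; Neo4j 2.2+ would also need an Authorization header):

require 'csv'
require 'json'
require 'net/http'

CYPHER = "MATCH (a:Provider {npi: {npi_a}}), (b:Provider {npi: {npi_b}}) " \
         "MERGE (a)-[:REFERS_TO {qty: {qty}}]->(b)".freeze

# One TCP connection for everything, so no ephemeral-port exhaustion
Net::HTTP.start("localhost", 7474) do |http|
  CSV.foreach("referrals.csv", headers: true).each_slice(1000) do |rows|
    statements = rows.map do |row|
      { statement: CYPHER,
        parameters: { npi_a: row['npi_a'], npi_b: row['npi_b'], qty: row['qty'].to_i } }
    end
    # /db/data/transaction/commit runs the whole batch in one auto-committed transaction
    req = Net::HTTP::Post.new("/db/data/transaction/commit",
                              "Content-Type" => "application/json")
    req.body = JSON.generate(statements: statements)
    res = http.request(req)
    errors = JSON.parse(res.body)["errors"]
    raise "Batch failed: #{errors.inspect}" unless errors.empty?
  end
end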

Also make sure to have an index or constraint created for your providers:

 CREATE INDEX ON :Provider(npi);

or

 CREATE CONSTRAINT ON (p:Provider) ASSERT p.npi IS UNIQUE;
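
Either one can be issued from Neography up front (pick one of the two, since the constraint already backs npi with an index; note also that schema statements can't be mixed with data statements in the same transaction):

# Run once before loading
@neo.execute_query("CREATE INDEX ON :Provider(npi)")
# or, for uniqueness plus the index:
# @neo.execute_query("CREATE CONSTRAINT ON (p:Provider) ASSERT p.npi IS UNIQUE")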
