
Neo4J: GC overhead limit exceeded when creating relationship

code is:

load csv with headers from "https://data.cityofnewyork.us/api/views/feu5-w2e2/rows.csv?accessType=DOWNLOAD" as row
create (n:Contact) set n=row;

load csv with headers from "https://data.cityofnewyork.us/api/views/tesw-yqqr/rows.csv?accessType=DOWNLOAD" as row
create (n:Building) set n=row;

create index on :Contact(RegistrationID);

create index on :Building(RegistrationID);

Then when I try to create the relationships, memory usage climbs to around 8 GB, the query eventually times out, and the application dies. The query is:

MATCH (b:Building),(c:Contact)
where b.RegistrationID = c.RegistrationID
create (b)-[:CONTACTS_FOR]->(c)

Windows 7 64-bit, i7-3770, 32 GB RAM...

If your dataset is large, you don't want to perform any query that operates over an entire label: you're almost guaranteed to see a memory spike, a timeout, and eventually an OutOfMemory error in the neo4j logs. The order you should perform those queries in is:

1) Make the indexes.

2) Load one of the datasets and create its relationships as you go, using periodic commit so that you don't have to hold the whole result in memory [EDIT: Since RegistrationID is not a unique property, create a stand-in node for it, so that part of your relationship can be pre-built]:

USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM "https://data.cityofnewyork.us/api/views/tesw-yqqr/rows.csv?accessType=DOWNLOAD" AS row
CREATE (b:Building) SET b = row
MERGE (i:RegistrationID {RegistrationID: row.RegistrationID})
MERGE (b) - [:CONTACTS_FOR] -> (i)

3) Then, merge in the other dataset. Your existing index will make sure that if you use MERGE instead of CREATE, you'll match the RegistrationID nodes that were already created in the last step:

USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM "https://data.cityofnewyork.us/api/views/feu5-w2e2/rows.csv?accessType=DOWNLOAD" AS row
CREATE (c:Contact) SET c = row
MERGE (i:RegistrationID {RegistrationID: row.RegistrationID})
MERGE (i) - [:CONTACTS_FOR] -> (c)

To query contacts by building, use a variable-length relationship pattern (the * operator) like so:

MATCH (b:Building) WHERE <whatever conditions you want>
WITH b
MATCH (b) - [:CONTACTS_FOR*2] -> (c:Contact)
RETURN c
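
For example, assuming both hops were created with the same relationship type (:CONTACTS_FOR here), a lookup for a single building's contacts would look like this (the RegistrationID value is hypothetical):

```cypher
// Find one building by its registration ID, then walk two
// hops (Building -> RegistrationID node -> Contact).
MATCH (b:Building {RegistrationID: "123456"})
MATCH (b) - [:CONTACTS_FOR*2] -> (c:Contact)
RETURN c
```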

Now, if your data is already loaded and you don't want to reload it, you'll have to use SKIP and LIMIT on your query to process only a chunk at a time:

MATCH (b:Building)
WITH b
SKIP 0
LIMIT 500
MATCH (c:Contact) WHERE c.RegistrationID = b.RegistrationID
MERGE (b) - [:CONTACTS_FOR] -> (c)

and then run it repeatedly, increasing the SKIP value by 500 each time (LIMIT stays at 500).
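
A sketch of the second batch, matching on the shared RegistrationID property — note that only the SKIP value has changed:

```cypher
// Second chunk of 500 buildings: SKIP advances, LIMIT stays fixed.
MATCH (b:Building)
WITH b
SKIP 500
LIMIT 500
MATCH (c:Contact) WHERE c.RegistrationID = b.RegistrationID
MERGE (b) - [:CONTACTS_FOR] -> (c)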

Even if your machine is theoretically capable of handling that large a query, there are limits set by both neo4j and the JVM it runs on that may prevent you from leveraging it. Try to come up with a solution that uses lots of little iterative queries instead of one big one; it's usually a lot more effective.
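
If the APOC plugin happens to be installed (it's a separate download, so this is an assumption), apoc.periodic.iterate can handle the batching in a single call, committing after every 500 pairs instead of holding everything in one transaction:

```cypher
// First argument streams the pairs to process;
// second argument is run on each pair, committed in batches of 500.
CALL apoc.periodic.iterate(
  "MATCH (b:Building), (c:Contact) WHERE b.RegistrationID = c.RegistrationID RETURN b, c",
  "MERGE (b) - [:CONTACTS_FOR] -> (c)",
  {batchSize: 500}
)
```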
