Cypher Neo4j - Query that uses the clause 'IN' on the collection is very slow

Question

Hi, I'm trying to import some data from CSV files into Neo4j 2.3.1. I've already imported some nodes of type :Author and :Article.

The Author node is composed of properties like:

key -> String
principal_name -> String
alias -> Collection of String
........

I've also added indexes on principal_name, alias and key.

The problem comes when I try to import the relationships between nodes of type Article and Author.

The CSV has this type of structure:

articleKey,authorName

As a naive solution, I tried to create the relationships with a query like this one:

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///myPath.csv" AS line
MATCH (art:Article{key: line.key1})
MATCH (auth:Author) WHERE line.key2 IN (auth.alias)
CREATE UNIQUE (auth)-[:AUTHOR_OF]->(art);

The query is painfully slow: the profiler shows that the second MATCH is the bottleneck. It takes 10-12 seconds to create each relationship because I have many Authors in the db (around 1,000,000).

So I'm looking for a way to run something like the following to get faster execution (this is only an example to illustrate the structure I want to obtain):

MATCH (auth:Author{principal_name: line.key2})
IF auth null THEN
  MATCH (auth:Author) WHERE line.key2 IN (auth.alias)
END

Is there a way to do that with Cypher?

Answer 1

If you changed your model so that all of an Author node's names (both the principal name and all the aliases) are in separate Name nodes, like this:

(auth:Author)-[:HAS_NAME]->(name:Name {name: 'Fred McGillicutty'})

Then the query would be simply:

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///myPath.csv" AS line
MATCH
  (art:Article { key: line.key1 }),
  (auth:Author)-[:HAS_NAME]->(name:Name { name:line.key2 })
CREATE (auth)-[:AUTHOR_OF]->(art);

If you create indexes on :Article(key) and :Name(name), this query should be very efficient.
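For reference, here is a rough sketch of how the indexes and the Name nodes could be set up from the existing data. It assumes the Author nodes still carry the principal_name and alias properties described in the question, and it uses the Neo4j 2.x index syntax. Treat it as a starting point rather than a tested migration; with around a million authors you may need to run the second part in batches.

```cypher
// Create the indexes first, each as its own statement, and let them
// come online before running the data statements below.
CREATE INDEX ON :Article(key);
CREATE INDEX ON :Name(name);

// Turn every principal name and alias into a Name node linked to its Author.
// coalesce() guards against authors that have no alias property at all.
MATCH (auth:Author)
UNWIND [auth.principal_name] + coalesce(auth.alias, []) AS n
WITH auth, n
WHERE n IS NOT NULL
MERGE (name:Name {name: n})
MERGE (auth)-[:HAS_NAME]->(name);
```

Note that MERGE makes two authors with the same name share one Name node, so a lookup by that name returns both authors. That is the same ambiguity the original alias collections had.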

Answer 2

If many authors have aliases and you expect to query on these aliases, you should model them as nodes. I think this will speed up the queries that create relationships and allow for more flexible queries involving aliases.

(:Alias)<-[:HAS]-(:Author)-[:AUTHOR_OF]->(:Article)

Add indexes on the properties you look up by. If possible, use uniqueness constraints.
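As an illustration, uniqueness constraints in the Neo4j 2.x syntax could look like the sketch below. It assumes that key really is unique for both Article and Author nodes, which the question does not state explicitly.

```cypher
// A uniqueness constraint also provides an index on the same property.
// If a plain index on that label/property already exists, drop it first
// (e.g. DROP INDEX ON :Article(key)), otherwise creating the constraint fails.
CREATE CONSTRAINT ON (art:Article) ASSERT art.key IS UNIQUE;
CREATE CONSTRAINT ON (auth:Author) ASSERT auth.key IS UNIQUE;
```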

You can now query for Alias and Author nodes to add relationships:

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///myPath.csv" AS line
MATCH (art:Article {key: line.key1})
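// get the Author directly or by alias
// (assumes each Alias node stores its value in a `name` property)
MATCH (alias:Alias)<-[:HAS]-(auth:Author)
WHERE alias.name = line.key2 OR auth.principal_name = line.key2
// MERGE avoids duplicate relationships when an author has several aliases
MERGE (auth)-[:AUTHOR_OF]->(art)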

With indexes the lookups should be pretty fast.
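The "principal name first, alias as a fallback" structure from the question can be approximated with OPTIONAL MATCH and coalesce(). This is only a sketch under the same assumptions as above: Alias nodes keep their value in a hypothetical `name` property, and there are indexes on :Author(principal_name) and :Alias(name).

```cypher
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///myPath.csv" AS line
MATCH (art:Article {key: line.key1})
// first try the principal name ...
OPTIONAL MATCH (byName:Author {principal_name: line.key2})
// ... then the alias; both lookups stay null when nothing matches
OPTIONAL MATCH (:Alias {name: line.key2})<-[:HAS]-(byAlias:Author)
// prefer the principal-name match, fall back to the alias match
WITH art, coalesce(byName, byAlias) AS auth
WHERE auth IS NOT NULL
MERGE (auth)-[:AUTHOR_OF]->(art);
```

Unlike the single MATCH with OR, this version also finds authors that have no Alias nodes at all, and it skips CSV rows where neither lookup finds an author.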

2 answers

solution1
1 ACCEPTED 2015-12-03 18:23:50

solution2
0 2015-12-03 16:05:25
