
Neo4j, bulk load with Cypher commands

I'm new to Neo4j and there must be something I don't understand about the basics.

I have many objects in Java and I want to use them to populate a Neo4j graph, using the Java driver and Cypher. My code works like this:

// nodes
for ( Person person: persons )
  session.run ( String.format ( 
    "CREATE ( :Person { id: '%s', name: \"%s\", surname: \"%s\" })",
    person.getId(), person.getName(), person.getSurname ()
  ));

// index on id, so that the MATCH below can look nodes up directly
session.run ( "CREATE INDEX ON :Person(id)" );

// relations
for ( Friendship friendship: friendships )
  session.run ( String.format ( 
    "MATCH ( from:Person { id: '%s' } ), ( to:Person { id: '%s' } )\n" +
    "CREATE (from)-[:KNOWS]->(to)\n",
    friendship.getFrom().getId(), 
    friendship.getTo().getId() 
  )); 

(In fact, it's slightly more complicated, because I have a dozen node types and about the same number of relation types.)

Now, this is very slow: more than 1 hour to load 300k nodes and 1M relations (on a fairly fast MacBook Pro, with Neo4j taking 12/16GB of RAM).

Am I doing it the wrong way? Should I use the batch inserter instead? (I would prefer to be able to access the graph DB via the network.) Would I gain something by grouping more insertions into one transaction? (From the documentation, it seems transactions are only useful for rolling back and for isolation needs.)

I'm coming from Neo4j in Python, but I think the issue here is with your Cypher commands. I have two suggestions.

It may be faster to MATCH each endpoint in its own clause. On my primitive benchmark I see a difference of 24ms vs 15ms with this (EDIT: this benchmark is dubious):

MATCH ( from:Person { id: '%s' } )
MATCH ( to:Person { id: '%s' } )
CREATE (from)-[:KNOWS]->(to)

Another option is to use UNWIND. I use this with the Bolt interface to send fewer transactions, without using the Batch Inserter. Forgive the Python implementation I'm copying here; hopefully you can look at it along with the Neo4j Java driver docs to convert it.

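# Each map in "list" describes one relationship to merge.
payload = {"list": [{"a": "Name1", "b": "Name2"}, {"a": "Name3", "b": "Name4"}]}

# $list is the current parameter syntax; the older {list} form
# is no longer accepted by recent Neo4j versions.
statement = "UNWIND $list AS d "
statement += "MATCH (A:Person {name: d.a}) "
statement += "MATCH (B:Person {name: d.b}) "
statement += "MERGE (A)-[:KNOWS]-(B) "

# One transaction for the whole list; `session` is an open driver session.
tx = session.begin_transaction()
tx.run(statement, payload)
tx.commit()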
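
For the Java driver used in the question, the same idea might look roughly like the sketch below. It only builds the statement text and the parameter map. `Friendship` here is a hypothetical stand-in for the question's class, reduced to two id fields, and `UnwindPayload` is a made-up name. Matching on `id` instead of `name` is my assumption, chosen so that the index from the question can be used. With the official driver the result would be handed to something like `session.run(statement, params)`; that call is left as a comment so the snippet compiles without the driver on the classpath.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the question's Friendship class.
class Friendship {
  final String fromId;
  final String toId;

  Friendship(String fromId, String toId) {
    this.fromId = fromId;
    this.toId = toId;
  }
}

class UnwindPayload {
  // One statement text for every batch; only the $list parameter changes.
  static final String STATEMENT =
      "UNWIND $list AS d "
    + "MATCH (a:Person {id: d.fromId}) "
    + "MATCH (b:Person {id: d.toId}) "
    + "MERGE (a)-[:KNOWS]-(b)";

  // Builds {"list": [{"fromId": ..., "toId": ...}, ...]} from the objects.
  static Map<String, Object> params(List<Friendship> friendships) {
    List<Map<String, Object>> rows = new ArrayList<>();
    for (Friendship f : friendships) {
      Map<String, Object> row = new HashMap<>();
      row.put("fromId", f.fromId);
      row.put("toId", f.toId);
      rows.add(row);
    }
    Map<String, Object> params = new HashMap<>();
    params.put("list", rows);
    return params;
  }

  public static void main(String[] args) {
    List<Friendship> friendships = new ArrayList<>();
    friendships.add(new Friendship("1", "2"));
    friendships.add(new Friendship("3", "4"));

    Map<String, Object> params = params(friendships);
    // With the Neo4j Java driver (assumed): session.run(STATEMENT, params);
    System.out.println(STATEMENT);
    System.out.println(params);
  }
}
```

Because the ids travel as parameters rather than being formatted into the string, they need no quoting or escaping, and the statement text stays identical from one batch to the next.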

I think it's worth reporting my experience with this.

I've followed @sjc's suggestion and tried UNWIND. However, that wasn't so simple, because Cypher doesn't allow you to parameterise node labels or relation types (and I have a dozen labels and relation types). Eventually, though, I was able to loop over all possible types and send enough items (about 1000) in each UNWIND chunk.
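
The loop over types can be sketched roughly as follows. Since the relation type cannot be a parameter, it is spliced into the statement text once per type, and the items of that type are sent in chunks of 1000 as the `$rows` parameter. Everything here is illustrative rather than my actual code: `Relation`, `StatementRunner` and `ChunkedUnwindLoader` are hypothetical names, `StatementRunner` merely stands in for a driver call such as `session.run(statement, params)`, and all endpoints are assumed to be `:Person` nodes to keep the sketch short (with several labels, the grouping key would have to include the labels too). Because the type ends up in the query text, the sketch only accepts types made of letters, digits and underscores.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical relation record: a type plus the ids of its two endpoints.
class Relation {
  final String type, fromId, toId;

  Relation(String type, String fromId, String toId) {
    this.type = type;
    this.fromId = fromId;
    this.toId = toId;
  }
}

// Hypothetical stand-in for the driver call, e.g. session.run(statement, params).
interface StatementRunner {
  void run(String statement, Map<String, Object> params);
}

class ChunkedUnwindLoader {
  static final int CHUNK_SIZE = 1000;

  // Groups relations by type, then sends each group in chunks of CHUNK_SIZE.
  // Returns the number of statements sent.
  static int load(List<Relation> relations, StatementRunner runner) {
    Map<String, List<Map<String, Object>>> rowsByType = new LinkedHashMap<>();
    for (Relation r : relations) {
      // The type is concatenated into the query, so reject anything unusual.
      if (!r.type.matches("[A-Za-z_][A-Za-z0-9_]*"))
        throw new IllegalArgumentException("Unsafe relation type: " + r.type);
      Map<String, Object> row = new HashMap<>();
      row.put("fromId", r.fromId);
      row.put("toId", r.toId);
      rowsByType.computeIfAbsent(r.type, t -> new ArrayList<>()).add(row);
    }

    int sent = 0;
    for (Map.Entry<String, List<Map<String, Object>>> e : rowsByType.entrySet()) {
      // The type is part of the query text; only $rows is a parameter.
      String statement =
          "UNWIND $rows AS d "
        + "MATCH (a:Person {id: d.fromId}) "
        + "MATCH (b:Person {id: d.toId}) "
        + "CREATE (a)-[:" + e.getKey() + "]->(b)";

      List<Map<String, Object>> rows = e.getValue();
      for (int i = 0; i < rows.size(); i += CHUNK_SIZE) {
        int end = Math.min(i + CHUNK_SIZE, rows.size());
        Map<String, Object> params = new HashMap<>();
        params.put("rows", new ArrayList<>(rows.subList(i, end)));
        runner.run(statement, params);
        sent++;
      }
    }
    return sent;
  }

  public static void main(String[] args) {
    List<Relation> relations = new ArrayList<>();
    for (int i = 0; i < 2500; i++)
      relations.add(new Relation("KNOWS", "p" + i, "p" + (i + 1)));

    // A runner that only reports what would be sent to the server.
    int sent = load(relations, (statement, params) ->
        System.out.println(((List<?>) params.get("rows")).size() + " rows: " + statement));
    System.out.println(sent + " statements sent");
  }
}
```

Grouping first keeps the number of distinct statement texts equal to the number of relation types, however many chunks are sent.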

The code using UNWIND is much faster, yet still not fast enough in my opinion: it should be OK on a decent PC with a few million nodes, but not with hundreds of millions of nodes or more.

The batch inserter component is much faster (a few seconds to upload 1-2 million nodes), although it requires taking HTTP access down. I've also had a lot of problems with its dependency on Lucene 5.4, because I need to use it inside an application (the one that produces the data) that uses Lucene 6, and awful things happened when I tried to simply swap 5.4 for 6 in the classpath. I've read that there is some mechanism to make this possible, but it doesn't seem easy and certainly isn't well documented.

I definitely didn't expect so much trouble in executing such a basic operation efficiently.
