
Neo4j multi-client massive insertion - REST very poor performance - other ways?

I'm trying to benchmark Neo4j massive insertion in a client-server environment. So far I've found only two ways to do it:

  1. use REST
  2. implement a server extension

I can say upfront that our design requires inserting from many concurrently running processes/machines, so using the batch inserter with a direct connection is not an option.

I would also like to avoid having to implement a server extension, as we already have a tight schedule.

I benchmarked massive insertion via REST from just a single client, sending 2 kinds of very simple Cypher queries:

create (vertex:V {guid: {guid}, vtype: {vtype}, random1: {random1}, random2: {random2} })

match (a:V {guid: {a} }) match (b:V {guid: {b} }) create (a)-[:label]->(b)
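For reference, here is a minimal sketch of how the two queries above could be wrapped as parameterized statements for the Neo4j 2.x transactional HTTP endpoint. The question does not say which REST endpoint was used, so the `/db/data/transaction/commit` path is an assumption, and the helper names `statement` and `request_body` and the sample values are made up for illustration. The snippet only builds the JSON body; it sends nothing.

```python
import json

# Assumption: the auto-commit transactional endpoint of Neo4j 2.x.
COMMIT_PATH = "/db/data/transaction/commit"

# The two queries from the question, using the 2.x {param} syntax.
CREATE_VERTEX = (
    "CREATE (vertex:V {guid: {guid}, vtype: {vtype}, "
    "random1: {random1}, random2: {random2}})"
)
CREATE_EDGE = (
    "MATCH (a:V {guid: {a}}) MATCH (b:V {guid: {b}}) "
    "CREATE (a)-[:label]->(b)"
)


def statement(cypher, **params):
    """Hypothetical helper: one statement entry with its parameters."""
    return {"statement": cypher, "parameters": params}


def request_body(*statements):
    """Hypothetical helper: JSON body carrying one or more statements."""
    return json.dumps({"statements": list(statements)})


# Sample values are invented for illustration.
vertex_body = request_body(
    statement(CREATE_VERTEX, guid="g-1", vtype="person", random1=1, random2=2)
)
edge_body = request_body(statement(CREATE_EDGE, a="g-1", b="g-2"))
```

Each body would then be POSTed to `COMMIT_PATH` on the server with a `Content-Type: application/json` header. Keeping the Cypher text constant and varying only the parameters lets the server reuse the cached query plan.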

The guid field had an index.

Results so far are very poor: around (10k V + 40k E) in 13 minutes, compared to competing products like Titan or Orient, which provide an efficient server out of the box and throughput of around (10k V + 40k E) per minute.

I tried longer-lasting transactions and query parameters; neither gave any significant gain. Furthermore, the overhead from REST itself is very small: I tested dummy queries and they execute much, much faster (and both client and server are on the same machine). I also tried inserting from multiple threads, but performance does not scale up.

I found another StackOverflow question where the advice was to batch inserts into large requests containing thousands of commands and commit periodically. Unfortunately, due to the nature of how we generate the data, batching the requests is not feasible. Ideally we'd like the inserts to be atomic operations and have the results appear as soon as they are executed (no need for transactions, in fact).
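To make the batching advice concrete, here is a rough sketch of grouping many statements into a few large request bodies. It assumes each statement is a dict in the `{"statement": ..., "parameters": ...}` shape used by the Neo4j 2.x transactional HTTP API; the function names and the batch size of 1000 are illustrative choices, not values from the other question.

```python
import json
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of at most `size` items."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def batched_bodies(statements, batch_size=1000):
    """Yield one JSON request body per batch of statements."""
    for chunk in chunked(statements, batch_size):
        yield json.dumps({"statements": chunk})


# Hypothetical workload: 2500 vertex inserts become 3 requests.
inserts = [
    {"statement": "CREATE (v:V {guid: {guid}})", "parameters": {"guid": i}}
    for i in range(2500)
]
bodies = list(batched_bodies(inserts))
```

Sending each body to the commit endpoint runs the whole batch in one transaction, which is exactly the trade-off described above: fewer round trips and commits, but results only become visible per batch.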

Thus my questions are:

  1. are my Cypher queries optimal for the insertion?
  2. are the results so far in line with what can be achieved with REST (or can I squeeze much more out of REST)?
  3. are there any other ways to perform efficient multi-client massive insertion?

I have a number of thoughts/questions that don't fit very well in a comment ;)

  • What version of Neo4j are you using? 2.3 introduced some things which might help.

  • When you say you have an index, do you mean the new style and not the legacy indexes? The newer indexes are created with CREATE INDEX ON :V(guid) and apply to the combination of a label and a property. You can try your queries in the web console prefixed with PROFILE to see whether the query is hitting the index and where it might be slow.

  • If you can have the data in CSV format, you might look into the LOAD CSV clause in Cypher. That's also a batch sort of thing, so it might not be as useful.

  • I don't think it would help performance much, but this is a bit nicer to read:

    match (a:V {guid: {a} }), (b:V {guid: {b} }) create (a)-[:label]->(b)

  • I know it's of no help now, but Neo4j 3.0 is planned to have a new compressed binary socket protocol called Bolt, which should be an improvement over REST. It's estimated for Q2.

I know a lot of these suggestions probably aren't too helpful, but they're things to think about. There's also a public Slack chat for Neo4j here:

http://neo4j.com/blog/public-neo4j-users-slack-group/

I'll share this question there to see if anybody has any ideas.

EDIT:

Max DeMarzi passed on one of his articles on queueing requests, which might be useful:

http://maxdemarzi.com/2014/07/01/scaling-concurrent-writes-in-neo4j/

Looks like you'd need to write a bit of Java, but he lays it out for you.
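To give a feel for the general queueing idea (this is not the code from the article, which is Java running inside the server), here is a small sketch: many producers submit individual writes, and a single worker drains the queue and hands the writes to a flush function in batches. The `WriteQueue` class, its parameters, and the `flush` callable are all hypothetical; in practice `flush` would commit the batch in one transaction.

```python
import queue
import threading


class WriteQueue:
    """Accepts single writes; one worker thread flushes them in batches."""

    def __init__(self, flush, max_batch=100, max_wait=0.05):
        self._flush = flush          # hypothetical callable taking a list of writes
        self._max_batch = max_batch  # upper bound on writes per flush
        self._max_wait = max_wait    # seconds to wait for each further write
        self._queue = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, write):
        """Called by any number of producer threads."""
        self._queue.put(write)

    def close(self):
        """Flush whatever is pending and stop the worker."""
        self._queue.put(None)  # sentinel marking the end of the stream
        self._worker.join()

    def _run(self):
        while True:
            item = self._queue.get()  # block until there is work
            if item is None:
                return
            batch, stop = [item], False
            while len(batch) < self._max_batch:
                try:
                    item = self._queue.get(timeout=self._max_wait)
                except queue.Empty:
                    break  # nothing more arrived in time; flush what we have
                if item is None:
                    stop = True
                    break
                batch.append(item)
            self._flush(batch)
            if stop:
                return


# Usage: collect batches in a list instead of talking to a database.
flushed = []
writes = WriteQueue(flushed.append, max_batch=10)
for i in range(25):
    writes.submit(i)
writes.close()
```

The clients still issue one write at a time, so the data generation does not need to change; the batching happens behind the queue. The cost is that a write is not visible until its batch is flushed, bounded here by `max_batch` and `max_wait`.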
