Poor performance when inserting nodes into Neo4j with the Java API

Question

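I am trying to insert about 2 million nodes into Neo4j, but I am running into performance problems.

I am using Neo4j Enterprise 2.2.0 with a server extension written in Java. My machine has an SSD, 32 GB of RAM, and an Intel Core i7 CPU, and it runs Windows 8. I run the standalone version of the server and start it by running Neo4j.bat in the bin folder.

Right now, inserting 10,000 nodes without relationships takes about 25 seconds (I will need to add relationships later, but node insertion alone is already a problem).

I assumed this was a configuration issue, so I tried several settings, but performance did not change. What I find strange is that the Java process never allocates more than about 3 GB, even though I set wrapper.java.initmemory and wrapper.java.maxmemory to 15000 in neo4j-wrapper.conf.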
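One way to check whether the heap settings actually reach the JVM is to log the heap limit and startup arguments from inside the running process, for example from the server extension. The sketch below uses only standard JDK calls; the class name HeapCheck is hypothetical and not part of Neo4j.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

// Hypothetical helper: reports what the running JVM was really started with.
final class HeapCheck {

    /** Maximum heap the running JVM will attempt to use, in megabytes. */
    static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    /** Arguments the JVM was started with (for example -Xms/-Xmx flags). */
    static List<String> jvmArguments() {
        return ManagementFactory.getRuntimeMXBean().getInputArguments();
    }

    public static void main(String[] args) {
        System.out.println("Max heap (MB): " + maxHeapMb());
        System.out.println("JVM arguments: " + jvmArguments());
    }
}
```

If the reported limit is far below 15000 MB, the wrapper settings are probably not being applied to the process that Neo4j.bat starts. Note also that memory usage shown by the operating system can stay below the limit simply because the heap has not filled up yet.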

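I have attached my code and configuration below. Does anyone know what I am doing wrong? What performance should I expect when inserting a large graph?

Insert code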

for (Thing t : things) {
    List<ValuePair> properties = parseThing(t);
    String uid = createUid(t);

    try (Transaction tx = graphDb.beginTx()) {

        Node node = graphDb.createNode();
        node.setProperty("uid", uid);

        for (ValuePair vp : properties) {
            node.setProperty(vp.getName(), vp.getValue());
        }

        tx.success();
    }
}

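(At first I added a DynamicLabel when creating each node, but that was even slower. Is it possible to use labels and still get good insert performance?)

Configuration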

neo4j.properties

################################################################
# Neo4j
#
# neo4j.properties - database tuning parameters
#
################################################################

# Enable this to be able to upgrade a store from an older version.
#allow_store_upgrade=true

# The amount of memory to use for mapping the store files, in bytes (or
# kilobytes with the 'k' suffix, megabytes with 'm' and gigabytes with 'g').
# If Neo4j is running on a dedicated server, then it is generally recommended
# to leave about 2-4 gigabytes for the operating system, give the JVM enough
# heap to hold all your transaction state and query context, and then leave the
# rest for the page cache.
# The default page cache memory assumes the machine is dedicated to running
# Neo4j, and is heuristically set to 75% of RAM minus the max Java heap size.
dbms.pagecache.memory=4g

# Enable this to specify a parser other than the default one.
#cypher_parser_version=2.0

# Keep logical logs, helps debugging but uses more disk space, enabled for
# legacy reasons. To limit space needed to store historical logs use values such
# as: "7 days" or "100M size" instead of "true".
#keep_logical_logs=7 days

# Autoindexing

# Enable auto-indexing for nodes, default is false.
#node_auto_indexing=true

# The node property keys to be auto-indexed, if enabled.
#node_keys_indexable=name,age

# Enable auto-indexing for relationships, default is false.
#relationship_auto_indexing=true

# The relationship property keys to be auto-indexed, if enabled.
#relationship_keys_indexable=name,age

# Enable shell server so that remote clients can connect via Neo4j shell.
#remote_shell_enabled=true
# The network interface IP the shell will listen on (use 0.0.0.0 for all interfaces).
#remote_shell_host=127.0.0.1
# The port the shell will listen on, default is 1337.
#remote_shell_port=1337

# The type of cache to use for nodes and relationships.
cache_type=hpc

cache.memory_ratio=70

# Maximum size of the heap memory to dedicate to the cached nodes.
node_cache_size=2g
#relationship_cache_size=6g

# Maximum size of the heap memory to dedicate to the cached relationships.
#relationship_cache_size=

# Enable online backups to be taken from this database.
online_backup_enabled=true

# Port to listen to for incoming backup requests.
online_backup_server=127.0.0.1:6362


# Uncomment and specify these lines for running Neo4j in High Availability mode.
# See the High availability setup tutorial for more details on these settings
# http://neo4j.com/docs/2.2.0/ha-setup-tutorial.html

# ha.server_id is the number of each instance in the HA cluster. It should be
# an integer (e.g. 1), and should be unique for each cluster instance.
#ha.server_id=

# ha.initial_hosts is a comma-separated list (without spaces) of the host:port
# where the ha.cluster_server of all instances will be listening. Typically
# this will be the same for all cluster instances.
#ha.initial_hosts=192.168.0.1:5001,192.168.0.2:5001,192.168.0.3:5001

# IP and port for this instance to listen on, for communicating cluster status
# information with other instances (also see ha.initial_hosts). The IP
# must be the configured IP address for one of the local interfaces.
#ha.cluster_server=192.168.0.1:5001

# IP and port for this instance to listen on, for communicating transaction
# data with other instances (also see ha.initial_hosts). The IP
# must be the configured IP address for one of the local interfaces.
#ha.server=192.168.0.1:6001

# The interval at which slaves will pull updates from the master. Comment out
# the option to disable periodic pulling of updates. Unit is seconds.
ha.pull_interval=10

# Amount of slaves the master will try to push a transaction to upon commit
# (default is 1). The master will optimistically continue and not fail the
# transaction even if it fails to reach the push factor. Setting this to 0 will
# increase write performance when writing through master but could potentially
# lead to branched data (or loss of transaction) if the master goes down.
#ha.tx_push_factor=1

# Strategy the master will use when pushing data to slaves (if the push factor
# is greater than 0). There are two options available "fixed" (default) or
# "round_robin". Fixed will start by pushing to slaves ordered by server id
# (highest first) improving performance since the slaves only have to cache up
# one transaction at a time.
#ha.tx_push_strategy=fixed

# Policy for how to handle branched data.
#branched_data_policy=keep_all

# Clustering timeouts
# Default timeout.
#ha.default_timeout=5s

# How often heartbeat messages should be sent. Defaults to ha.default_timeout.
#ha.heartbeat_interval=5s

# Timeout for heartbeats between cluster members. Should be at least twice that of ha.heartbeat_interval.
#heartbeat_timeout=11s

neo4j-server.properties

################################################################
# Neo4j
#
# neo4j-server.properties - runtime operational settings
#
################################################################

#***************************************************************
# Server configuration
#***************************************************************

# location of the database directory
org.neo4j.server.database.location=data/graph.db

# Low-level graph engine tuning file
org.neo4j.server.db.tuning.properties=conf/neo4j.properties

# Database mode
# Allowed values:
# HA - High Availability
# SINGLE - Single mode, default.
# To run in High Availability mode, configure the neo4j.properties config file, then uncomment this line:
#org.neo4j.server.database.mode=HA

# Let the webserver only listen on the specified IP. Default is localhost (only
# accept local connections). Uncomment to allow any connection. Please see the
# security section in the neo4j manual before modifying this.
#org.neo4j.server.webserver.address=0.0.0.0

# Require (or disable the requirement of) auth to access Neo4j
dbms.security.auth_enabled=true

#
# HTTP Connector
#

# http port (for all data, administrative, and UI access)
org.neo4j.server.webserver.port=7474

#
# HTTPS Connector
#

# Turn https-support on/off
org.neo4j.server.webserver.https.enabled=true

# https port (for all data, administrative, and UI access)
org.neo4j.server.webserver.https.port=7473

# Certificate location (auto generated if the file does not exist)
org.neo4j.server.webserver.https.cert.location=conf/ssl/snakeoil.cert

# Private key location (auto generated if the file does not exist)
org.neo4j.server.webserver.https.key.location=conf/ssl/snakeoil.key

# Internally generated keystore (don't try to put your own
# keystore there, it will get deleted when the server starts)
org.neo4j.server.webserver.https.keystore.location=data/keystore

# Comma separated list of JAX-RS packages containing JAX-RS resources, one
# package name for each mountpoint. The listed package names will be loaded
# under the mountpoints specified. Uncomment this line to mount the
# org.neo4j.examples.server.unmanaged.HelloWorldResource.java from
# neo4j-server-examples under /examples/unmanaged, resulting in a final URL of
# http://localhost:7474/examples/unmanaged/helloworld/{nodeId}
#org.neo4j.server.thirdparty_jaxrs_classes=org.neo4j.examples.server.unmanaged=/examples/unmanaged

org.neo4j.server.thirdparty_jaxrs_classes=my.project.package=/mypath

#*****************************************************************
# HTTP logging configuration
#*****************************************************************

# HTTP logging is disabled. HTTP logging can be enabled by setting this
# property to 'true'.
org.neo4j.server.http.log.enabled=false

# Logging policy file that governs how HTTP log output is presented and
# archived. Note: changing the rollover and retention policy is sensible, but
# changing the output format is less so, since it is configured to use the
# ubiquitous common log format
org.neo4j.server.http.log.config=conf/neo4j-http-logging.xml

#*****************************************************************
# Administration client configuration
#*****************************************************************

# location of the servers round-robin database directory. possible values:
# - absolute path like /var/rrd
# - path relative to the server working directory like data/rrd
# - commented out, will default to the database data directory.
org.neo4j.server.webadmin.rrdb.location=data/rrd

neo4j-wrapper.conf

#********************************************************************
# Property file references
#********************************************************************

wrapper.java.additional=-Dorg.neo4j.server.properties=conf/neo4j-server.properties
wrapper.java.additional=-Djava.util.logging.config.file=conf/logging.properties
wrapper.java.additional=-Dlog4j.configuration=file:conf/log4j.properties

#********************************************************************
# JVM Parameters
#********************************************************************

wrapper.java.additional.1=-XX:+UseConcMarkSweepGC
wrapper.java.additional.2=-XX:+CMSClassUnloadingEnabled
wrapper.java.additional.3=-XX:-OmitStackTraceInFastThrow
wrapper.java.additional.4=-XX:hashCode=5

# Remote JMX monitoring, uncomment and adjust the following lines as needed.
# Also make sure to update the jmx.access and jmx.password files with appropriate permission roles and passwords,
# the shipped configuration contains only a read only role called 'monitor' with password 'Neo4j'.
# For more details, see: http://download.oracle.com/javase/7/docs/technotes/guides/management/agent.html
# On Unix based systems the jmx.password file needs to be owned by the user that will run the server,
# and have permissions set to 0600.
# For details on setting these file permissions on Windows see:
#     http://docs.oracle.com/javase/7/docs/technotes/guides/management/security-windows.html
#wrapper.java.additional=-Dcom.sun.management.jmxremote.port=3637
#wrapper.java.additional=-Dcom.sun.management.jmxremote.authenticate=true
#wrapper.java.additional=-Dcom.sun.management.jmxremote.ssl=false
#wrapper.java.additional=-Dcom.sun.management.jmxremote.password.file=conf/jmx.password
#wrapper.java.additional=-Dcom.sun.management.jmxremote.access.file=conf/jmx.access

# Some systems cannot discover host name automatically, and need this line configured:
#wrapper.java.additional=-Djava.rmi.server.hostname=$THE_NEO4J_SERVER_HOSTNAME

# Uncomment the following lines to enable garbage collection logging
#wrapper.java.additional=-Xloggc:data/log/neo4j-gc.log
#wrapper.java.additional=-XX:+PrintGCDetails
#wrapper.java.additional=-XX:+PrintGCDateStamps
#wrapper.java.additional=-XX:+PrintGCApplicationStoppedTime
#wrapper.java.additional=-XX:+PrintPromotionFailure
#wrapper.java.additional=-XX:+PrintTenuringDistribution

# Java Heap Size: by default the Java heap size is dynamically
# calculated based on available system resources.
# Uncomment these lines to set specific initial and maximum
# heap size in MB.
wrapper.java.initmemory=15000
wrapper.java.maxmemory=15000

#********************************************************************
# Wrapper settings
#********************************************************************
# path is relative to the bin dir
wrapper.pidfile=../data/neo4j-server.pid

#********************************************************************
# Wrapper Windows NT/2000/XP Service Properties
#********************************************************************
# WARNING - Do not modify any of these properties when an application
#  using this configuration file has been installed as a service.
#  Please uninstall the service before modifying this section.  The
#  service can then be reinstalled.

# Name of the service
wrapper.name=neo4j

# User account to be used for linux installs. Will default to current
# user if not set.
wrapper.user=

#********************************************************************
# Other Neo4j system properties
#********************************************************************
wrapper.java.additional=-Dneo4j.ext.udc.source=zip

wrapper.java.additional=-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005 -Xdebug-Xnoagent-Djava.compiler=NONE-Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=5005

I would be very grateful if you could help me solve this!

Answer 1

You need to create multiple nodes per transaction; otherwise the transaction overhead will consume most of the time.

Try it this way:

try (Transaction tx = graphDb.beginTx()) {

    for (Thing t : things) {

        List<ValuePair> properties = parseThing(t);
        String uid = createUid(t);

        Node node = graphDb.createNode();
        node.setProperty("uid", uid);

        for (ValuePair vp : properties) {
            node.setProperty(vp.getName(), vp.getValue());
        }
    }

    tx.success();
}

Answer 2

Many thanks to Christian Morgner and Michael Hunger for pointing me in the right direction!

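The solution was to split the list, use smaller transactions, and use threads. I first add all the nodes and then add all the relationships. You can experiment with the batch sizes; the best value depends on your graph.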
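The list splitting can be isolated in a small helper. This is a minimal sketch using only the JDK; the class name Batches and the method partition are hypothetical names chosen for illustration. It uses Math.min to clamp the last batch, which avoids a separate branch for the final, shorter sublist.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: splits a list into consecutive batches.
final class Batches {

    /** Returns sublist views of at most batchSize elements each, in order. */
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int from = 0; from < items.size(); from += batchSize) {
            // Clamp the end index so the last batch may be shorter.
            int to = Math.min(from + batchSize, items.size());
            batches.add(items.subList(from, to));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> things = Arrays.asList(1, 2, 3, 4, 5, 6, 7);
        System.out.println(partition(things, 3)); // prints [[1, 2, 3], [4, 5, 6], [7]]
    }
}
```

Each batch is a view onto the original list, so the list should not be modified structurally while the batches are being processed.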

Here is my code (simplified):

Main

public static final int CPU = Runtime.getRuntime().availableProcessors()*2;
public static final int BATCH_NODES = 100_000;
public static final int BATCH_RELATIONS = 50_000;


ExecutorService pool = createPool(CPU, CPU * 25);

for(int i = 0; i < things.size(); i = i + BATCH_NODES) {
    CreateNodeRunner nodeRunner;
    if(i + BATCH_NODES < things.size()) {
        nodeRunner = new CreateNodeRunner(graphDb, things.subList(i, i + BATCH_NODES));
    } else {
        nodeRunner = new CreateNodeRunner(graphDb, things.subList(i, things.size()));
    }

    pool.submit(nodeRunner);
}
pool.shutdown();

boolean nodesCreated = false;
try {
        nodesCreated = pool.awaitTermination(1, TimeUnit.DAYS);
} catch (InterruptedException e) {
        logger.debug("CreateNodeThread was interrupted");
        logger.debug(e.getMessage());
}

if(nodesCreated) {

        pool = createPool(CPU, CPU * 25);

        for(int i = 0; i < things.size(); i=i+ BATCH_RELATIONS) {
            CreateRelationsRunner relationsRunner;
            if(i+ BATCH_RELATIONS < things.size()) {
                relationsRunner = new CreateRelationsRunner(graphDb, things.subList(i, i+ BATCH_RELATIONS));
            } else {
                relationsRunner = new CreateRelationsRunner(graphDb, things.subList(i, things.size()));
            }

            pool.submit(relationsRunner);
        }
        pool.shutdown();
}
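The createPool helper is not shown in the code above. The following is only a guess at what such a helper might look like, assuming its two arguments are the number of worker threads and the capacity of the task queue. It uses the standard ThreadPoolExecutor with a bounded queue and CallerRunsPolicy, so that the submitting thread runs a task itself when the queue is full instead of failing. The class name Pools and the runDemo method are hypothetical.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of a createPool helper; not the answer's actual code.
final class Pools {

    /** Fixed number of threads, bounded queue, caller runs tasks on overflow. */
    static ExecutorService createPool(int threads, int queueSize) {
        return new ThreadPoolExecutor(
                threads, threads,
                10, TimeUnit.SECONDS,
                new ArrayBlockingQueue<Runnable>(queueSize),
                new ThreadPoolExecutor.CallerRunsPolicy());
    }

    /** Submits trivial tasks and returns how many completed. */
    static int runDemo(int tasks) {
        ExecutorService pool = createPool(2, 4);
        AtomicInteger done = new AtomicInteger();
        for (int i = 0; i < tasks; i++) {
            pool.submit((Runnable) done::incrementAndGet);
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return done.get();
    }

    public static void main(String[] args) {
        System.out.println("Completed tasks: " + runDemo(100));
    }
}
```

A bounded queue keeps the number of pending batches small, which limits how much data waits in memory before a worker picks it up.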

CreateNodeRunner.java

public class CreateNodeRunner implements Runnable {

    private List<Thing> things;
    private GraphDatabaseService graphDb;

    public CreateNodeRunner(GraphDatabaseService graphDb, List<Thing> things) {
        this.things = things;
        this.graphDb = graphDb;
    }

    @Override
    public void run() {

        try (Transaction tx = graphDb.beginTx()) {

            for(Thing t : things) {
                Node node = graphDb.createNode(t.getLabel());
                node.setProperty("uid", t.getUid());

                for (ValuePair vp : t.getProperties()) {
                    node.setProperty(vp.getName(), vp.getValue());
                }
            }
            tx.success();
        }
    }
}

CreateRelationsRunner.java

public class CreateRelationsRunner implements Runnable {

    private GraphDatabaseService graphDb;
    private List<Thing> things;

    public CreateRelationsRunner(GraphDatabaseService graphDb, List<Thing> things) {
        this.graphDb = graphDb;
        this.things = things;
    }

    @Override
    public void run() {

        try (Transaction tx = graphDb.beginTx()) {
            for(Thing tFrom : things) {

                List<ValuePair> relations = tFrom.getRelations();

                Label label = tFrom.getLabel();
                Node firstNode = graphDb.findNode(label, "uid", tFrom.getUid());

                for(ValuePair vp : relations) {
                    Thing tTo = (Thing) vp.getValue();

                    label = tTo.getLabel();
                    Node secondNode = graphDb.findNode(label, "uid", tTo.getUid());

                    RelationshipType relType = vp.getRelationshipType();
                    firstNode.createRelationshipTo(secondNode, relType);

                }
            }

            tx.success();
        }

    }
}
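One possible improvement: a task handed to pool.submit(...) does not report an exception unless the returned Future is inspected, so a failed batch (and its rolled-back transaction) can go unnoticed. The relationship phase above also calls shutdown() without waiting for the tasks to finish. The sketch below, which uses only the JDK and a hypothetical class named FailureCheck, keeps the Future objects and waits on each one so that failures become visible.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper: waits for all batches and counts the ones that threw.
final class FailureCheck {

    /** Submits every task, blocks until each finishes, returns the failure count. */
    static int runAndCountFailures(ExecutorService pool, List<Runnable> tasks) {
        List<Future<?>> futures = new ArrayList<>();
        for (Runnable task : tasks) {
            futures.add(pool.submit(task));
        }
        int failures = 0;
        for (Future<?> future : futures) {
            try {
                future.get(); // rethrows a task's exception wrapped in ExecutionException
            } catch (ExecutionException e) {
                failures++;
                System.err.println("Batch failed: " + e.getCause());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return failures;
    }

    /** Runs three stand-in batches, one of which fails on purpose. */
    static int demo() {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        List<Runnable> tasks = new ArrayList<>();
        tasks.add(() -> { });
        tasks.add(() -> { throw new IllegalStateException("simulated batch failure"); });
        tasks.add(() -> { });
        try {
            return runAndCountFailures(pool, tasks);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("Failed batches: " + demo());
    }
}
```

In the real import, the stand-in lambdas would be the CreateNodeRunner and CreateRelationsRunner instances, and a failed batch could be logged or retried.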

If you find errors or see possible improvements, please let me know. :)

2 answers

Answer 1
2 2015-07-10 08:23:33

Answer 2
2 accepted 2015-09-22 14:36:01
