
Inserting a large number of nodes into Neo4j

I have a table stored in a typical MySQL database, and I've built a small parser tool in Java to parse it out and build a Neo4j database. This database will have ~40 million nodes, each with one or more edges (with a possible maximum of 10 edges). The problem comes from the way I have to create certain nodes. There are three node types: user, comment, and hashtag. The user nodes and hashtag nodes must each be unique. I'm using code from the following example to ensure uniqueness:

import java.util.Map;

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.index.UniqueFactory;

public Node getOrCreateUserWithUniqueFactory( String username, GraphDatabaseService graphDb )
{
    // Backed by the "users" node index; getOrCreate() is atomic per key/value pair
    UniqueFactory<Node> factory = new UniqueFactory.UniqueNodeFactory( graphDb, "users" )
    {
        @Override
        protected void initialize( Node created, Map<String, Object> properties )
        {
            // Only called when the node did not already exist
            created.setProperty( "name", properties.get( "name" ) );
        }
    };

    return factory.getOrCreate( "name", username );
}
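
For context, a minimal usage sketch (the surrounding transaction and the example username are illustrative, not part of my actual code; requires org.neo4j.graphdb.Transaction):

Transaction tx = graphDb.beginTx();
try
{
    // getOrCreate() returns the existing node or atomically creates it
    Node user = getOrCreateUserWithUniqueFactory( "alice", graphDb );
    tx.success();
}
finally
{
    tx.finish(); // Neo4j 1.x API; 2.x replaces this with tx.close()
}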

I have thought about using the batch inserter, but I haven't seen a way to check whether a node is unique while performing a batch insert. So my question is: what is the fastest way to insert all these nodes while still ensuring that they retain their uniqueness? Any help would, as always, be greatly appreciated.

In case anyone else runs into this problem, I want to document what a coworker and I were able to figure out in order to increase speed. First off, a note or two about the data:

  • There were a large number of users; they accounted for roughly 30% of the nodes
  • There were also a large number of hashtags, as people will tend to hash just about anything
  • Both of these had to be guaranteed unique

Now that that's out of the way, on to the optimizations. First and foremost, you need to ensure that a transaction completes each time a record is inserted. There were no real examples of this for us to look at, so initially the code looked like this (sketched against the Neo4j 1.x transaction API):

Transaction tx = graphDb.beginTx();
try
{
    while ( record.next() )
    {
        // parse record
        // create unique user
        // create unique hashtag
        // create comment
        // insert into graph
    }
    tx.success();
}
finally
{
    tx.finish();
}

While this worked OK and finished relatively quickly for small datasets, it didn't scale well. So we took a look at the purpose of each step and refactored the code to look like the following:

while ( record.next() )
{
    Transaction tx = graphDb.beginTx();
    try
    {
        // parse record
        // create unique user
        // create unique hashtag
        // create comment
        // insert into graph

        tx.success();
    }
    finally
    {
        tx.finish();
    }
}

This greatly sped things up, but it wasn't enough for my coworker. So he found that Lucene indexes could be created on node attributes and that we could reference those in the UniqueNodeFactory. This gave us another significant speed boost, so much so that we could insert 1,000,000 nodes in ~10 seconds without resorting to the batch inserter. Thanks to everyone for their help.
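
For anyone curious, a minimal sketch of that approach, assuming the Neo4j 1.x IndexManager API and the UniqueNodeFactory( Index<Node> ) constructor (the index name and configuration here are illustrative, not our exact code):

import java.util.Map;

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.index.Index;
import org.neo4j.graphdb.index.IndexManager;
import org.neo4j.graphdb.index.UniqueFactory;
import org.neo4j.helpers.collection.MapUtil;

// Create (or open) a Lucene-backed node index configured for exact matching;
// index creation itself must happen inside a transaction
Index<Node> userIndex = graphDb.index().forNodes( "users",
        MapUtil.stringMap( IndexManager.PROVIDER, "lucene", "type", "exact" ) );

// Hand the pre-configured index to the factory instead of just an index name
UniqueFactory<Node> userFactory = new UniqueFactory.UniqueNodeFactory( userIndex )
{
    @Override
    protected void initialize( Node created, Map<String, Object> properties )
    {
        created.setProperty( "name", properties.get( "name" ) );
    }
};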

Why not create a local cache during the batch insert? You can use a Java Map with the name as the key and the node ID (from the batch inserter) as the value.

Usually it is simplest to just keep them in a HashMap. You won't have that many distinct users and tags, after all.
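
A rough sketch of the cache idea against the Neo4j 1.9-era BatchInserter API (the store path, property names, and record handling are illustrative assumptions):

import java.util.HashMap;
import java.util.Map;

import org.neo4j.helpers.collection.MapUtil;
import org.neo4j.unsafe.batchinsert.BatchInserter;
import org.neo4j.unsafe.batchinsert.BatchInserters;

BatchInserter inserter = BatchInserters.inserter( "target/graph.db" );

// Local cache: username -> node id, so each unique user is created only once
Map<String, Long> userCache = new HashMap<String, Long>();

String username = "alice"; // would come from the parsed MySQL record
Long userId = userCache.get( username );
if ( userId == null )
{
    userId = inserter.createNode( MapUtil.map( "name", username ) );
    userCache.put( username, userId );
}
// ... create the comment node and relationships against userId ...

inserter.shutdown(); // flushes the store to disk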

You can also use a LuceneBatchInserterIndex and call setCacheCapacity.

See: http://docs.neo4j.org/chunked/milestone/batchinsert.html#indexing-batchinsert
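
That looks roughly like the following, adapted from the batch-insertion indexing example in the linked manual (index names, cache size, and store path are illustrative; package locations vary across Neo4j versions):

import java.util.Map;

import org.neo4j.helpers.collection.MapUtil;
import org.neo4j.unsafe.batchinsert.BatchInserter;
import org.neo4j.unsafe.batchinsert.BatchInserterIndex;
import org.neo4j.unsafe.batchinsert.BatchInserterIndexProvider;
import org.neo4j.unsafe.batchinsert.BatchInserters;
import org.neo4j.unsafe.batchinsert.LuceneBatchInserterIndexProvider;

BatchInserter inserter = BatchInserters.inserter( "target/graph.db" );
BatchInserterIndexProvider indexProvider = new LuceneBatchInserterIndexProvider( inserter );

BatchInserterIndex users = indexProvider.nodeIndex( "users", MapUtil.stringMap( "type", "exact" ) );
// Keep index lookups for the "name" key in memory; size to the expected user count
users.setCacheCapacity( "name", 1000000 );

Map<String, Object> properties = MapUtil.map( "name", "alice" );
long userId = inserter.createNode( properties );
users.add( userId, properties );

// flush() makes pending additions visible to get()/query()
users.flush();
Long found = users.get( "name", "alice" ).getSingle();

indexProvider.shutdown();
inserter.shutdown();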
