I have a table stored in a typical MySQL database, and I've built a small parser tool in Java to parse it out and build a Neo4j database. This database will have ~40 million nodes, each with one or more edges (up to a maximum of 10 edges). The problem comes from the way I have to create certain nodes. There are user nodes, comment nodes, and hashtag nodes. The user nodes and hashtag nodes must each be unique. I'm using code from the following example to ensure uniqueness:
public Node getOrCreateUserWithUniqueFactory( String username, GraphDatabaseService graphDb )
{
    UniqueFactory<Node> factory = new UniqueFactory.UniqueNodeFactory( graphDb, "users" )
    {
        @Override
        protected void initialize( Node created, Map<String, Object> properties )
        {
            created.setProperty( "name", properties.get( "name" ) );
        }
    };

    return factory.getOrCreate( "name", username );
}
I have thought about using the batch inserter, but I haven't seen a way to check whether a node is unique while performing a batch insert. So my question is: what is the fastest way to insert all these nodes while still ensuring their uniqueness? Any help would, as always, be greatly appreciated.
In case anyone else here runs into this problem, I want to document what a coworker and I were able to figure out in order to increase speed. First off, a note or two about the data:
Now that that's out of the way, on to the optimizations. First and foremost, you need to ensure that a transaction completes on each pass through your insert loop. There were no real examples of this for us to look at, so initially the code looked like this (pseudocode):
Transaction begin
While(record.next()){
    parse record
    create unique user
    create unique hashtag
    create comment
    insert into graph
}
Transaction success
Transaction finish
While this worked OK and finished relatively quickly for small datasets, it didn't scale well. So we took a look at the purpose of each function and refactored the code to look like the following:
While(record.next()){
    Transaction begin
    parse record
    create unique user
    create unique hashtag
    create comment
    insert into graph
    Transaction success
    Transaction finish
}
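In compilable Java, the per-record transaction pattern above looks roughly like this. It's only a sketch: the Tx class is a hypothetical stand-in for Neo4j's org.neo4j.graphdb.Transaction (so the example runs without the library), and in real code you would obtain the transaction from graphDb.beginTx() and do the parse/create/insert work inside the try block.

```java
import java.util.Arrays;
import java.util.List;

public class PerRecordTx {
    // Stand-in for Neo4j's Transaction: success() marks the unit of work
    // as committable, close() would commit (or roll back) the transaction.
    static class Tx implements AutoCloseable {
        private boolean ok = false;
        void success() { ok = true; }
        @Override public void close() { /* commit if ok, else roll back */ }
    }

    // Each record gets its own transaction, so transaction state is flushed
    // every iteration instead of accumulating across the whole dataset.
    static int load(List<String> records) {
        int committed = 0;
        for (String record : records) {
            try (Tx tx = new Tx()) {
                // parse record; get-or-create user and hashtag; create the
                // comment node and its edges (Neo4j-specific calls omitted)
                tx.success();
                committed++;
            }
        }
        return committed;
    }

    public static void main(String[] args) {
        System.out.println(load(Arrays.asList("r1", "r2", "r3"))); // 3
    }
}
```

The point of the refactor is only where the transaction boundary sits: one commit per record rather than one giant commit at the end.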
This greatly sped things up, but it wasn't enough for my coworker. So he found that Lucene indexes could be created on node attributes, and that we could reference those in the unique node factory. This gave us another significant speed boost: so much so that we could insert 1,000,000 nodes in ~10 seconds without resorting to the batch loader. Thanks to everyone for their help.
Why not create a local cache during the batch insert? You can use a Java Map with the name as key and the node id (from the batch inserter) as value. Usually it is simplest to just keep them in a HashMap; you won't have that many users and tags, after all.
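A minimal sketch of that cache, with createNode as a hypothetical stand-in for BatchInserter.createNode(properties) (which in the real API returns the new node's id):

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class UserNodeCache {
    private final Map<String, Long> idsByName = new HashMap<>();
    private long nextId = 0; // stand-in for the ids the real inserter returns

    // Hypothetical stand-in for BatchInserter.createNode(properties).
    private long createNode(Map<String, Object> properties) {
        return nextId++;
    }

    // Check the cache first; only hit the inserter on a miss, so each
    // distinct user is created exactly once.
    public long getOrCreateUser(String name) {
        return idsByName.computeIfAbsent(
                name, n -> createNode(Collections.singletonMap("name", n)));
    }
}
```

Calling getOrCreateUser("alice") twice returns the same id. Even with ~40 million comment nodes, the number of distinct users and hashtags is far smaller, so the map should fit comfortably in memory.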
You can also use the LuceneBatchInserterIndex and call setCacheCapacity; see: http://docs.neo4j.org/chunked/milestone/batchinsert.html#indexing-batchinsert