
How can I efficiently create unique relationships in Neo4j?

Following up on my question here, I would like to create a constraint on relationships. That is, there should be multiple nodes that share the same "neighborhood" name, but each should point to exactly one city in which it resides.

As encouraged in user2194039's answer, I am using the following index:

CREATE INDEX ON :Neighborhood(name)

Also, I have the following constraint:

CREATE CONSTRAINT ON (c:City) ASSERT c.name IS UNIQUE;

The following query fails to create unique relationships and takes an excessively long time to run:

USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "file://THEFILE" as line
WITH line
WHERE line.Neighborhood IS NOT NULL
WITH line
MATCH (c:City { name : line.City})
MERGE (c)<-[:IN]-(n:Neighborhood {name : toInt(line.Neighborhood)});

Note that there is a uniqueness constraint on City, but NOT on Neighborhood (because there should be multiple ones).

Profile with Limit 10,000:

+--------------+-------+--------+----------------------------------+------------------------+
|     Operator |  Rows | DbHits |                      Identifiers |                  Other |
+--------------+-------+--------+----------------------------------+------------------------+
|  EmptyResult |     0 |      0 |                                  |                        |
|  UpdateGraph |  9750 |   3360 | anon[307], b, neighborhood, line |           MergePattern |
|  SchemaIndex |  9750 |  19500 |                          b, line | line.City; :City(name) |
| ColumnFilter |  9750 |      0 |                             line |      keep columns line |
|       Filter |  9750 |      0 |                  anon[220], line |              anon[220] |
|      Extract | 10000 |      0 |                  anon[220], line |              anon[220] |
|        Slice | 10000 |      0 |                             line |           {  AUTOINT0} |
|      LoadCSV | 10000 |      0 |                             line |                        |
+--------------+-------+--------+----------------------------------+------------------------+

Total database accesses: 22860

Following Guilherme's recommendation below, I implemented the helper, but it raises the error py2neo.error.Finished. I've searched the documentation and wasn't able to determine a workaround for this. It looks like there's an open SO post about this exception.

def run_batch_query(queries, timeout=None):
    if timeout:
        http.socket_timeout = timeout
    try:
        graph = Graph()
        authenticate("localhost:7474", "account", "password")
        tx = graph.cypher.begin()
        for query in queries:
            statement, params = query
            tx.append(statement, params)
            results = tx.process()
            tx.commit()
    except http.SocketError as err:
        raise err
    except error.Finished as err:
        raise err
    collection = []
    for result in results:
        records = []
        for record in result:
            records.append(record)
        collection.append(records)
    return collection

main:

queries = []
template = ["MERGE (city:City {Name:{city}})", "Merge (city)<-[:IN]-(n:Neighborhood {Name : {neighborhood}})"]
statement = '\n'.join(template)
batch = 5000
c = 1
start = time.time()

# city_neighborhood_map is a defaultdict that maps city-> set of neighborhoods
for city, neighborhoods in city_neighborhood_map.iteritems():
    for neighborhood in neighborhoods:
        params = dict(city=city, neighborhood=neighborhood)
        queries.append((statement, params))
        c += 1
        if c % batch == 0:
            print "running batch"
            print c
            s = time.time()*1000
            r = run_batch_query(queries, 10)
            e = time.time()*1000
            print("\t{0}, {1:.00f}ms".format(c, e-s))
            del queries[:]

print c
if queries:
    s = time.time()*1000 
    r = run_batch_query(queries, 300)
    e = time.time()*1000
    print("\t{0} {1:.00f}ms".format(c, e-s))
end = time.time()
print("End. {0}s".format(end-start))

If you want to create unique relationships, you have 2 options:

  1. Prevent the path from being duplicated using MERGE, just like @user2194039 suggested. I think this is the simplest and best approach you can take.

  2. Turn your relationship into a node and create a unique constraint on it (a rough sketch follows below), but it is rarely necessary.
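
As a rough Cypher sketch of option 2 (not from the original answer), the [:IN] relationship can be reified as its own node carrying a synthetic key, so that a uniqueness constraint can guard the city/neighborhood pair. The :Membership label, the key format and the hard-coded values are only illustrative:

// Schema statement, run on its own (cannot be mixed with data statements):
CREATE CONSTRAINT ON (m:Membership) ASSERT m.key IS UNIQUE;

// Then, per (city, neighborhood) pair:
MERGE (c:City {name: "Boston"})
MERGE (m:Membership {key: "Boston|3"})
MERGE (n:Neighborhood {name: 3})-[:VIA]->(m)
MERGE (m)-[:IN]->(c);

The uniqueness constraint then guarantees at most one membership node per pair, which a plain relationship cannot enforce on its own.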

If you're having trouble with speed, try using the transactional endpoint. I tried importing your data (random cities and neighbourhoods) through LOAD CSV in 2.2.1, and it was slow as well, though I am not sure why. If you send your queries with parameters to the transactional endpoint in batches of 1000-5000, you can monitor the process and probably gain a performance boost. I managed to import 1M rows in just under 11 minutes.

I used an INDEX for Neighbourhood(name) and a unique constraint for City(name). Give it a try and see if it works for you.
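
For reference, that schema setup is the same pair of statements already shown in the question:

CREATE INDEX ON :Neighborhood(name);
CREATE CONSTRAINT ON (c:City) ASSERT c.name IS UNIQUE;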

Edit:

The transactional endpoint is a RESTful endpoint that allows you to execute transactions in batches. You can read about it here. Basically, it allows you to stream a bunch of queries to the server at once.

I don't know what programming language/stack you're using, but in python, using a package like py2neo, it would be something like this:

with open("city.csv", "r") as fp:

    reader = csv.reader(fp)

    queries = []
    template = ["MERGE (c :`City` {name: {city}})",
                "MERGE (c)<-[:IN]-(n :`Neighborhood` {name: {neighborhood}})"]

    statement = '\n'.join(template)

    batch = 5000

    c = 1

    start = time.time()

    for row in reader:

        city, neighborhood = row

        params = dict(city=city, neighborhood=neighborhood)

        queries.append((statement, params))

        if c % batch == 0:

            s = time.time()*1000
            r = neo4j.run_batch_query(queries, 10)
            e = time.time()*1000
            print("\t{0}, {1:.00f}ms".format(c, e-s))
            del queries[:]

        c += 1

    if queries:

        s = time.time()*1000
        r = neo4j.run_batch_query(queries, 300)
        e = time.time()*1000
        print("\t{0} {1:.00f}ms".format(c, e-s))

    end = time.time()

    print("End. {0}s".format(end-start))

Helper functions:

# Imports assumed for py2neo 2.x (the answer does not state the py2neo version)
from py2neo import Graph
from py2neo.packages.httpstream import http

def run_batch_query(queries, timeout=None):
    if timeout:
        http.socket_timeout = timeout

    try:
        graph = Graph(uri)  # "{protocol}://{host}:{port}/db/data/"
        tx = graph.cypher.begin()

        # Queue every statement, then send the whole batch to the transactional endpoint
        for query in queries:
            statement, params = query
            tx.append(statement, params)

        results = tx.process()
        tx.commit()

    except http.SocketError as err:
        raise err

    # Collect the records of each statement's result
    collection = []
    for result in results:
        records = []
        for record in result:
            records.append(record)
        collection.append(records)

    return collection

This lets you monitor how long each transaction takes, and you can tweak the number of queries per transaction as well as the timeout.

To be sure we're on the same page, this is how I understand your model: Each city is unique and should have some number of neighborhoods pointing to it. The neighborhoods are unique within the context of a city, but not globally. So if you have a neighborhood 3 [IN] city Boston, you could also have a neighborhood 3 [IN] city Seattle, and both of those neighborhoods are represented by different nodes, even though they have the same name property. Is that correct?
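
As a quick illustration of that reading (values taken from the Boston/Seattle example above, not from the original answer), these MERGE statements produce two distinct :Neighborhood nodes that share the name 3, each attached to its own city:

MERGE (boston:City {name: "Boston"})
MERGE (seattle:City {name: "Seattle"})
MERGE (boston)<-[:IN]-(:Neighborhood {name: 3})
MERGE (seattle)<-[:IN]-(:Neighborhood {name: 3})

Because the city node is already bound when the neighborhood pattern is merged, each MERGE only looks for a neighborhood named 3 attached to that particular city, so the same name can repeat across cities.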

Before importing, I would recommend adding an index to your neighborhood nodes. You can add the index without enforcing uniqueness. I have found that this greatly increases speed, even on small databases.

CREATE INDEX ON :Neighborhood(name)

And for the import:

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file://THEFILE" as line
MERGE (c:City {name: line.City})
MERGE (c)<-[:IN]-(n:Neighborhood {name: toInt(line.Neighborhood)})

If you are importing a large amount of data, it may be best to use the USING PERIODIC COMMIT command to commit periodically while importing. This will reduce the memory used in the process, and if your server is memory-constrained, I could see it helping performance. In your case, with almost a million records, this is recommended by Neo4j. You can even adjust how often the commit happens by doing USING PERIODIC COMMIT 10000 or such. The docs say 1000 is the default. Just understand that this will break the import into several transactions.
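
For example, the same import with an explicit commit interval just adds the batch size to the first line (this repeats the query above with the 10000 figure used in the question):

USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "file://THEFILE" as line
MERGE (c:City {name: line.City})
MERGE (c)<-[:IN]-(n:Neighborhood {name: toInt(line.Neighborhood)})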

Best of luck!
