
How can I efficiently create unique relationships in Neo4j?

Following up on my question here, I would like to create a constraint on relationships. That is, I would like there to be multiple nodes that share the same "neighborhood" name, each uniquely pointing to the particular city in which it resides.

As encouraged in user2194039's answer, I am using the following index:

CREATE INDEX ON :Neighborhood(name)

Also, I have the following constraint:

CREATE CONSTRAINT ON (c:City) ASSERT c.name IS UNIQUE;

The following code fails to create unique relationships and takes an excessively long time:

USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "file://THEFILE" as line
WITH line
WHERE line.Neighborhood IS NOT NULL
WITH line
MATCH (c:City { name : line.City})
MERGE (c)<-[:IN]-(n:Neighborhood {name : toInt(line.Neighborhood)});

Note that there is a uniqueness constraint on City, but NOT on Neighborhood (because there should be multiple ones).

Profile with Limit 10,000:

+--------------+-------+--------+----------------------------------+------------------------+
|     Operator |  Rows | DbHits |                      Identifiers |                  Other |
+--------------+-------+--------+----------------------------------+------------------------+
|  EmptyResult |     0 |      0 |                                  |                        |
|  UpdateGraph |  9750 |   3360 | anon[307], b, neighborhood, line |           MergePattern |
|  SchemaIndex |  9750 |  19500 |                          b, line | line.City; :City(name) |
| ColumnFilter |  9750 |      0 |                             line |      keep columns line |
|       Filter |  9750 |      0 |                  anon[220], line |              anon[220] |
|      Extract | 10000 |      0 |                  anon[220], line |              anon[220] |
|        Slice | 10000 |      0 |                             line |            { AUTOINT0} |
|      LoadCSV | 10000 |      0 |                             line |                        |
+--------------+-------+--------+----------------------------------+------------------------+

Total database accesses: 22860
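
For reference, a plan like the one above can be reproduced by prefixing the query with PROFILE and limiting the rows. This is my own reconstruction (the exact statement used for the profile isn't shown, and PERIODIC COMMIT is left out here for simplicity):

PROFILE
LOAD CSV WITH HEADERS FROM "file://THEFILE" as line
WITH line LIMIT 10000
WHERE line.Neighborhood IS NOT NULL
MATCH (c:City { name : line.City})
MERGE (c)<-[:IN]-(n:Neighborhood {name : toInt(line.Neighborhood)});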

Following Guilherme's recommendation below, I implemented the helper, yet it is raising the error py2neo.error.Finished. I've searched the documentation and wasn't able to determine a workaround from this. It looks like there's an open SO post about this exception.

# Assumed imports for py2neo 2.x (not shown in the original snippet):
# from py2neo import Graph, authenticate, error
# from py2neo.packages.httpstream import http

def run_batch_query(queries, timeout=None):
    if timeout:
        http.socket_timeout = timeout
    try:
        graph = Graph()
        authenticate("localhost:7474", "account", "password")
        tx = graph.cypher.begin()
        for query in queries:
            statement, params = query
            tx.append(statement, params)
            # note: tx.process() and tx.commit() run inside the loop here, so the
            # transaction is finished after the first query; later iterations then
            # touch a finished transaction, a likely source of error.Finished
            results = tx.process()
            tx.commit()
    except http.SocketError as err:
        raise err
    except error.Finished as err:
        raise err
    collection = []
    for result in results:
        records = []
        for record in result:
            records.append(record)
        collection.append(records)
    return collection

main:

queries = []
template = ["MERGE (city:City {Name:{city}})", "Merge (city)<-[:IN]-(n:Neighborhood {Name : {neighborhood}})"]
statement = '\n'.join(template)
batch = 5000
c = 1
start = time.time()

# city_neighborhood_map is a defaultdict that maps city-> set of neighborhoods
for city, neighborhoods in city_neighborhood_map.iteritems():
    for neighborhood in neighborhoods:
        params = dict(city=city, neighborhood=neighborhood)
        queries.append((statement, params))
        c +=1
        if c % batch == 0:
            print "running batch"
            print c
            s = time.time()*1000
            r = run_batch_query(queries, 10)
            e = time.time()*1000
            print("\t{0}, {1:.00f}ms".format(c, e-s))
            del queries[:]

print c
if queries:
    s = time.time()*1000 
    r = run_batch_query(queries, 300)
    e = time.time()*1000
    print("\t{0} {1:.00f}ms".format(c, e-s))
end = time.time()
print("End. {0}s".format(end-start))

If you want to create unique relationships, you have 2 options:

  1. Prevent the path from being duplicated, using MERGE, just like @user2194039 suggested. I think this is the simplest and best approach you can take.

  2. Turn your relationship into a node, and create a unique constraint on it. But it's hardly necessary for most cases (a small sketch of this idea follows below).
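
To illustrate option 2, the sketch below reifies the city-neighborhood link as its own node with a composite key, so a uniqueness constraint can guard the pair. The :Link label, the relationship types and the key format are my own illustrative choices, not something from the original answer; run the constraint statement separately from the MERGE statements.

// Sketch only: :Link, :TO, :FROM and the key value are illustrative.
CREATE CONSTRAINT ON (l:Link) ASSERT l.key IS UNIQUE;

MERGE (c:City {name: "Boston"})
MERGE (n:Neighborhood {name: 3})
MERGE (l:Link {key: "Boston|3"})
MERGE (c)<-[:TO]-(l)
MERGE (l)<-[:FROM]-(n);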

If you're having trouble with speed, try using the transactional endpoint. I tried importing your data (random cities and neighbourhoods) through LOAD CSV in 2.2.1, and it was slow as well, though I am not sure why. If you send your queries with parameters to the transactional endpoint in batches of 1000-5000, you can monitor the process, and probably gain a performance boost. I managed to import 1M rows in just under 11 minutes.

I used an INDEX for Neighborhood(name) and a unique constraint for City(name). Give it a try and see if it works for you (a quick sanity check is sketched below).
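
After the import, a duplicate check along these lines (my own sketch, not part of the original answer) should come back empty if the MERGE kept each (city, neighborhood) pair unique:

// Sketch: list any (city, neighborhood) pair that ended up with more than one node.
MATCH (n:Neighborhood)-[:IN]->(c:City)
WITH c.name AS city, n.name AS neighborhood, count(*) AS copies
WHERE copies > 1
RETURN city, neighborhood, copies;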

Edit:

The transactional endpoint is a RESTful endpoint that allows you to execute transactions in batches. You can read about it here. Basically, it allows you to stream a bunch of queries to the server at once.

I don't know what programming language/stack you're using, but in Python, using a package like py2neo, it would be something like this:

import csv
import time

with open("city.csv", "r") as fp:

    reader = csv.reader(fp)

    queries = []
    template = ["MERGE (c :`City` {name: {city}})",
                "MERGE (c)<-[:IN]-(n :`Neighborhood` {name: {neighborhood}})"]

    statement = '\n'.join(template)

    batch = 5000
    c = 1
    start = time.time()

    for row in reader:
        city, neighborhood = row
        params = dict(city=city, neighborhood=neighborhood)
        queries.append((statement, params))

        if c % batch == 0:
            s = time.time()*1000
            # neo4j here is the module holding the run_batch_query helper below
            r = neo4j.run_batch_query(queries, 10)
            e = time.time()*1000
            print("\t{0}, {1:.00f}ms".format(c, e-s))
            del queries[:]

        c += 1

    if queries:
        s = time.time()*1000
        r = neo4j.run_batch_query(queries, 300)
        e = time.time()*1000
        print("\t{0} {1:.00f}ms".format(c, e-s))

    end = time.time()
    print("End. {0}s".format(end-start))

Helper functions:

# py2neo 2.x imports (assumed; the original snippet omits them)
from py2neo import Graph
from py2neo.packages.httpstream import http

def run_batch_query(queries, timeout=None):

    if timeout:
        http.socket_timeout = timeout

    try:
        graph = Graph(uri) # "{protocol}://{host}:{port}/db/data/"
        tx = graph.cypher.begin()

        for query in queries:
            statement, params = query
            tx.append(statement, params)

        # process and commit once, after all statements have been appended
        results = tx.process()
        tx.commit()

    except http.SocketError as err:
        raise err

    collection = []
    for result in results:
        records = []
        for record in result:
            records.append(record)
        collection.append(records)

    return collection

You will monitor how long each transaction takes, and you can tweak the number of queries per transaction, as well as the timeout.

To be sure we're on the same page, this is how I understand your model: Each city is unique and should have some number of neighborhoods pointing to it. The neighborhoods are unique within the context of a city, but not globally. So if you have a neighborhood 3 [IN] city Boston, you could also have a neighborhood 3 [IN] city Seattle, and both of those neighborhoods are represented by different nodes, even though they have the same name property. Is that correct?
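
To make that model concrete, here is a small sketch of that shape in Cypher (my own illustration, reusing the Boston/Seattle example above):

// Two Neighborhood nodes share the name 3 but belong to different cities.
MERGE (boston:City {name: "Boston"})
MERGE (seattle:City {name: "Seattle"})
MERGE (boston)<-[:IN]-(:Neighborhood {name: 3})
MERGE (seattle)<-[:IN]-(:Neighborhood {name: 3});

// Each city then reports exactly one Neighborhood node named 3.
MATCH (n:Neighborhood {name: 3})-[:IN]->(c:City)
RETURN c.name, count(n);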

Before importing, I would recommend adding an index to your neighborhood nodes. You can add the index without enforcing uniqueness. I have found that this greatly increases speed, even on small databases.

CREATE INDEX ON :Neighborhood(name)

And for the import:

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file://THEFILE" as line
MERGE (c:City {name: line.City})
MERGE (c)<-[:IN]-(n:Neighborhood {name: toInt(line.Neighborhood)})

If you are importing a large amount of data, it may be best to use the USING PERIODIC COMMIT command to commit periodically while importing. This will reduce the memory used in the process, and if your server is memory-constrained, I could see it helping performance. In your case, with almost a million records, this is recommended by Neo4j. You can even adjust how often the commit happens by doing USING PERIODIC COMMIT 10000 or such. The docs say 1000 is the default. Just understand that this will break the import into several transactions.
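
For example, the same import with an explicit commit interval would look like this (a straightforward combination of the statement above with the 10000 figure; adjust the number to taste):

USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "file://THEFILE" as line
MERGE (c:City {name: line.City})
MERGE (c)<-[:IN]-(n:Neighborhood {name: toInt(line.Neighborhood)})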

Best of luck!
