
Why does Neo4j hit every indexed record when only returning a count?

I am using version 3.0.3, and running my queries in the shell.

I have ~58 million record nodes with 4 properties each: an ID string, an epoch time integer, and lat/lon floats.

When I run a query like profile MATCH (r:record) RETURN count(r); I get a very quick response:

+----------+
| count(r) |
+----------+
| 58430739 |
+----------+
1 row
29 ms

Compiler CYPHER 3.0

Planner COST

Runtime INTERPRETED

+--------------------------+----------------+------+---------+-----------+--------------------------------+
| Operator                 | Estimated Rows | Rows | DB Hits | Variables | Other                          |
+--------------------------+----------------+------+---------+-----------+--------------------------------+
| +ProduceResults          |           7644 |    1 |       0 | count(r)  | count(r)                       |
| |                        +----------------+------+---------+-----------+--------------------------------+
| +NodeCountFromCountStore |           7644 |    1 |       0 | count(r)  | count( (:record) ) AS count(r) |
+--------------------------+----------------+------+---------+-----------+--------------------------------+

Total database accesses: 0

The Total database accesses: 0 line and the NodeCountFromCountStore operator tell me that Neo4j uses a counting mechanism here that avoids iterating over all the nodes.

However, when I run profile MATCH (r:record) WHERE r.time < 10000000000 RETURN count(r); I get a very slow response:

+----------+
| count(r) |
+----------+
| 58430739 |
+----------+
1 row
151278 ms

Compiler CYPHER 3.0

Planner COST

Runtime INTERPRETED

+-----------------------+----------------+----------+----------+-----------+------------------------------+
| Operator              | Estimated Rows | Rows     | DB Hits  | Variables | Other                        |
+-----------------------+----------------+----------+----------+-----------+------------------------------+
| +ProduceResults       |           1324 |        1 |        0 | count(r)  | count(r)                     |
| |                     +----------------+----------+----------+-----------+------------------------------+
| +EagerAggregation     |           1324 |        1 |        0 | count(r)  |                              |
| |                     +----------------+----------+----------+-----------+------------------------------+
| +NodeIndexSeekByRange |        1752922 | 58430739 | 58430740 | r         | :record(time) < {  AUTOINT0} |
+-----------------------+----------------+----------+----------+-----------+------------------------------+

Total database accesses: 58430740

The count is correct, as I chose a time value larger than all of my records. What surprises me here is that Neo4j is accessing EVERY single record. The profiler states that Neo4j is using the NodeIndexSeekByRange as an alternative method here.

My question is, why does Neo4j access EVERY record when all it is returning is a count? Are there no intelligent mechanisms inside the system to count a range of values after seeking the boundary/threshold value within the index?

I use Apache Solr for the same data, and returning a count after searching an index is extremely fast (about 5 seconds). If I recall correctly, both platforms are built on top of Apache Lucene. While I don't know much about that software internally, I would assume that the index support is fairly similar for both Neo4j and Solr.

I am working on a proxy service that will deliver results in a paginated form (using the SKIP n LIMIT m technique) by first getting a count, and then iterating over results in chunks. This works really well for Solr, but I am afraid that Neo4j may not perform well in this scenario.
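For concreteness, the paging scheme I have in mind looks roughly like the sketch below. It is only an illustration: the helper names `page_windows` and `page_query` are made up for this post, the `ORDER BY r.time` clause is an assumption I added so that pages come back in a stable order, and `{max_time}` is a query parameter placeholder rather than a value filled in here.

```python
def page_windows(total, page_size):
    """Yield (skip, limit) pairs that cover `total` rows in chunks of `page_size`."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    for skip in range(0, total, page_size):
        # The last window may be shorter than page_size.
        yield skip, min(page_size, total - skip)


def page_query(skip, limit):
    """Build the Cypher text for one page (hypothetical query shape)."""
    return (
        "MATCH (r:record) WHERE r.time < {max_time} "
        "RETURN r ORDER BY r.time "
        "SKIP " + str(skip) + " LIMIT " + str(limit)
    )


if __name__ == "__main__":
    # The total would come from the count query; 10 is a stand-in value.
    for skip, limit in page_windows(10, 4):
        print(page_query(skip, limit))
```

The count is needed up front only to know how many windows to request, which is why the cost of the count query matters so much for this design.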

Any thoughts?

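The latter query runs a NodeIndexSeekByRange operation. Neo4j seeks to the boundary in the :record(time) index, but it then still walks every index entry in the matching range and passes each node up to EagerAggregation, which counts them one row at a time. The count store behind NodeCountFromCountStore keeps totals per label, not per property range, so it cannot answer a count that carries a WHERE r.time < ... predicate.

Because your threshold is larger than every time value, the matching range covers all 58,430,739 nodes. That is why the DB hits equal the node count plus one, and why this query is so much slower than the plain label count.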
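The gap between the two approaches can be shown with a small conceptual sketch over a sorted list. This is not how Neo4j or Lucene is implemented; the function names and the sample data are invented for illustration. The first function mirrors what the profile shows (one visit per matching entry), while the second is the kind of boundary-based count the question asks about.

```python
from bisect import bisect_left


def count_by_iteration(sorted_times, threshold):
    """Count values below `threshold` by visiting each matching entry in turn."""
    count = 0
    for t in sorted_times:
        if t >= threshold:
            break  # sorted input: nothing further can match
        count += 1
    return count


def count_by_boundary(sorted_times, threshold):
    """Count values below `threshold` from the boundary position alone."""
    # bisect_left returns the index of the first value >= threshold,
    # which equals the number of values strictly below it.
    return bisect_left(sorted_times, threshold)


if __name__ == "__main__":
    times = list(range(0, 1_000_000, 10))  # 100,000 sorted sample values
    threshold = 10_000_000_000             # larger than every value, as in the question
    print(count_by_iteration(times, threshold))
    print(count_by_boundary(times, threshold))
```

Both functions return the same count, but the first does work proportional to the number of matches, while the second does a binary search whose cost grows only with the logarithm of the list size.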
