Cypher/Neo4j: How to match nodes that have relationship to all related nodes

Question

I'm trying to find out the number of users that have all the necessary skills to qualify for an occupation. Users can have many skills, and I want to return the count of qualified users per occupation.

Here's my current query:

  MATCH (:User)-[:has_skill]->(:Skill)<-[:requires]-(o:Occupation)
  WITH DISTINCT o
  MATCH (o)
  WITH o, SIZE((o)-[:requires]->()) AS occupation_skill_count
  MATCH (o)-[:requires]->(:Skill)<-[hs:has_skill]-(u:User)
  WITH o, u, occupation_skill_count, count(hs) AS user_skill_count
  WHERE occupation_skill_count = user_skill_count
  WITH o.title as occupation_title, count(u) as users_count
  RETURN occupation_title, users_count

However, I'm concerned that my query is not efficient, since it times out (there are over 60,000 occupations, 10,000 users, and 2,500 skills). I want to know if there's a better way to write this query.

My approach in writing this query is:

Match all the occupations that are connected to a user through a skill.
Count the number of required skills for each of those occupations.
Match all the users that are connected to those occupations through a skill, where the number of skills the user shares with the occupation equals the number of skills the occupation requires.

This seems to work in the staging environment, where there are far fewer records. However, it times out in production because there is too much data. Is there a better way to write this?

Answer 1

For performance issues, it helps to show the PROFILE plan of the query. If you could expand all elements of the plan and paste it into your description, that could help identify where the query can be improved.
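As a rough sketch of what that looks like: PROFILE is simply prefixed to the query. The query below is a shortened form of the one from the question (the trimming is my own, for illustration only). Keep in mind that PROFILE actually executes the query, so on the production data it may hit the same timeout; swapping PROFILE for EXPLAIN shows the planned operators without running anything.

```cypher
// PROFILE runs the query and reports rows and db hits for each operator
PROFILE
MATCH (o:Occupation)
WITH o, SIZE((o)-[:requires]->()) AS occupation_skill_count
MATCH (o)-[:requires]->(:Skill)<-[hs:has_skill]-(u:User)
WITH o, u, occupation_skill_count, count(hs) AS user_skill_count
WHERE occupation_skill_count = user_skill_count
RETURN o.title AS occupation_title, count(u) AS users_count
```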

Since you're performing this for all occupations, it's a good candidate for batching. However, since batching won't be able to return the counts (it's used for write operations), we can instead use it to write the counts to the :Occupation nodes so we can query for these numbers fast after we're done computing them. At that point it's up to you if you want to keep the calculated properties (maybe with a timestamp of when they were calculated), or simply report on them and remove the properties immediately.

You'll need APOC Procedures for performing the batching operation. apoc.periodic.iterate() will be the procedure of choice (you can adjust the batchSize to whatever works best for you). I'll add comments inline.

CALL apoc.periodic.iterate(
 // iterate in batches over all :Occupation nodes
 "MATCH (o:Occupation) RETURN o",
 // for each occupation, get its required skills in ascending order of skilled users
 "MATCH (o)-[:requires]->(s:Skill)
 WITH o, s, size((s)<-[:has_skill]-()) as skilledUserCount
 ORDER BY skilledUserCount ASC
 WITH o, collect(s) as skills
 WITH o, head(skills) as first, tail(skills) as skills
 // get users with all the required skills
 // because of ordering, we start with the smallest set of skilled users
 // OPTIONAL MATCH keeps the occupation row (with a count of 0) when no user qualifies
 OPTIONAL MATCH (first)<-[:has_skill]-(u)
 WHERE ALL(skill in skills WHERE (skill)<-[:has_skill]-(u))
 // now set this count of users with all skills to the occupation
 WITH o, count(u) as skilledUsers
 SET o.skilledUsers = skilledUsers
 // uncomment next line to keep a timestamp of when this was last updated
 // SET o.skilledUsersUpdated = timestamp()
 ",
 {batchSize:1000, parallel:true, iterateList:true}) YIELD batches, total
 RETURN batches, total

Once this finishes, the processed occupations should have their number of skilled users stored in the skilledUsers property for easy querying:

MATCH (o:Occupation)
RETURN o.title as occupation_title, o.skilledUsers as users_count
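If you decide not to keep the calculated properties, a cleanup along these lines should work once you've taken the report. This is only a sketch, and it assumes the property names used above (skilledUsers, plus skilledUsersUpdated if you enabled the timestamp line).

```cypher
// remove the computed properties after reporting on them
MATCH (o:Occupation)
WHERE o.skilledUsers IS NOT NULL
// removing a property that was never set is a no-op, so listing both is safe
REMOVE o.skilledUsers, o.skilledUsersUpdated
```

If you keep the properties instead, they'll go stale as users gain skills or occupations change their requirements, so rerun the batch whenever you need fresh numbers.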

solution1
0 2017-04-18 15:45:58
