I'm trying to find out the number of users that have all the necessary skills to qualify for an occupation. Users can have many skills, and I want to return the count of qualified users per occupation.
Here's my current query:
MATCH (:User)-[:has_skill]->(:Skill)<-[:requires]-(o:Occupation)
WITH DISTINCT o
MATCH (o)
WITH o, SIZE((o)-[:requires]->()) AS occupation_skill_count
MATCH (o)-[:requires]->(:Skill)<-[hs:has_skill]-(u:User)
WITH o, u, occupation_skill_count, count(hs) AS user_skill_count
WHERE occupation_skill_count = user_skill_count
WITH o.title as occupation_title, count(u) as users_count
RETURN occupation_title, users_count
However, my query is not efficient enough: it times out (there are over 60,000 occupations, 10,000 users, and 2,500 skills). I want to know if there's a better way to write this query.
My approach in writing this query is,
This seems to work in the staging environment, where there are far fewer records. However, it times out in production because there is too much data. Is there a better way to write this?
For performance issues, it helps to show the PROFILE plan of the query. If you could expand all elements of the plan and paste it into your description, that could help identify where the query can be improved.
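As a rough sketch of what that looks like, profiling just means prefixing the query with PROFILE and running it; below it is applied unchanged to the query from the question. Keep in mind that PROFILE actually executes the query, so on the full production data it may time out as well. In that case, EXPLAIN (same prefix position) shows the planned operators without running anything, though without the actual row and db hit counts.

```cypher
// PROFILE executes the query and reports rows and db hits per operator
PROFILE
MATCH (:User)-[:has_skill]->(:Skill)<-[:requires]-(o:Occupation)
WITH DISTINCT o
MATCH (o)
WITH o, SIZE((o)-[:requires]->()) AS occupation_skill_count
MATCH (o)-[:requires]->(:Skill)<-[hs:has_skill]-(u:User)
WITH o, u, occupation_skill_count, count(hs) AS user_skill_count
WHERE occupation_skill_count = user_skill_count
WITH o.title as occupation_title, count(u) as users_count
RETURN occupation_title, users_count
```

Operators with very large row counts or db hits are usually the place to start looking.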
Since you're performing this for all occupations, it's a good candidate for batching. However, since batching won't be able to return the counts (it's used for write operations), we can instead use it to write the counts to the :Occupation nodes so we can query for these numbers fast after we're done computing them. At that point it's up to you if you want to keep the calculated properties (maybe with a timestamp of when they were calculated), or simply report on them and remove the properties immediately.
You'll need APOC Procedures to perform the batching operation. apoc.periodic.iterate() will be the procedure of choice (you can adjust the batchSize to whatever works best for you). I'll add comments inline.
CALL apoc.periodic.iterate(
// iterate in batches for all :Occupations
"MATCH (o:Occupation) RETURN o",
// for each occupation, get all skills in ascending order of skilled users
"MATCH (o)-[:requires]->(s:Skill)
WITH o, s, size((s)<-[:has_skill]-()) as skilledUserCount
// keep skills with no users: if nobody has a required skill, nobody qualifies
ORDER BY skilledUserCount ASC
WITH o, collect(s) as skills
WITH o, head(skills) as first, tail(skills) as skills
// get users with all the required skills
// because of ordering, we start with the smallest set of skilled users
// OPTIONAL MATCH keeps the occupation row when no user qualifies, so its count becomes 0
OPTIONAL MATCH (first)<-[:has_skill]-(u)
WHERE ALL(skill in skills WHERE (skill)<-[:has_skill]-(u))
// now set this count of users with all skills to the occupation
WITH o, count(u) as skilledUsers
SET o.skilledUsers = skilledUsers
// uncomment next line to keep a timestamp of when this was last updated
// SET o.skilledUsersUpdated = timestamp()
",
{batchSize:1000, parallel:true, iterateList:true}) YIELD batches, total
RETURN batches, total
Once this finishes, all occupations should have their number of skilled users for easy querying:
MATCH (o:Occupation)
RETURN o.title as occupation_title, o.skilledUsers as users_count
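If you decide not to keep the calculated properties after reporting on them, a cleanup along these lines should do it. This is a minimal sketch that assumes the property names used in the batch query above (skilledUsers, plus skilledUsersUpdated if you enabled the timestamp); REMOVE is a no-op for a property that isn't present.

```cypher
// Drop the computed properties from every occupation once they are no longer needed
MATCH (o:Occupation)
REMOVE o.skilledUsers, o.skilledUsersUpdated
```

With around 60,000 :Occupation nodes this should normally fit in a single transaction, but the same statement could be run through apoc.periodic.iterate() if it doesn't.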