
How to do better text search in Neo4j

I have two types of nodes, Article and TAG, where TAG has two properties, id and name. I want to search for all articles by their tags.

(a : Article)-[:TAGGED]->(t : TAG) 

E.g., if I have a tag like "i love my country" and my query string is "country", the search succeeds with the following query.

Match (a : Article)-[:TAGGED]->(t : TAG) 
where t.name =~ '.*country.*' 
return a;

But the reverse does not work: if my tag is "country" and I search for "i love my country", it should also return the articles related to country. It should also handle the case where the user has entered more than one space between two words. While searching I came across Lucene and Solr, but I don't know how to use them. I am using PHP as my coding language.

[EDITED]

Original Answer

This should work for you:

MATCH (a: Article)-[:TAGGED]->(t:TAG)
WHERE ANY(word IN FILTER(x IN SPLIT({searchString}, " ") WHERE x <> '') 
  WHERE t.name CONTAINS word)
RETURN a;

{searchString} is your search string, with one or more spaces separating the words; e.g.:

"i  love my    country"

This snippet generates a collection of the non-empty words in {searchString} :

FILTER(x IN SPLIT({searchString}, " ") WHERE x <> '')
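To make the matching rule easier to see outside the database, here is a minimal Python sketch of the same logic. It is not Neo4j code; the function names non_empty_words and tag_matches are made up for illustration, and Python's "in" operator stands in for Cypher's CONTAINS (a case-sensitive substring test).

```python
def non_empty_words(search_string):
    # Split on single spaces and drop the empty strings produced by
    # runs of spaces, mirroring SPLIT(...) followed by FILTER(... <> '').
    return [w for w in search_string.split(" ") if w != ""]


def tag_matches(tag_name, search_string):
    # True if any search word occurs as a substring of the tag name,
    # mirroring ANY(word IN ... WHERE t.name CONTAINS word).
    return any(word in tag_name for word in non_empty_words(search_string))


words = non_empty_words("i  love my    country")
hit = tag_matches("country", "i  love my    country")
# Substring matching is loose: the word "i" also matches the tag "city".
loose_hit = tag_matches("city", "i love my country")
```

Note that because this is a plain substring test, a one-letter word such as "i" matches any tag that contains that letter.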

Improved Answer

This query matches on whole words (e.g., if the query string is "i love you", the "i" will only match "i" or "I" as a word in the tag, not just any letter "i"). It is also case-insensitive.

WITH REDUCE(res = [], w IN SPLIT({searchString}, " ") |
  CASE WHEN w <> '' THEN res + ("(?i).*\\b" + w + "\\b.*") ELSE res END) AS res
MATCH (a: Article)-[:TAGGED]->(t:TAG)
WHERE ANY (regexp IN res WHERE t.name =~ regexp)
RETURN a;

The REDUCE expression generates a collection of regular expressions, one per non-empty word in {searchString} ; each word is surrounded by "(?i).*\\b" and "\\b.*" to become a regular expression for a case-insensitive search with word boundaries.

NOTE: the backslashes ( "\\" ) in the regular expression have to be doubled because the backslash is an escape character in Cypher string literals.
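The same word-boundary idea can be tried out in Python as a rough sketch. This is not the Cypher query itself; word_patterns and tag_matches_word are hypothetical helper names. re.fullmatch is used because Cypher's =~ has to match the whole string, and re.escape is an addition that is not in the query above: it keeps regex metacharacters in a search word from being interpreted as part of the pattern.

```python
import re


def word_patterns(search_string):
    # One case-insensitive, whole-word pattern per non-empty search word.
    # Raw strings are used, so the backslash in \b is not doubled here.
    return [
        r"(?i).*\b" + re.escape(w) + r"\b.*"
        for w in search_string.split(" ")
        if w != ""
    ]


def tag_matches_word(tag_name, search_string):
    # fullmatch requires the pattern to cover the entire tag name,
    # which is why each pattern starts and ends with ".*".
    return any(re.fullmatch(p, tag_name) for p in word_patterns(search_string))


patterns = word_patterns("i love")
whole_word_hit = tag_matches_word("country", "i  love my    country")
# "i" no longer matches inside "city", because \b demands a word boundary.
inner_letter_hit = tag_matches_word("city", "i love my country")
```

If a user types something like "c++" as a search word, the unescaped version would raise a regex error or match unexpectedly, which is the reason for the escaping step in this sketch.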

Neo4j uses Lucene indices internally for fulltext search.

Based on this page from the user guide, it appears that the default indexing 'type' is exact, using the Lucene Keyword Analyzer, which doesn't tokenize the input.

What that means is that, without changing this indexing setting, you can only run queries that match the entire tag name (in your example, the wildcards around "country" make the pattern match the whole tag string).

What I think you actually want, based on your stated requirements, is tokenization on whitespace ( type=fulltext ) at the time you insert the graph data, so that the tag field contains one token per word (1-i, 2-love, 3-my, 4-country), any one of which can match a query term without needing wildcards, e.g. "country" or "I love my chocolate".
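As a conceptual illustration of what tokenizing at insert time buys you, here is a small Python sketch of a token-to-tag index. It is only a toy model, not Lucene's or Neo4j's API; tokenize, build_index and search are hypothetical names, and lowercasing the tokens is an assumption made here so that "I" and "i" compare equal.

```python
from collections import defaultdict


def tokenize(text):
    # Split on any run of whitespace and lowercase each token
    # (lowercasing is an assumption of this sketch).
    return text.lower().split()


def build_index(tag_names):
    # Map each token to the set of tag names that contain it.
    index = defaultdict(set)
    for name in tag_names:
        for token in tokenize(name):
            index[token].add(name)
    return index


def search(index, query):
    # A tag is a hit if it shares at least one token with the query.
    hits = set()
    for token in tokenize(query):
        hits |= index.get(token, set())
    return hits


index = build_index(["i love my country", "country", "chocolate"])
by_single_word = search(index, "country")
by_phrase = search(index, "I love my   chocolate")
```

With the tokens stored individually, both directions of the original problem work: a short query finds a long tag, and a long query finds a short tag, with extra spaces in the query making no difference.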


 