Merge CSV in neo4j avoiding duplicates

I am working with Neo4j and the example movie data set: https://neo4j.com/developer/example-data/

I now want to import a CSV file from GroupLens and update my database with it: https://grouplens.org/datasets/movielens/

If a movie in the CSV file is already in the database, I want to update (merge) its properties with the values from the CSV file.

If the movie is not yet in the database, I want to create a new node for it.

One problem is that the titles in the CSV file include the release year, while the titles in the database do not. I therefore also need to split the title from the CSV file.

I tried this, but it doesn't work:

 USING PERIODIC COMMIT 500
 LOAD CSV WITH HEADERS FROM "File:///movies.csv" AS csvLine 
 Merge (movie:Movie{title:split(csvLine.title,"()")})
 Create (a:Movie{id:csvLine.movieId,genre:csvLine.genres})
 Return a.title

I suggest you install the APOC procedures plugin, which contains a lot of useful string functions:

https://github.com/neo4j-contrib/neo4j-apoc-procedures

Download the APOC plugin and copy it into your $NEO4J_HOME/plugins directory.

Enable the procedures in your neo4j.conf configuration file by adding this line:

dbms.security.procedures.unrestricted=apoc.*

Restart Neo4j.

The APOC library contains an apoc.text.replace function that accepts a regex:

WITH "Toy Story (1995)" AS title
RETURN trim(apoc.text.replace(title, "\\([0-9]+\\)",""))


╒═══════════════════════════════════════════════════════════╕
│"trim(apoc.text.replace(title, \"\\\\([0-9]+\\\\)\",\"\"))"│
╞═══════════════════════════════════════════════════════════╡
│"Toy Story"                                                │
└───────────────────────────────────────────────────────────┘
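The pattern above removes any parenthesized run of digits, wherever it appears in the title. A stricter variant, sketched here as an assumed refinement rather than part of the original answer, anchors the match to the end of the string and requires exactly four digits, so only a trailing year is stripped. The sample titles are literals chosen for illustration.

```cypher
// Assumed refinement: strip only a trailing four-digit year in parentheses.
UNWIND ["Toy Story (1995)",
        "City of Lost Children, The (Cité des enfants perdus, La) (1995)"] AS title
RETURN title,
       apoc.text.replace(title, "\\s*\\([0-9]{4}\\)\\s*$", "") AS cleaned
```

With this pattern the second title should keep its parenthesized alternative title and lose only the year, which may or may not be what you want when matching against titles already in the database.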

You can then use it in your LOAD CSV statement:

 USING PERIODIC COMMIT 500
 LOAD CSV WITH HEADERS FROM "File:///movies.csv" AS csvLine 
 MERGE (movie:Movie {title: trim(apoc.text.replace(csvLine.title, "\\([0-9]+\\)","")) })
 // SET on the merged node; a separate CREATE would add a second :Movie node per row
 SET movie.id = csvLine.movieId, movie.genre = csvLine.genres
 RETURN movie.title

When a query isn't working as expected, it's best to develop a minimal query to check your assumptions and see what's going wrong.

For example, you might use this to test whether your split is working as intended:

LOAD CSV WITH HEADERS FROM "File:///movies.csv" AS csvLine 
WITH csvLine LIMIT 1
RETURN csvLine.title, split(csvLine.title,"()")

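You'll quickly see that the split isn't doing what you think it is. Passing "()" does not split on either character; the whole string is treated as one delimiter, so a title such as "Toy Story (1995)" comes back unsplit. Even if multiple delimiters did work, you would still need to pick the relevant element out of the split result.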

One way to get what you want is to split around '(', get the first element of the resulting string array, and then right-trim it to get rid of any possible white space: rTrim(split(csvLine.title, '(')[0])
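To check that expression without touching the CSV file, a minimal sketch can run it on literal titles (the two sample strings are chosen for illustration; the second has an extra parenthesized part):

```cypher
// Apply the split-and-trim expression to literal sample titles.
UNWIND ["Toy Story (1995)",
        "City of Lost Children, The (Cité des enfants perdus, La) (1995)"] AS title
RETURN title, rTrim(split(title, '(')[0]) AS cleaned
```

The first row yields "Toy Story". The second yields "City of Lost Children, The", because everything from the first '(' onward is dropped, including the alternative title. Whether that is desirable depends on how the titles are stored in the database.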

The other thing that needs to be addressed is to figure out what you're trying to do in that CREATE clause after your MERGE. I have a feeling you just wanted a single MERGE, and then wanted to SET values afterward (otherwise you're creating two separate :Movie nodes here, one for the title, and another for the id and genre, which doesn't make sense).

Try this:

USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM "File:///movies.csv" AS csvLine 
MERGE (movie:Movie{title: rTrim(split(csvLine.title, '(')[0])})
SET movie.id = toInteger(csvLine.movieId), movie.genres = csvLine.genres
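If you need to distinguish newly created movies from ones that already existed, MERGE also supports ON CREATE SET and ON MATCH SET subclauses. The following is a sketch under assumptions: the `source` and `updatedFromCsv` properties are hypothetical names introduced only for illustration and are not part of either data set.

```cypher
USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM "File:///movies.csv" AS csvLine
// Compute the cleaned title once so MERGE matches on the same value every time.
WITH csvLine, rTrim(split(csvLine.title, '(')[0]) AS cleanTitle
MERGE (movie:Movie {title: cleanTitle})
  // Runs only when the node did not exist before (hypothetical property).
  ON CREATE SET movie.source = 'movielens'
  // Runs only when an existing node was matched (hypothetical property).
  ON MATCH SET movie.updatedFromCsv = true
// Runs in both cases.
SET movie.id = toInteger(csvLine.movieId), movie.genres = csvLine.genres
```

Because MERGE matches on the title alone, two different films that share a cleaned title would end up on the same node, so it is worth checking the data for such collisions before relying on this.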
