简体   繁体   中英

Excel - combining two csv files into one with a common column

I am working with two datasets in csv form (movielens latest-small dataset). Given below are the fields of both.

rating.csv

  user_id   movie_id  rating

movie.csv

  movie_id   movie_name

what I want is to combine them into a single .csv with following fields

user_id   movie_id   movie_name   rating

So that the common column movie_id maps with corresponding movie_name .

Could that be done using Excel? If not, how can I do it?

I just need it as a dataset for my recommender engine, so any simple solution is welcome as end result is all that matters. But since I've some experience in java so that would be easy for my easy understand and implement.

If there is some way using Excel then that would be the best. I have tried searching online and found some VLOOKUP method but couldn't clearly get it.

Also I tried some online merging tools but they just attached the sheets one after the another not mapping the column. So I have no problem using online tools too.

This is the method with a VLOOKUP formula within Excel:

在此处输入图片说明

The formula takes 4 arguments:

  1. The value you are wanting to look up
  2. The range of data you are looking into
  3. The column within (2) that contains the answer you want
  4. Whether to match on (1) approximately ie FALSE = exact match

See here for documentation on the function.

Check out this tool - https://github.com/DataFoxCo/gocsv - it's based off of csvkit but has a ton of additional features. One of our engineers custom built it - and open sourced it to help solve some of these data issues we deal with every day :)

It will do a vlookup essentially of any sized csv in merely seconds using the join command:

gocsv join --columns 'movie_id','movie_id' --left rating.csv movie.csv > combineddata.csv

then if you still want to reorder the columns, you can do that too:

gocsv select --columns 'user_id','movie_id','movie_name','rating' combineddata.csv > combineddata-final.csv

I split the commands out up top to help explain its use - the documentation has all the examples on it also but ultimately I would really recommend pipelining it and doing it in one command like this:

cat rating.csv \\ | gocsv join --left --columns 'movie_id','movie_id' movie.csv \\ | gocsv select --columns 'user_id','movie_id','movie_name','rating' > combineddata.csv

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM