
Parsing CSV file with Pandas in Python 3

I am trying to parse a movie database with Python 3. How can I parse the genres of a movie into separate variables? For example:

1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy

The first value is movie_id, the second is the movie title, and the third holds the genres. I want to parse the genres as separate variables that belong to the corresponding movie. In other words, I want "|" to act as a second separator inside the genres field. How can I achieve this? Here is my code:

import numpy as np
import pandas as pd
header = ["movie_id", "title", "genres"]
movie_db = pd.read_csv("movielens/movies.csv", sep=",", names=header)

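You can split on both , and | by passing the regular expression [,|] as the separator, which requires engine='python'. The catch is that the number of columns is taken from the first row, so the first row has to contain at least as many genres as any other row: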

df = pd.read_csv("movielens/movies.csv", sep="[,|]", header=None, engine='python')
print (df)
   0                 1          2          3         4       5        6
0  1  Toy Story (1995)  Adventure  Animation  Children  Comedy  Fantasy
1  2    Jumanji (1995)  Adventure   Children   Fantasy    None     None
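One possible way around the first-row restriction, which is not part of the original answer, is to pass explicit column names that are wide enough for the longest row. The sketch below reads the two sample rows from memory with the shorter row first. The value of max_genres and the genre_1 ... genre_5 column names are assumptions made for illustration.

```python
import io
import pandas as pd

# Sample rows from the question, with the shorter row placed first on purpose.
sample = io.StringIO(
    "2,Jumanji (1995),Adventure|Children|Fantasy\n"
    "1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy\n"
)

# Assumption: the largest number of genres on any row is known in advance.
max_genres = 5
names = ["movie_id", "title"] + [f"genre_{i}" for i in range(1, max_genres + 1)]

# With explicit names, the column count no longer depends on the first row;
# rows with fewer genres are padded with missing values.
df = pd.read_csv(sample, sep="[,|]", header=None, names=names, engine="python")
print(df)
```

If the maximum is not known, it would have to be found with a first pass over the file before calling read_csv.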

A better approach here is to create one indicator column per genre with str.get_dummies, set to 1 if the movie has that genre and 0 otherwise, and attach those columns to the original ones with join:

movie_db = pd.read_csv("movielens/movies.csv", sep=",", names=header)
df =  movie_db.join(movie_db.pop('genres').str.get_dummies())
print (df)
   movie_id             title  Adventure  Animation  Children  Comedy  Fantasy
0         1  Toy Story (1995)          1          1         1       1        1
1         2    Jumanji (1995)          1          0         1       0        1
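As a self-contained sketch of the get_dummies approach, the following reads the two sample rows from memory instead of movielens/movies.csv, then uses the indicator columns for filtering and counting. The names sample, indicators, comedies and genre_counts are only illustrative.

```python
import io
import pandas as pd

# Two sample rows copied from the question, held in memory instead of a file.
sample = io.StringIO(
    "1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy\n"
    "2,Jumanji (1995),Adventure|Children|Fantasy\n"
)
header = ["movie_id", "title", "genres"]
movie_db = pd.read_csv(sample, sep=",", names=header)

# str.get_dummies splits each string on "|" and yields one 0/1 column per genre.
indicators = movie_db.pop("genres").str.get_dummies(sep="|")
df = movie_db.join(indicators)

# Filter on an indicator column: titles tagged as Comedy.
comedies = df.loc[df["Comedy"] == 1, "title"].tolist()
print(comedies)

# Sum each indicator column: number of movies per genre.
genre_counts = indicators.sum()
print(genre_counts)
```

The indicator layout makes questions such as "which movies are comedies" a plain column comparison, which is harder to express with positional genre columns.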

If you need the genres as positional columns instead, split the genres column on | with expand=True:

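# Re-read the file first: an earlier pop('genres') removes that column from movie_db.
movie_db = pd.read_csv("movielens/movies.csv", sep=",", names=header)
df = movie_db.join(movie_db.pop('genres').str.split('|', expand=True))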
print (df)
   movie_id             title          0          1         2       3        4
0         1  Toy Story (1995)  Adventure  Animation  Children  Comedy  Fantasy
1         2    Jumanji (1995)  Adventure   Children   Fantasy    None     None
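A further variation, not mentioned in the answer above, is a long layout with one row per movie and genre pair. The sketch below builds it with Series.str.split followed by DataFrame.explode, which assumes pandas 0.25 or newer. The names long_df and genres_by_movie are illustrative.

```python
import io
import pandas as pd

# Two sample rows copied from the question, held in memory instead of a file.
sample = io.StringIO(
    "1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy\n"
    "2,Jumanji (1995),Adventure|Children|Fantasy\n"
)
movie_db = pd.read_csv(sample, sep=",", names=["movie_id", "title", "genres"])

# Turn each genre string into a list, then give every list element its own row.
long_df = (
    movie_db.assign(genres=movie_db["genres"].str.split("|"))
    .explode("genres")
    .rename(columns={"genres": "genre"})
    .reset_index(drop=True)
)
print(long_df)

# Collect the genres back into one list per movie_id.
genres_by_movie = long_df.groupby("movie_id")["genre"].apply(list).to_dict()
print(genres_by_movie)
```

This layout avoids the None padding of the positional columns and works directly with groupby and merge.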
