Read in a CSV with a different separator for each column?
The id and runtime columns are comma-separated. However, the values inside the genres column are separated by a pipe (|).
df = pd.read_csv(path, sep=',')
results in the table below. However, I can't run any queries on the genres column, for instance finding the most popular genre by year. Is it possible to split the pipe-separated values into separate rows?
df.head()
       id  runtime                                     genres  Year
0  135397      124  Action|Adventure|Science Fiction|Thriller  2000
1   76341      120  Action|Adventure|Science Fiction|Thriller  2002
2  262500      119         Adventure|Science Fiction|Thriller  2001
3  140607      136   Action|Adventure|Science Fiction|Fantasy  2000
4  168259      137                      Action|Crime|Thriller  1999
You're better off reading the file as is, then splitting the genres into new rows with pandas explode:
df = df.assign(genres = df.genres.str.split('|')).explode('genres')
so that you can easily manipulate your data.
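As a self-contained sketch of the split-and-explode step, the snippet below rebuilds the five rows from the question as an in-memory CSV. The `csv_text` string and the `long_df` name are assumptions made for illustration; the real file may be laid out differently.

```python
import io

import pandas as pd

# Assumed file contents, reconstructed from the df.head() output in the question.
csv_text = """id,runtime,genres,Year
135397,124,Action|Adventure|Science Fiction|Thriller,2000
76341,120,Action|Adventure|Science Fiction|Thriller,2002
262500,119,Adventure|Science Fiction|Thriller,2001
140607,136,Action|Adventure|Science Fiction|Fantasy,2000
168259,137,Action|Crime|Thriller,1999
"""

# Read with the comma separator only; genres stays a single pipe-delimited string.
df = pd.read_csv(io.StringIO(csv_text), sep=',')

# Split each genres string into a list, then emit one row per list element.
long_df = df.assign(genres=df.genres.str.split('|')).explode('genres')

# explode repeats the original row label for each new row, so rebuild the index.
long_df = long_df.reset_index(drop=True)

print(long_df)
```

Resetting the index is optional; it only avoids duplicate row labels if you later select rows with `.loc`.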
For example, to get the most frequent (i.e. the mode) genres per year:
df.groupby('Year').genres.apply(lambda x: x.mode()).droplevel(1)
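Note that `mode()` returns every value tied for the highest frequency, so a year can appear more than once in the result. The sketch below shows this on a small hand-made frame; the `long_df` data is hypothetical and stands in for the exploded table.

```python
import pandas as pd

# Hypothetical exploded data: 1999 has a two-way tie, 2000 has a clear winner.
long_df = pd.DataFrame({
    'Year': [1999, 1999, 2000, 2000, 2000],
    'genres': ['Action', 'Crime', 'Action', 'Action', 'Thriller'],
})

# mode() yields one entry per tied genre; droplevel(1) removes the
# positional level that mode() adds, leaving Year as the only index.
modes = long_df.groupby('Year').genres.apply(lambda x: x.mode()).droplevel(1)

print(modes)
```

If you need exactly one genre per year, you would have to pick a tie-breaking rule yourself, for example taking the first entry of each group.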
To also get the counts:
def get_all_max(grp):
    # Count each genre in the group and keep every genre tied for the maximum
    counts = grp.value_counts()
    return counts[counts == counts.max()]

df.groupby('Year').genres.apply(get_all_max)\
  .rename_axis(index={None: 'Genre'}).to_frame(name='Count')
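An alternative way to get the same kind of table, not the approach used in the answer, is to count every (Year, genre) pair with `size()` and keep the rows that match the per-year maximum. This is a sketch on hypothetical data; `long_df`, `counts`, and `top` are illustrative names.

```python
import pandas as pd

# Hypothetical exploded data, one row per (movie, genre) pair.
long_df = pd.DataFrame({
    'Year': [1999, 1999, 2000, 2000, 2000],
    'genres': ['Action', 'Crime', 'Action', 'Action', 'Thriller'],
})

# Number of rows for each (Year, genre) combination.
counts = long_df.groupby(['Year', 'genres']).size().rename('Count')

# Broadcast each year's maximum count back onto every row of that year,
# then keep the rows that reach it (ties are all kept).
yearly_max = counts.groupby(level='Year').transform('max')
top = counts[counts == yearly_max]

print(top.to_frame())
```

This variant avoids a Python-level function in `apply` and does not depend on how the index levels are named after `value_counts()`.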