
Read in a CSV with a different separator for each column?

The columns id and runtime are comma-separated. However, the genres column is separated by a pipe (|). df = pd.read_csv(path, sep=',') results in the table below. The problem is that I can't run any queries on the genres column, for instance finding the most popular genre by year. Is it possible to split the pipe-separated values into separate rows?

df.head()
   id      runtime  genres                                     Year
0  135397  124      Action|Adventure|Science Fiction|Thriller  2000
1  76341   120      Action|Adventure|Science Fiction|Thriller  2002
2  262500  119      Adventure|Science Fiction|Thriller         2001
3  140607  136      Action|Adventure|Science Fiction|Fantasy   2000
4  168259  137      Action|Crime|Thriller                      1999

You're better off reading the file as is, then splitting genres into new rows with pandas explode:

df = df.assign(genres = df.genres.str.split('|')).explode('genres')

so that you can easily manipulate your data.
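As a minimal, self-contained sketch of the split-and-explode step: the CSV text below is a hypothetical stand-in for the asker's file (built from the rows shown in the question), and `csv_text` and `long_df` are names I made up for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for the real file on disk.
csv_text = """id,runtime,genres,Year
135397,124,Action|Adventure|Science Fiction|Thriller,2000
76341,120,Action|Adventure|Science Fiction|Thriller,2002
262500,119,Adventure|Science Fiction|Thriller,2001
140607,136,Action|Adventure|Science Fiction|Fantasy,2000
168259,137,Action|Crime|Thriller,1999
"""

# Read with the comma separator only; genres stays a single string column.
df = pd.read_csv(io.StringIO(csv_text), sep=",")

# Split each genres string into a list, then give every list element its own row.
long_df = df.assign(genres=df["genres"].str.split("|")).explode("genres")

print(long_df.head())
```

Note that explode repeats the original index label for every row it produces, so you may want to call reset_index(drop=True) afterwards if you need a unique index.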


For example, to get the most frequent (i.e. mode) genres per year:

df.groupby('Year').genres.apply(lambda x: x.mode()).droplevel(1)
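One thing worth knowing about this approach is that mode() returns every value tied for most frequent, so a year can show up more than once in the result. A small sketch on made-up, already-exploded data (`long_df` and its contents are assumptions, not the asker's data):

```python
import pandas as pd

# Hypothetical, already-exploded data: one row per (Year, genre) pair.
long_df = pd.DataFrame({
    "Year": [2000, 2000, 2000, 2001, 2001],
    "genres": ["Action", "Action", "Thriller", "Crime", "Fantasy"],
})

# mode() keeps all tied values, so 2001 yields two rows here.
top = long_df.groupby("Year").genres.apply(lambda x: x.mode()).droplevel(1)

print(top)
```

Here 2000 has a single winner (Action), while 2001 has a tie and returns both Crime and Fantasy.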

To identify the counts:

def get_all_max(grp):
    counts = grp.value_counts()
    return counts[counts==counts.max()]

df.groupby('Year').genres.apply(get_all_max)\
.rename_axis(index={None:'Genre'}).to_frame(name='Count')
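If you prefer a flat table over a MultiIndex, a different way to get the same information (not the method above, just an alternative sketch) is to count with size() and filter with transform('max'). The data and the names `long_df`, `counts` and `top_counts` are assumptions for illustration:

```python
import pandas as pd

# Hypothetical, already-exploded data: one row per (Year, genre) pair.
long_df = pd.DataFrame({
    "Year": [2000, 2000, 2000, 2001, 2001],
    "genres": ["Action", "Action", "Thriller", "Crime", "Fantasy"],
})

# Count rows per (Year, genre) pair and flatten to ordinary columns.
counts = long_df.groupby(["Year", "genres"]).size().rename("Count").reset_index()

# Keep the rows whose count equals the largest count within their year (ties kept).
is_top = counts["Count"] == counts.groupby("Year")["Count"].transform("max")
top_counts = counts[is_top].reset_index(drop=True)

print(top_counts)
```

This keeps ties just like the get_all_max version, but the result has plain Year, genres and Count columns, which can be easier to merge or plot.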
