Read in a CSV and use a different separator depending on the column?

Question

Columns id and runtime are comma-separated, but column genres is separated by a pipe (|). df = pd.read_csv(path, sep=',') results in the table below. However, I can't run any queries on column genres, for instance finding the most popular genre by year. Is it possible to split the pipe-separated values into separate rows?

df.head()
       id  runtime                                     genres  Year
0  135397      124  Action|Adventure|Science Fiction|Thriller  2000
1   76341      120  Action|Adventure|Science Fiction|Thriller  2002
2  262500      119         Adventure|Science Fiction|Thriller  2001
3  140607      136   Action|Adventure|Science Fiction|Fantasy  2000
4  168259      137                      Action|Crime|Thriller  1999

Answer 1

You're better off reading the file as is, then splitting the genres into new rows with pandas explode:

df = df.assign(genres = df.genres.str.split('|')).explode('genres')

so that you can easily manipulate your data.
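As a self-contained illustration of the split-then-explode step, here is a minimal sketch using the five rows shown in the question. The CSV text and the name long_df are assumptions made for the example; the real file is read from path.

```python
import io
import pandas as pd

# The rows come from the question's df.head(); the raw CSV layout is assumed.
csv_text = """id,runtime,genres,Year
135397,124,Action|Adventure|Science Fiction|Thriller,2000
76341,120,Action|Adventure|Science Fiction|Thriller,2002
262500,119,Adventure|Science Fiction|Thriller,2001
140607,136,Action|Adventure|Science Fiction|Fantasy,2000
168259,137,Action|Crime|Thriller,1999
"""

# Read with the comma separator only; genres stays a single string column.
df = pd.read_csv(io.StringIO(csv_text), sep=",")

# Turn each genres string into a list, then emit one row per list element.
# The other columns (id, runtime, Year) are repeated for every genre.
long_df = df.assign(genres=df["genres"].str.split("|")).explode("genres")

print(long_df.shape)  # (18, 4)
```

Note that explode keeps the original row label on every row it produces, so the index contains repeats; call reset_index(drop=True) if a unique index is needed.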

For example, to get the most frequent (i.e. mode) genres per year:

df.groupby('Year').genres.apply(lambda x: x.mode()).droplevel(1)
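One thing worth knowing about this pattern is that mode() returns every tied value, so a year can appear more than once in the result. The sketch below shows that on a small, hypothetical, already-exploded frame (the data and the names long_df and top are made up for illustration).

```python
import pandas as pd

# Hypothetical exploded data: one row per (Year, genre) pair.
long_df = pd.DataFrame({
    "Year":   [1999, 1999, 1999, 2000, 2000, 2000, 2000],
    "genres": ["Action", "Crime", "Action",
               "Action", "Thriller", "Action", "Thriller"],
})

# mode() yields a Series per group, so the result has a (Year, position)
# MultiIndex; droplevel(1) discards the position and keeps only Year.
top = long_df.groupby("Year")["genres"].apply(lambda s: s.mode()).droplevel(1)

print(top)
```

Here 1999 has a single mode (Action), while 2000 has a tie, so it contributes two rows (Action and Thriller).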

To identify the counts:要识别计数：

def get_all_max(grp):
    counts = grp.value_counts()
    return counts[counts==counts.max()]

df.groupby('Year').genres.apply(get_all_max)\
.rename_axis(index={None:'Genre'}).to_frame(name='Count')
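If a flat table is easier to work with than a MultiIndex, the same counts can probably be obtained with size() and transform("max"). This is an alternative sketch, not the approach used in the answer, and the sample data and the names counts, is_top and top_counts are hypothetical.

```python
import pandas as pd

# Hypothetical exploded data: one row per (Year, genre) pair.
long_df = pd.DataFrame({
    "Year":   [1999, 1999, 1999, 2000, 2000, 2000, 2000],
    "genres": ["Action", "Crime", "Action",
               "Action", "Thriller", "Action", "Thriller"],
})

# Count occurrences of each genre within each year.
counts = long_df.groupby(["Year", "genres"]).size().rename("Count").reset_index()

# Keep every genre whose count equals that year's maximum (ties included).
is_top = counts["Count"] == counts.groupby("Year")["Count"].transform("max")
top_counts = counts[is_top].reset_index(drop=True)

print(top_counts)
```

The result has plain Year, genres and Count columns, which makes later filtering or merging straightforward.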

Solution 1
1 2020-12-06 23:04:33