Python pandas - Extracting multi-valued attributes

Question

I just started with Python and want to do data preparation with the numpy/pandas packages on the Movielens dataset (specifically the file with MovieID, movie name and year, and genre).

Screenshot: movielens - movie dataset

The Genre column is a multi-value column, which is a problem for me because I want to try machine learning algorithms on the dataset.

Aim: I want yes/no or 0/1 information about which genres each movie falls into and which it does not.

Idea: Check whether the 'Genre' column contains the column name of each appended column (the single genre names). If so, write yes, otherwise write no in the cell. Iterate this over all the new columns and all the rows.

Done so far: I appended empty/NaN columns to the dataframe for each genre. I also tried dataframe.iloc['Genre'].str.contains(list(dataframe)[4]), which gave me TRUE or FALSE depending on whether the names matched. But how can I iterate and write into the cells in an elegant way?

Many thanks in advance. Best, Marcel

EDIT: Here is what I have achieved so far. I split the data in the Genre column on the pipe separator, renamed the resulting columns, appended them to the dataframe, and deleted the old column. If I now use the get_dummies function on all the columns, it creates e.g. 'Genre1_Action', 'Genre1_Adventure', ..., 'Genre3_Thriller', according to the text values in the genre cells. What I want to achieve is a single column per genre for each movie.

import pandas as pd

# create a small test subset (copy so the original movie_data is not modified)
subset1 = movie_data[0:9].copy()
print("Original Dataset")
print(subset1)
# Split movie name and year into separate values -> append them to the df -> clean the Year column
tempY = subset1['MovieNameYear'].str.split('(').apply(pd.Series)
tempY.columns = ['MovieName','Year']
subset1 = pd.concat([subset1,tempY], axis=1, join='inner')
subset1['Year'] = subset1['Year'].str.replace(')', '', regex=False)
del subset1['MovieNameYear']

# split the 'Genre' column on the pipe separator into separate columns
# name the columns of the temporary frame
# join the columns created by the split to the existing subset and delete the original multi-value column
tempG = subset1['Genre'].str.split('|').apply(pd.Series)
tempG.columns = ['Genre1','Genre2','Genre3']
subset1 = pd.concat([subset1, tempG], axis=1, join='inner')
del subset1['Genre']
print("Cleaned Dataset")
print(subset1)

dummiesTemp = pd.get_dummies(data=subset1, columns=['Genre1','Genre2','Genre3'])
print(dummiesTemp)
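
For reference, the effect described in the EDIT can be reproduced on a tiny hand-made frame. The genre values below are made up for illustration and are not taken from the Movielens file. Passing prefix='' and prefix_sep='' to get_dummies drops the Genre1_/Genre2_ prefixes, so the same genre from different positions ends up under the same column name, and grouping those duplicate names then leaves one column per genre. This is only a sketch of one possible way to collapse the columns:

```python
import pandas as pd

# Hypothetical stand-in for the split genre columns (tempG in the code above)
tempG = pd.DataFrame({
    'Genre1': ['Action', 'Comedy', 'Action'],
    'Genre2': ['Adventure', None, 'Thriller'],
    'Genre3': ['Thriller', None, None],
})

# Without prefixes, a genre gets the same column name whichever position it came from
raw = pd.get_dummies(tempG, prefix='', prefix_sep='', dtype=int)

# Merge duplicate column names: a genre is 1 if it appeared in any position
genres = raw.T.groupby(level=0).max().T
print(genres)
```

The transpose is there because the duplicates are column names; grouping on the transposed index and taking the maximum acts as a logical OR across the positions.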

Answer 1

If I understand you correctly, you want one column per genre, indicating true/false. I would advise you to look at the get_dummies function:

import pandas as pd
s = pd.Series(list('abca'))
pd.get_dummies(s)
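
To make the result concrete, here is the same call with the output captured and printed. The dtype=int argument is an addition of mine so that the cells show 0/1 rather than True/False; the variable name dummies is just for illustration:

```python
import pandas as pd

s = pd.Series(list('abca'))

# One column per distinct value; dtype=int gives 0/1 instead of booleans
dummies = pd.get_dummies(s, dtype=int)
print(dummies)
#    a  b  c
# 0  1  0  0
# 1  0  1  0
# 2  0  0  1
# 3  1  0  0
```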

Update: if you have columns with multiple values, you can split them before or after encoding. Here is an example of splitting after (I would guess it's the quickest, but one should test). The code could be prettier, but I hope it's clear.

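import numpy as np
import pandas as pd

s = pd.Series(['a', 'b', 'c', 'a|b', 'a|d'])
d = pd.get_dummies(s)

# Fold each combined column (e.g. 'a|b') into its single-label columns
combined = [col for col in d.columns if '|' in col]
for col in combined:
    for label in col.split('|'):
        if label in d.columns:
            # label already has a column: mark it where either column is set
            d[label] = np.maximum(d[label].values, d[col].values)
        else:
            # label only ever occurs in combination: create its column
            d[label] = d[col]

# The combined columns are no longer needed
d = d.drop(columns=combined)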
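
For comparison, the "split before" variant mentioned above might look like the sketch below. It relies on Series.explode, so it assumes a pandas version that provides it (0.25 or later); the names exploded and d_before are only illustrative, and I have not benchmarked it against the loop version:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c', 'a|b', 'a|d'])

# One row per single label; multi-label rows repeat their original index
exploded = s.str.split('|').explode()

# One-hot encode the single labels, then merge rows that share an original index
d_before = pd.get_dummies(exploded, dtype=int).groupby(level=0).max()
print(d_before)
```

Taking the maximum per original index acts as a logical OR, so a row such as 'a|d' ends up with a 1 in both the a and d columns.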

Answer 2

This should really be a comment, but I lack the reputation :'). I came across a decent answer for this.

In short:

dummies = df.genres.str.get_dummies('|')

This gives you a DataFrame containing the one-hot encoded output.

Then you can join it to the original df with:

df = df.join(dummies)
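
Put together on a small frame, the two steps could look like this. The rows and the column names movie_id, title and genres are made up to resemble the Movielens movie file, so adjust them to your actual data:

```python
import pandas as pd

# Made-up rows in the shape of the Movielens movie file (column names are assumptions)
df = pd.DataFrame({
    'movie_id': [1, 2, 3],
    'title': ['Movie A (1995)', 'Movie B (1995)', 'Movie C (1996)'],
    'genres': ['Animation|Comedy', 'Adventure', 'Comedy|Romance'],
})

# Split each cell on '|' and build one indicator column per distinct genre
dummies = df['genres'].str.get_dummies('|')

# Attach the indicator columns to the original rows (aligned on the index)
df = df.join(dummies)
print(df)
```

Because join aligns on the index, this works as long as dummies was computed from the same df without reordering or resetting the index in between.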

Answer 1: score 1, accepted, 2018-04-10 15:52:13

Answer 2: score 0, 2020-07-05 14:33:21
