
Python pandas - extracting multi-value attributes

I just started with Python and wanted to do data preparation with the numpy/pandas packages on the Movielens dataset (specifically the file with MovieID, Movie Name and Year, as well as Genre).

Screenshot: movielens - movie dataset

The Genre column is a multi-value column, which is a problem for me since I want to try machine learning algorithms on the dataset.

Aim: I want yes/no or 0/1 information about which genres each movie falls into and which it does not.

Idea: Check whether the 'Genre' column contains the column name of each appended column (the single genre names). If so, write yes, otherwise write no in the cell. Then repeat this over all the new columns and all the rows.

Done so far: I appended empty/NaN columns to the dataframe, one for each genre. I also tried dataframe['Genre'].str.contains(list(dataframe)[4]), which gave me True or False depending on whether the names matched. But how can I iterate and write to the cells in an elegant way?

Many thanks in advance. Best, Marcel

EDIT: Here is what I have achieved so far. I split the data in the Genre column on the pipe separator, named the resulting columns, appended them to the dataframe and deleted the old column. If I now use the get_dummies function on all of these columns, it creates e.g. 'Genre1_Action', 'Genre1_Adventure', ..., 'Genre3_Thriller', according to the text values in the cells of the genre columns. What I want to achieve is that each genre gets a single column of its own, filled in for each movie.

import pandas as pd

# movie_data is assumed to be the DataFrame already loaded from the Movielens movie file
# create a small test subset
subset1 = movie_data[0:9]
print("Original Dataset")
print(subset1)
# Split movie name and year into separate values -> append them to the df -> clean the Year column
tempY = subset1['MovieNameYear'].str.split('(').apply(pd.Series)
tempY.columns = ['MovieName','Year']
subset1 = pd.concat([subset1,tempY], axis=1, join='inner')
# regex=False treats ')' as a literal character rather than a regex pattern
subset1['Year'] = subset1['Year'].str.replace(')', '', regex=False)
del subset1['MovieNameYear']

# split the column 'Genre' on the pipe separator into separate columns
# name the columns of the temp frame that holds the split values
# join the columns created by the split to the existing subset and delete the original multi-value column
tempG = subset1['Genre'].str.split('|').apply(pd.Series)
tempG.columns = ['Genre1','Genre2','Genre3']
subset1 = pd.concat([subset1, tempG], axis=1, join='inner')
del subset1['Genre']
print("Cleaned Dataset")
print(subset1)

dummiesTemp = pd.get_dummies(data=subset1, columns=['Genre1','Genre2','Genre3'])
print(dummiesTemp)

If I understand you correctly, you want one column per genre, indicating True/False. I would advise you to look at the get_dummies function:

import pandas as pd
s = pd.Series(list('abca'))
pd.get_dummies(s)
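For reference, here is a small sketch of what that call produces for the sample Series. The dtype=int argument is an addition made here only so the result shows 0/1; without it, newer pandas versions return True/False.

```python
import pandas as pd

s = pd.Series(list('abca'))

# One indicator column per distinct value; dtype=int is an assumption
# added for this sketch so the cells hold 0/1 instead of booleans
dummies = pd.get_dummies(s, dtype=int)
print(dummies)
#    a  b  c
# 0  1  0  0
# 1  0  1  0
# 2  0  0  1
# 3  1  0  0
```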

Update: if you have columns with combined values, you can split them before or after creating the dummies. Here is an example of splitting after (I would guess it's the quickest, but one should test). The code could be prettier, but I hope it's clear.

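import pandas as pd
import numpy as np

s = pd.Series(['a', 'b', 'c', 'a|b', 'a|d'])
d = pd.get_dummies(s)

columns = list(d)
for col in columns:
    if '|' in col:
        # fold each combined column into the columns of its single values
        for value in col.split('|'):
            # check the live columns so values added earlier in the loop are not overwritten
            if value in d.columns:
                d[value] = np.maximum(d[value].values, d[col].values)
            else:
                d[value] = d[col]
        # the combined column is no longer needed
        d = d.drop(columns=col)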
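The "split before" variant could look roughly like this. It is only a sketch: the genres_split frame is a made-up stand-in for already-split columns such as Genre1..Genre3 in the question, and the sample values are invented. Passing an empty prefix and prefix_sep names the dummy columns after the genre alone, so the same genre coming from different source columns can be merged afterwards.

```python
import pandas as pd

# Hypothetical stand-in for the split Genre1..Genre3 columns from the question
genres_split = pd.DataFrame({
    'Genre1': ['Animation', 'Adventure', 'Comedy'],
    'Genre2': ["Children's", "Children's", 'Romance'],
    'Genre3': ['Comedy', 'Fantasy', None],
})

# Empty prefix and separator: columns are named 'Comedy', not 'Genre1_Comedy'
dummies = pd.get_dummies(genres_split, prefix='', prefix_sep='', dtype=int)

# A genre may now appear under several identically named columns;
# merge them by taking the maximum per column name
per_genre = dummies.T.groupby(level=0).max().T
print(per_genre)
```

The transpose is used so the grouping works on the column labels without relying on the axis=1 option of groupby, which newer pandas versions deprecate.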

This should really be a comment, but I lack the reputation :'). I found a decent answer for this.

In short:

dummies = df.genres.str.get_dummies('|')

This will give you a DataFrame containing the one-hot encoded output.

Then you can join it to the original df with:

df = df.join(dummies)
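Put together on a tiny example, this might look as follows. The sample rows and the column names movieId, title and genres are assumptions made for illustration, not the asker's actual data; dropping the original genres column at the end is optional.

```python
import pandas as pd

# Hypothetical sample in a Movielens-like layout; column names are assumptions
df = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)'],
    'genres': ["Animation|Children's|Comedy",
               "Adventure|Children's|Fantasy",
               'Comedy|Romance'],
})

# Split each cell on '|' and build one 0/1 column per distinct genre
dummies = df['genres'].str.get_dummies('|')

# Attach the indicator columns and drop the original multi-value column
df = df.join(dummies).drop(columns='genres')
print(df)
```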
