Convert dataframe column string values into dummy variable columns
I have the following dataframe (the rest of the columns are not shown):
| customer_id | department |
| ----------- | ----------------------------- |
| 11 | ['nail', 'men_skincare'] |
| 23 | ['nail', 'fragrance'] |
| 25 | [] |
| 45 | ['skincare', 'men_fragrance'] |
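I am preprocessing my data to fit a model. I want to convert the department variable into dummy variables, one per unique department category (for however many unique departments there may be, not just the ones shown here).
I want to get this result: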
| customer_id | department | nail | men_skincare | fragrance | skincare | men_fragrance |
| ----------- | ---------- | ---- | ------------ | --------- | -------- | ------------- |
| 11 | ['nail', 'men_skincare'] | 1 | 1 | 0 | 0 | 0 |
| 23 | ['nail', 'fragrance'] | 1 | 0 | 1 | 0 | 0 |
| 25 | [] | 0 | 0 | 0 | 0 | 0 |
| 45 | ['skincare', 'men_fragrance'] | 0 | 0 | 0 | 1 | 1 |
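I tried this link, but when I splice the values, each one is treated as a string and I only get a column for every character in the string. This is what I used: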
df['1st'] = df['department'].str[0]
df['2nd'] = df['department'].str[1]
df['3rd'] = df['department'].str[2]
df['4th'] = df['department'].str[3]
df['5th'] = df['department'].str[4]
df['6th'] = df['department'].str[5]
df['7th'] = df['department'].str[6]
df['8th'] = df['department'].str[7]
df['9th'] = df['department'].str[8]
df['10th'] = df['department'].str[9]
Then I tried to split the string into a list with the following:
df['new_column'] = df['department'].apply(lambda x: x.split(","))
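Then I tried again, and it still only created a column for each character.
Any suggestions?
Edit: I found the answer using the link anky sent, specifically this answer: https://stackoverflow.com/a/29036042
What worked for me: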
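# Strip quotes, brackets and spaces, treating each pattern as a literal string
df['department'] = (df['department'].str.replace("'", '', regex=False).str.replace("]", '', regex=False)
                    .str.replace("[", '', regex=False).str.replace(' ', '', regex=False))
df['department'] = df['department'].apply(lambda x: x.split(","))
s = df['department']
# sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent
df1 = pd.get_dummies(s.apply(pd.Series).stack()).groupby(level=0).sum()
df = pd.merge(df, df1, right_index=True, left_index=True, how='left')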
You can do this with the explode() and fillna() methods:
import pandas as pd
data = df.explode('department').fillna('empty')
Now use the crosstab() method:
data=pd.crosstab(data['customer_id'],data['department'])
Since the concat() method gives you an error, use the merge() and drop() methods instead:
data=pd.merge(df.set_index('customer_id'),data,left_index=True,right_index=True).drop(columns=['empty'])
Now, if you print data, you will get the desired output.
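The steps above can be put together as a minimal end-to-end sketch. It assumes the department column holds real Python lists (not strings that look like lists); the sample frame and the names `exploded`, `counts` and `result` are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data mirroring the table in the question
df = pd.DataFrame({
    "customer_id": [11, 23, 25, 45],
    "department": [
        ["nail", "men_skincare"],
        ["nail", "fragrance"],
        [],
        ["skincare", "men_fragrance"],
    ],
})

# One row per (customer, department); an empty list becomes NaN,
# which is then replaced by the placeholder label 'empty'
exploded = df.explode("department").fillna("empty")

# Count how often each department occurs for each customer
counts = pd.crosstab(exploded["customer_id"], exploded["department"])

# Attach the counts to the original rows and drop the placeholder column
result = pd.merge(
    df.set_index("customer_id"), counts, left_index=True, right_index=True
).drop(columns=["empty"])

print(result)
```

The placeholder label matters: without it, a customer with an empty list would have no row in the crosstab and would disappear from the inner merge.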
Here is a fast binarization approach using sklearn's MultiLabelBinarizer, based on anky's link:
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Sample data: the labels carry literal quotes and the empty row is ['']
df = pd.DataFrame({'customer_id':{0:11,1:23,2:25,3:45}, 'department':{0:["'nail'","'men_skincare'"], 1:["'nail'","'fragrance'"], 2:[''], 3:["'skincare'","'men_fragrance'"]}})
mlb = MultiLabelBinarizer()
df = df.join(pd.DataFrame(
    mlb.fit_transform(df.department),
    columns=[c.strip("'") for c in mlb.classes_],  # strip the quotes from each label
    index=df.index,
)).drop(columns='')  # drop the column created by the empty-string label
# customer_id department fragrance men_fragrance men_skincare nail skincare
# 0 11 ['nail', 'men_skincare'] 0 0 1 1 0
# 1 23 ['nail', 'fragrance'] 1 0 0 1 0
# 2 25 [] 0 0 0 0 0
# 3 45 ['skincare', 'men_fragrance'] 0 1 0 0 1
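Note: this assumes the department column of your real data contains actual Python lists, not strings that look like lists. If they are actually strings (i.e. type(df.department[0]) returns str), you need to do this conversion first: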
df.department = df.department.str.strip('[]').str.split(r'\s*,\s*')
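As an alternative to the string splitting above, here is a minimal sketch that assumes every cell is a valid Python list literal. It uses `ast.literal_eval` from the standard library to parse each string into a real list, so the labels come out without their quotes and `'[]'` becomes an empty list rather than `['']`. The sample frame is made up for illustration.

```python
import ast

import pandas as pd

# Hypothetical data where the lists are stored as strings
df = pd.DataFrame({
    "customer_id": [11, 25],
    "department": ["['nail', 'men_skincare']", "[]"],
})

# Before the conversion each cell is a str, so .str[0] would return one character
assert isinstance(df["department"].iloc[0], str)

# Safely evaluate each string as a Python literal (no arbitrary code is executed)
df["department"] = df["department"].apply(ast.literal_eval)

print(df["department"].tolist())
```

If some cells are not well-formed literals, `ast.literal_eval` raises an exception, so dirty data would need to be cleaned or handled separately.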
Try:
# sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent
df.merge(
    pd.get_dummies(df.set_index('customer_id').explode('department'),
                   prefix='', prefix_sep='')
      .groupby(level=0).sum(),
    left_on='customer_id', right_index=True)
Output:
customer_id department fragrance men_fragrance men_skincare nail skincare
0 11 [nail, men_skincare] 0 0 1 1 0
1 23 [nail, fragrance] 1 0 0 1 0
2 25 [] 0 0 0 0 0
3 45 [skincare, men_fragrance] 0 1 0 0 1
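For comparison, a compact sketch that stays within pandas and uses `Series.str.get_dummies`. This is not taken from any of the answers above. It assumes the department column holds real Python lists and that no label contains the separator character `|`; the sample frame and the names `dummies` and `result` are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data mirroring the table in the question
df = pd.DataFrame({
    "customer_id": [11, 23, 25, 45],
    "department": [
        ["nail", "men_skincare"],
        ["nail", "fragrance"],
        [],
        ["skincare", "men_fragrance"],
    ],
})

# Join each list into one '|'-separated string, then let pandas build
# one indicator column per distinct label; an empty list yields all zeros
dummies = df["department"].str.join("|").str.get_dummies(sep="|")

# Put the indicator columns next to the original columns
result = df.join(dummies)

print(result)
```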