Make new pandas columns based on pipe-delimited column with possible repeats
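This question is a finer-grained follow-up to the solution of a previous question, "Create multiple new columns based on pipe-delimited column in pandas".
I have a pipe-delimited column that I want to convert into multiple new columns that count the occurrences of each element in every row's pipe string. I've been given a solution that works except for rows with an empty cell in the relevant column, where it leaves NaN/blanks instead of 0. Other than converting NaN to 0 after the fact, is there a way to augment the current solution?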
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.array([
[1202, 2007, 99.34,None],
[9321, 2009, 61.21,'12|34'],
[3832, 2012, 12.32,'12|12|34'],
[1723, 2017, 873.74,'28|13|51']]),
columns=['ID', 'YEAR', 'AMT','PARTS'])
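# str.get_dummies splits on '|' by default but only flags presence, so repeats are not counted
part_dummies = df1.PARTS.str.get_dummies().add_prefix('Part_')
# join_axes was removed in pandas 1.0; reindex on df1.index gives the same alignment
print(pd.concat([df1, part_dummies], axis=1).reindex(df1.index))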
# Expected Output:
# ID YEAR AMT PART_12 PART_34 PART_28 PART_13 PART_51
# 1202 2007 99.34 0 0 0 0 0
# 9321 2009 61.21 1 1 0 0 0
# 3832 2012 12.32 2 1 0 0 0
# 1723 2017 873.74 0 0 1 1 1
# Actual Output:
# ID YEAR AMT PART_12 PART_34 PART_28 PART_13 PART_51
# 1202 2007 99.34 0 0 0 0 0
# 9321 2009 61.21 1 1 0 0 0
# 3832 2012 12.32 1 1 0 0 0
# 1723 2017 873.74 0 0 1 1 1
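# sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent
part_dummies = pd.get_dummies(df1.PARTS.str.split('|', expand=True).stack()).groupby(level=0).sum().add_prefix('Part_')
# join_axes was removed in pandas 1.0; reindex on df1.index gives the same alignment
print(pd.concat([df1, part_dummies], axis=1).reindex(df1.index))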
# ID YEAR AMT PART_12 PART_13 PART_28 PART_34 PART_51
# 1202 2007 99.34 NaN NaN NaN NaN NaN
# 9321 2009 61.21 1 0 0 1 0
# 3832 2012 12.32 2 0 0 1 0
# 1723 2017 873.74 0 1 1 0 1
stack is dropping the NaNs. Using dropna=False will resolve this:
pd.get_dummies(df1.set_index(['ID','YEAR','AMT']).PARTS.str.split('|', expand=True)\
                  .stack(dropna=False), prefix='Part')\
  .groupby(level=0, sort=False).sum()  # replaces sum(level=0), removed in pandas 2.0
Output:
Part_12 Part_13 Part_28 Part_34 Part_51
ID
1202 0 0 0 0 0
9321 1 0 0 1 0
3832 2 0 0 1 0
1723 0 1 1 0 1
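The reason keeping the NaN rows is enough: pd.get_dummies skips missing values by default (dummy_na=False), so a row that holds only NaN survives into the result with zeros in every indicator column. A minimal sketch on a toy Series (the values are illustrative, not the question's data):

```python
import pandas as pd

# Toy series: the second entry is missing, like the empty PARTS cell.
s = pd.Series(['12', None, '34', '12'])

# With the default dummy_na=False a missing value gets no indicator column,
# so its row contains only zeros. astype(int) normalizes the result, since
# the indicator dtype differs between pandas versions.
dummies = pd.get_dummies(s).astype(int)
print(dummies)
```

Summing such rows per original index is what turns the empty cell into a row of 0 counts rather than NaN.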
You can use sklearn.feature_extraction.text.CountVectorizer:
In [22]: from sklearn.feature_extraction.text import CountVectorizer
In [23]: cv = CountVectorizer()
In [24]: t = pd.DataFrame(cv.fit_transform(df1.PARTS.fillna('').str.replace('|', ' ', regex=False)).toarray(),
    ...:                  columns=cv.get_feature_names_out(),
    ...:                  index=df1.index).add_prefix('PART_')
...:
In [25]: df1 = df1.join(t)
In [26]: df1
Out[26]:
ID YEAR AMT PARTS PART_12 PART_13 PART_28 PART_34 PART_51
0 1202 2007 99.34 None 0 0 0 0 0
1 9321 2009 61.21 12|34 1 0 0 1 0
2 3832 2012 12.32 12|12|34 2 0 0 1 0
3 1723 2017 873.74 28|13|51 0 1 1 0 1
This expanded version should also work; in addition, it keeps the original columns.
In [728]: import numpy as np
   .....: import pandas as pd
# DataFrame built from Mike's data above:
In [729]: df = pd.DataFrame(np.array([
.....: [1202, 2007, 99.34,None],
.....: [9321, 2009, 61.21,'12|34'],
.....: [3832, 2012, 12.32,'12|12|34'],
.....: [1723, 2017, 873.74,'28|13|51']]),
.....: columns=['ID', 'YEAR', 'AMT','PARTS'])
# quick glimpse of dataframe
In [730]: df
Out[730]:
ID YEAR AMT PARTS
0 1202 2007 99.34 None
1 9321 2009 61.21 12|34
2 3832 2012 12.32 12|12|34
3 1723 2017 873.74 28|13|51
# expand string based on delimiter ("|")
In [731]: expand_str = df["PARTS"].str.split('|', expand=True)
# generate dummies df:
In [732]: dummies_df = pd.get_dummies(expand_str.stack(dropna=False)).groupby(level=0).sum().add_prefix("Part_")
# concatenate the original df with dummies_df:
In [733]: pd.concat([df, dummies_df], axis=1)
Out[733]:
ID YEAR AMT PARTS Part_12 Part_13 Part_28 Part_34 Part_51
0 1202 2007 99.34 None 0 0 0 0 0
1 9321 2009 61.21 12|34 1 0 0 1 0
2 3832 2012 12.32 12|12|34 2 0 0 1 0
3 1723 2017 873.74 28|13|51 0 1 1 0 1
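If stack(dropna=False) is not available in your pandas version, the same idea can be sketched with Series.explode, which keeps a missing cell as a single NaN row. This is a swapped-in technique, not one proposed in the answers above; the variable names are illustrative and the DataFrame is rebuilt from a dict only to keep the example self-contained.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1202, 9321, 3832, 1723],
    'YEAR': [2007, 2009, 2012, 2017],
    'AMT': [99.34, 61.21, 12.32, 873.74],
    'PARTS': [None, '12|34', '12|12|34', '28|13|51'],
})

# Split each cell into a list, then give every element its own row.
# The missing cell stays as one NaN row with its original index.
exploded = df['PARTS'].str.split('|').explode()

# get_dummies ignores NaN, so that row is all zeros; summing per original
# index turns the indicators into counts.
counts = (pd.get_dummies(exploded, prefix='Part')
            .groupby(level=0).sum()
            .astype(int))

result = pd.concat([df, counts], axis=1)
print(result)
```

Because the empty cell never leaves the index, no NaN-to-0 conversion is needed after the concatenation.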