Creating missing-value dummy indicator variables for a list of variables in a dataframe with specific values in Python (pandas)
I have a large dataset in pandas. For brevity, let's say I have the following:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [101, 101, 101, 201, 201, 201, np.nan],
                   'col2': [123, 123, 124, np.nan, 321, 321, 456],
                   'col3': ['a', 0.7, 0.6, 1.01, 2, 1, 2],
                   'col4': ['w', 0.2, 'b', 0.7, 'z', 2, 3],
                   'col5': [21, 'z', 0.3, 2.3, 0.8, 'z', 1.001],
                   'col6': [11.3, 202.0, 0.2, 0.3, 41.0, 47, 2],
                   'col7': ['A', 'B', 'C', 'D', 'E', 'F', 'G']})
Now I want to create categorical variables with the suffix _missing such that, for any column in the dataset that contains missing values (NaN), a new column (variable) is created that has the value 1 for NaN values and 0 otherwise. For example, for col1 and col2, the corresponding variables will be col1_missing and col2_missing.
Then, for columns like col3 that contain letters in a column that is supposed to be numeric, I would like a result similar to the one described above, but with the number of category levels increasing with the number of distinct letters. For example, the new column corresponding to col4 will be col4_missing and will contain 0 for non-letters, 1 for b, 2 for w and 3 for z. So the resulting frame should look as below:
Is there any Python function or package to do this? As a newbie, I am honestly overwhelmed by this and would be grateful for any help.
You can map the values from a dictionary:
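def flag(s):
    # codes for the letters to flag; any other value maps to 0
    flags = {'b': 1, 'w': 2, 'z': 3}
    # fill NaN with 'b' so that missing values receive code 1
    return s.fillna('b').map(lambda x: flags.get(x, 0))

# build the indicator columns, append them, then sort the columns by name
out = (pd
   .concat([df, df.apply(flag).add_suffix('_missing')], axis=1)
   .sort_index(axis=1)
)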
Output:
    col1  col1_missing   col2  col2_missing  col3  col3_missing  col4  col4_missing   col5  col5_missing   col6  col6_missing  col7  col7_missing
0  101.0             0  123.0             0     a             0     w             2     21             0   11.3             0     A             0
1  101.0             0  123.0             0   0.7             0   0.2             0      z             3  202.0             0     B             0
2  101.0             0  124.0             0   0.6             0     b             1    0.3             0    0.2             0     C             0
3  201.0             0    NaN             1  1.01             0   0.7             0    2.3             0    0.3             0     D             0
4  201.0             0  321.0             0     2             0     z             3    0.8             0   41.0             0     E             0
5  201.0             0  321.0             0     1             0     2             0      z             3   47.0             0     F             0
6    NaN             1  456.0             0     2             0     3             0  1.001             0    2.0             0     G             0
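Note that the fillna('b') step works only because 'b' happens to map to 1, which is also the code wanted for NaN. A minimal sketch of a more explicit variant follows, assuming the same letter codes as above; the name flag_explicit and the small demo frame are hypothetical and not part of the original answer.

```python
import numpy as np
import pandas as pd

def flag_explicit(s, flags=None):
    """Return 1 for NaN, the dictionary code for listed letters, 0 otherwise."""
    if flags is None:
        flags = {'b': 1, 'w': 2, 'z': 3}
    # Test for NaN first so missing values do not rely on a placeholder letter
    return s.map(lambda x: 1 if pd.isna(x) else flags.get(x, 0))

# Small illustrative frame (not the full df from the question)
demo = pd.DataFrame({'col1': [101, np.nan], 'col4': ['w', 0.2]})
demo_flags = demo.apply(flag_explicit).add_suffix('_missing')
```

With this variant the NaN code can be changed independently of the letter codes, which may be easier to maintain if the dictionary is edited later.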
To keep only the indicator columns that contain at least one flag:

def flag(s):
    flags = {'b': 1, 'w': 2, 'z': 3}
    return s.fillna('b').map(lambda x: flags.get(x, 0))

# flag values
df2 = df.apply(flag).add_suffix('_missing')

# keep only columns with at least one flag
df2 = df2.loc[:, df2.ne(0).any()]

out = (pd
   .concat([df, df2], axis=1)
   .sort_index(axis=1)
)
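The dictionary in flag is hard-coded, so a letter that is not listed (such as 'a' in col3) is flagged 0. If the codes should instead grow with the distinct letters found in each column, as the question describes, one possible sketch is shown here. It assumes that the caller supplies the list of columns that are supposed to be numeric, that letters are coded 1, 2, 3, ... in sorted order within each column, and that NaN is coded 1. The names flag_auto and numeric_cols are hypothetical.

```python
import numpy as np
import pandas as pd

def flag_auto(s):
    """Code NaN as 1 and each distinct string as 1, 2, 3, ... in sorted order."""
    letters = sorted({x for x in s if isinstance(x, str)})
    codes = {letter: i for i, letter in enumerate(letters, start=1)}
    # Caveat: NaN and the first letter share code 1 if a column contains both
    return s.map(lambda x: 1 if pd.isna(x) else codes.get(x, 0))

# Small illustrative frame (not the full df from the question)
demo = pd.DataFrame({
    'col2': [123, np.nan, 321],
    'col4': ['w', 0.2, 'b'],
    'col7': ['A', 'B', 'C'],   # genuinely textual column, excluded below
})
numeric_cols = ['col2', 'col4']  # assumption: chosen by the caller

auto_flags = demo[numeric_cols].apply(flag_auto).add_suffix('_missing')
auto_out = pd.concat([demo, auto_flags], axis=1).sort_index(axis=1)
```

Restricting the call to numeric_cols keeps a true text column such as col7 from being coded letter by letter.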