Create missing dummy indicator variables with specific values for a list of variables in a dataframe in Python (pandas)

Question

I have a large dataset in pandas. For brevity, let's say I have the following:

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [101, 101, 101, 201, 201, 201, np.nan],
                   'col2': [123, 123, 124, np.nan, 321, 321, 456],
                   'col3': ['a', 0.7, 0.6, 1.01, 2, 1, 2],
                   'col4': ['w', 0.2, 'b', 0.7, 'z', 2, 3],
                   'col5': [21, 'z', 0.3, 2.3, 0.8, 'z', 1.001],
                   'col6': [11.3, 202.0, 0.2, 0.3, 41.0, 47, 2],
                   'col7': ['A', 'B', 'C', 'D', 'E', 'F', 'G']})

Now I want to create categorical variables with the suffix _missing, such that for any column in the dataset that contains missing values (NaN), a new column (variable) is created that has the value 1 for NaN entries and 0 otherwise. For example, for col1 and col2, the corresponding variables will be col1_missing and col2_missing.

Then, for columns like col3 that contain letters even though they are supposed to be numeric, I would like a result similar to the one described above, but with the number of category levels increasing with the number of distinct letters. For example, the new column corresponding to col4 will be col4_missing and will contain 0 for non-letters, 1 for b, 2 for w and 3 for z.

Is there any Python function or package to do this? As a newbie, I am honestly overwhelmed by this, and I would be grateful for any help.

Answer 1

You can map the values from a dictionary:

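def flag(s):
    # letters to flag and their codes; every other value maps to 0
    flags = {'b': 1, 'w': 2, 'z': 3}
    # NaN is filled with 'b' first, so missing values also receive code 1
    return s.fillna('b').map(lambda x: flags.get(x, 0))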

out = (pd
 .concat([df, df.apply(flag).add_suffix('_missing')], axis=1)
 .sort_index(axis=1)
 )

Output:

    col1  col1_missing   col2  col2_missing  col3  col3_missing col4  col4_missing   col5  col5_missing   col6  col6_missing col7  col7_missing
0  101.0             0  123.0             0     a             0    w             2     21             0   11.3             0    A             0
1  101.0             0  123.0             0   0.7             0  0.2             0      z             3  202.0             0    B             0
2  101.0             0  124.0             0   0.6             0    b             1    0.3             0    0.2             0    C             0
3  201.0             0    NaN             1  1.01             0  0.7             0    2.3             0    0.3             0    D             0
4  201.0             0  321.0             0     2             0    z             3    0.8             0   41.0             0    E             0
5  201.0             0  321.0             0     1             0    2             0      z             3   47.0             0    F             0
6    NaN             1  456.0             0     2             0    3             0  1.001             0    2.0             0    G             0
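
In the function above, NaN receives code 1 only because it is first replaced by 'b', so a missing value and a literal b are indistinguishable in the result. The following is a minimal sketch of a variant that keeps the two steps separate. The name flag_explicit and the nan_code parameter are hypothetical, not part of the original answer, and the frame is a shortened stand-in for the one in the question.

```python
import numpy as np
import pandas as pd

FLAGS = {'b': 1, 'w': 2, 'z': 3}  # same codes as the dictionary in the answer

def flag_explicit(s, flags=FLAGS, nan_code=1):
    """Hypothetical variant of flag(): code the letters, then mark NaN separately."""
    # values not in flags (numbers, NaN, other strings) map to 0
    coded = s.map(lambda x: flags.get(x, 0))
    # overwrite the positions that were NaN in the original column
    return coded.mask(s.isna(), nan_code)

# shortened example frame (assumption: same layout as the question's df)
df = pd.DataFrame({'col1': [101, 101, np.nan],
                   'col4': ['w', 0.2, 'b']})

flags_df = df.apply(flag_explicit).add_suffix('_missing')
print(flags_df)
```

With nan_code=1 this should give the same flags as the dictionary approach. Choosing a different nan_code, for example 9, would keep missing values apart from the letter b.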

To keep only the indicator columns with at least one non-zero value:

def flag(s):
    flags = {'b': 1, 'w': 2, 'z': 3}
    return s.fillna('b').map(lambda x: flags.get(x, 0))

# flag values 
df2 = df.apply(flag).add_suffix('_missing')

# keep only columns with at least one flag
df2 = df2.loc[:, df2.ne(0).any()]

out = (pd
 .concat([df, df2], axis=1)
 .sort_index(axis=1)
 )
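
The dictionary in flag is hard-coded to b, w and z. If the letters are not known in advance, one possible approach is to derive the codes from each column. The sketch below rests on assumptions that the question leaves open: levels are numbered per column in sorted order, NaN is coded 1 as in the answer (so it shares a code with the first letter), and only a user-supplied list of columns that are supposed to be numeric is processed. The names letter_levels and numeric_cols are hypothetical.

```python
import numpy as np
import pandas as pd

def letter_levels(s):
    """Hypothetical helper: 0 for numbers, 1 for NaN,
    1..k for the column's distinct strings in sorted order."""
    strings = sorted({x for x in s if isinstance(x, str)})
    codes = {letter: i for i, letter in enumerate(strings, start=1)}
    coded = s.map(lambda x: codes[x] if isinstance(x, str) else 0)
    return coded.mask(s.isna(), 1)

# subset of the question's frame, kept short for illustration
df = pd.DataFrame({'col1': [101, 101, 101, 201, 201, 201, np.nan],
                   'col4': ['w', 0.2, 'b', 0.7, 'z', 2, 3],
                   'col6': [11.3, 202.0, 0.2, 0.3, 41.0, 47, 2],
                   'col7': ['A', 'B', 'C', 'D', 'E', 'F', 'G']})

# assumption: the columns meant to be numeric; col7 is a genuine text column
numeric_cols = ['col1', 'col4', 'col6']

ind = df[numeric_cols].apply(letter_levels).add_suffix('_missing')
ind = ind.loc[:, ind.ne(0).any()]  # drop indicators that are all zero
out = pd.concat([df, ind], axis=1).sort_index(axis=1)
print(out)
```

Note that per-column numbering differs from the fixed dictionary: a column whose only letter is z would get code 1 for z here, not 3.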

Answer 1
0 2022-09-24 03:29:27
