
Creating missing dummy indicator variables for a list of variables in a dataframe with specific values in Python (pandas)

I have a large dataset in pandas. For brevity, let's say I have the following:

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [101, 101, 101, 201, 201, 201, np.nan],
                   'col2': [123, 123, 124, np.nan, 321, 321, 456],
                   'col3': ['a', 0.7, 0.6, 1.01, 2, 1, 2],
                   'col4': ['w', 0.2, 'b', 0.7, 'z', 2, 3],
                   'col5': [21, 'z', 0.3, 2.3, 0.8, 'z', 1.001],
                   'col6': [11.3, 202.0, 0.2, 0.3, 41.0, 47, 2],
                   'col7': ['A', 'B', 'C', 'D', 'E', 'F', 'G']})

[Image: initial data]

Now I want to create categorical variables with the suffix _missing, such that for any column in the dataset that contains missing values (NaN), a new column (variable) is created that has the value 1 for NaN values and 0 otherwise. For example, for col1 and col2, the corresponding variables will be col1_missing and col2_missing.

Then, for columns like col3 that contain letters in a column that is supposed to be numeric, I would like a result similar to the one described above, but with the number of category levels increasing with the number of distinct letters. For example, the new column corresponding to col4 will be col4_missing and will contain 0 for non-letters, 1 for b, 2 for w and 3 for z. So the resulting frame should look as below:

[Image: resulting dataframe]

Is there any Python function or package to do this? As a newbie, I am honestly overwhelmed by this, and I would be grateful for any help.

You can map the values from a dictionary:

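def flag(s):
    # codes for the letters that appear in the supposedly numeric columns
    flags = {'b': 1, 'w': 2, 'z': 3}
    # NaN is filled with 'b' so that it also maps to 1; everything else maps to 0
    return s.fillna('b').map(lambda x: flags.get(x, 0))

# one *_missing column per original column, interleaved by sorting the names
out = (
    pd.concat([df, df.apply(flag).add_suffix('_missing')], axis=1)
      .sort_index(axis=1)
)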

Output:

    col1  col1_missing   col2  col2_missing  col3  col3_missing col4  col4_missing   col5  col5_missing   col6  col6_missing col7  col7_missing
0  101.0             0  123.0             0     a             0    w             2     21             0   11.3             0    A             0
1  101.0             0  123.0             0   0.7             0  0.2             0      z             3  202.0             0    B             0
2  101.0             0  124.0             0   0.6             0    b             1    0.3             0    0.2             0    C             0
3  201.0             0    NaN             1  1.01             0  0.7             0    2.3             0    0.3             0    D             0
4  201.0             0  321.0             0     2             0    z             3    0.8             0   41.0             0    E             0
5  201.0             0  321.0             0     1             0    2             0      z             3   47.0             0    F             0
6    NaN             1  456.0             0     2             0    3             0  1.001             0    2.0             0    G             0
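The function above codes NaN as 1 by first filling it with 'b', so NaN and 'b' share a code. As a minimal sketch of the same mapping with the NaN step made explicit, the variant below uses isna() instead. The name flag_explicit and the nan_code parameter are hypothetical, not part of the original answer.

```python
import numpy as np
import pandas as pd

def flag_explicit(s, flags=None, nan_code=1):
    """Hypothetical variant of flag(): letters via a dict, NaN via isna()."""
    if flags is None:
        flags = {'b': 1, 'w': 2, 'z': 3}
    # letters found in the dict get their code, everything else gets 0
    codes = s.map(lambda x: flags.get(x, 0))
    # overwrite the positions that were NaN in the original column
    return codes.mask(s.isna(), nan_code).astype(int)

demo = pd.DataFrame({'col1': [101, np.nan, 201],
                     'col4': ['w', 0.2, 'z']})
demo_out = (
    pd.concat([demo, demo.apply(flag_explicit).add_suffix('_missing')], axis=1)
      .sort_index(axis=1)
)
print(demo_out)
```

With the default nan_code=1 this should give the same codes as the original function; passing a different nan_code keeps missing values distinguishable from 'b'.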

To keep only the indicator columns that contain at least one non-zero value:

def flag(s):
    flags = {'b': 1, 'w': 2, 'z': 3}
    return s.fillna('b').map(lambda x: flags.get(x, 0))

# flag values
df2 = df.apply(flag).add_suffix('_missing')

# keep only columns with at least one flag
df2 = df2.loc[:, df2.ne(0).any()]

out = (
    pd.concat([df, df2], axis=1)
      .sort_index(axis=1)
)
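The dictionary in the answer is fixed, so a letter that is not listed (such as 'a' in col3) is coded as 0. If the codes should instead grow with the distinct letters actually present in each column, which is one possible reading of the question, a rough sketch could look like the following. The name flag_auto is hypothetical, and the per-column alphabetical numbering is an assumption rather than something the question or the answer confirms.

```python
import numpy as np
import pandas as pd

def flag_auto(s):
    """Hypothetical sketch: number the distinct strings in a column 1..n."""
    # collect the distinct string values and number them alphabetically
    letters = sorted({x for x in s if isinstance(x, str)})
    codes = {letter: i for i, letter in enumerate(letters, start=1)}
    # strings get their code, every other value gets 0
    out = s.map(lambda x: codes.get(x, 0) if isinstance(x, str) else 0)
    # NaN is coded as 1, as in the question (this can collide with a letter)
    return out.mask(s.isna(), 1).astype(int)

col4 = pd.Series(['w', 0.2, 'b', 0.7, 'z', 2, 3])
print(flag_auto(col4).tolist())  # [2, 0, 1, 0, 3, 0, 0]
```

Because the numbering is per column, a column whose only letter is 'z' would code it as 1 here, whereas the fixed dictionary codes it as 3. Which behaviour is wanted depends on how the indicators will be used.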
