
Python: combine boolean columns in Pandas dataframes

I have the following data

attr1_A    attr1_B    attr1_C    attr1_D    attr2_A    attr2_B   attr2_C
      1          0          0          1          1          0         0
      0          1          1          0          0          0         1
      0          0          0          0          0          1         0
      1          1          1          0          1          1         0

I want to retain attr1_A , attr1_B and combine attr1_C and attr1_D into attr1_others . As long as attr1_C and/or attr1_D is 1, then attr1_others will be 1. Similarly, I want to keep attr2_A but combine the remaining attr2_* into attr2_others . Like this:

attr1_A    attr1_B    attr1_others    attr2_A    attr2_others
      1          0          1               1               0     
      0          1          1               0               1  
      0          0          0               0               1 
      1          1          1               1               1 

In other words, for any group of attr , I want to retain a few known columns and combine the rest. I don't know in advance how many other attr columns each group has.

I am thinking of doing each group separately: processing all attr1_* , and then attr2_* because there are a limited number of groups in my dataset, but many attr under each group.

What I can think of right now is to retrieve the other columns like this:

# for group 1
df[[x for x in df.columns if "A" not in x and "B" not in x and "attr1_" in x]]

# for group 2
df[[x for x in df.columns if "A" not in x and "attr2_" in x]]

To combine them, I am thinking of using the any function, but I can't come up with the syntax. Could you help?

Updated attempt :

I tried this

# for group 1
df['attr1_others'] = df[df[[x for x in list(df.columns) 
                            if "attr1_" in x
                            and "A" not in x 
                            and "B" not in x]].any(axis = 'column')]

but got the below error:

ValueError: No axis named column for object type <class 'pandas.core.frame.DataFrame'>

Dataframes have the great ability to manipulate data in place, without having to write complex python logic.

To create your attr1_others and attr2_others columns, you can combine the columns with or conditions using this:

df['attr1_others'] = df['attr1_C'] | df['attr1_D']
df['attr2_others'] = df['attr2_B'] | df['attr2_C']

If instead, you wanted an and condition, you could use:

df['attr1_others'] = df['attr1_C'] & df['attr1_D']
df['attr2_others'] = df['attr2_B'] & df['attr2_C']

You can then delete the lingering original values using del :

del df['attr1_C']
del df['attr1_D']
del df['attr2_B']
del df['attr2_C']
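
The question says the number of remaining columns is not known in advance, so the pairwise | above may need to be folded over a list of columns instead. The following is a minimal sketch of that idea, assuming the 0/1 integer columns from the question; the name other_cols and the shortened sample frame are only illustrative.

```python
from functools import reduce
import operator

import pandas as pd

# Shortened sample data (group 1 only), copied from the question.
df = pd.DataFrame({
    'attr1_A': [1, 0, 0, 1],
    'attr1_B': [0, 1, 0, 1],
    'attr1_C': [0, 1, 0, 1],
    'attr1_D': [1, 0, 0, 0],
})

# Illustrative selection: every attr1_* column except the ones to keep.
other_cols = [c for c in df.columns
              if c.startswith('attr1_') and c not in ('attr1_A', 'attr1_B')]

# Fold element-wise OR across however many columns were selected.
df['attr1_others'] = reduce(operator.or_, (df[c] for c in other_cols))

# Drop the originals in one call instead of one del per column.
df = df.drop(columns=other_cols)
print(df)
```

Note that reduce raises a TypeError if other_cols is empty, so a group with nothing left to combine would need separate handling.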

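Create a list of the columns to keep. Drop them and assign the remaining columns to a new dataframe df1 . Group the columns of df1 by the part of each name before the underscore, call any within each group, add the suffix '_others' and assign the result to df2 . Finally, join df2 back to the kept columns and sort the columns by name with sort_index .

keep_cols = ['attr1_A', 'attr1_B', 'attr2_A']
df1 = df.drop(columns=keep_cols)
# Transpose so the column groups become row groups; groupby(..., axis=1)
# is deprecated in recent pandas versions.
df2 = (df1.T.groupby(df1.columns.str.split('_').str[0])
          .any().T.add_suffix('_others').astype(int))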

Out[512]:
   attr1_others  attr2_others
0             1             0
1             1             1
2             0             1
3             1             1

df_final = df[keep_cols].join(df2).sort_index(axis=1)

Out[514]:
   attr1_A  attr1_B  attr1_others  attr2_A  attr2_others
0        1        0             1        1             0
1        0        1             1        0             1
2        0        0             0        0             1
3        1        1             1        1             1
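
It may help to see what the grouping key in this answer looks like on its own. The sketch below rebuilds only the leftover column names from the question and shows which group label each one receives; the variable names are illustrative.

```python
import pandas as pd

# Column names left over after dropping the kept columns.
cols = pd.Index(['attr1_C', 'attr1_D', 'attr2_B', 'attr2_C'])

# The text before the first underscore becomes the group label.
group_key = cols.str.split('_').str[0]

# Pair each column with the group it will be combined into.
mapping = dict(zip(cols, group_key))
print(mapping)
# {'attr1_C': 'attr1', 'attr1_D': 'attr1', 'attr2_B': 'attr2', 'attr2_C': 'attr2'}
```

Because the label is taken from the name alone, this approach relies on every column following the group_name pattern with a single meaningful prefix before the first underscore.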

You can use a custom list to select the columns, and then call .any() with the axis=1 parameter. To convert the result to integer, use .astype(int) .

For example:

import pandas as pd

df = pd.DataFrame({
        'attr1_A': [1, 0, 0, 1],
        'attr1_B': [0, 1, 0, 1],
        'attr1_C': [0, 1, 0, 1],
        'attr1_D': [1, 0, 0, 0],
        'attr2_A': [1, 0, 0, 1],
        'attr2_B': [0, 0, 1, 1],
        'attr2_C': [0, 1, 0, 0]})

cols = [col for col in df.columns.values if col.startswith('attr1') and col.split('_')[1] not in ('A', 'B')]
df['attr1_others'] = df[cols].any(axis=1).astype(int)
df.drop(cols, axis=1, inplace=True)

cols = [col for col in df.columns.values if col.startswith('attr2') and col.split('_')[1] not in ('A', )]
df['attr2_others'] = df[cols].any(axis=1).astype(int)
df.drop(cols, axis=1, inplace=True)

print(df)

Prints:

   attr1_A  attr1_B  attr2_A  attr1_others  attr2_others
0        1        0        1             1             0
1        0        1        0             1             1
2        0        0        0             0             1
3        1        1        1             1             1
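
As for the ValueError in the question: the axis name has to be 1 or 'columns' (plural), and the extra outer df[...] around the .any() call is not needed, since it would treat the boolean result as a row filter. The following sketch wraps the corrected expression in a helper; combine_others and its parameters are hypothetical names, not part of pandas.

```python
import pandas as pd


def combine_others(df, prefix, keep):
    """Hypothetical helper: OR together every `prefix` column not in `keep`."""
    others = [c for c in df.columns
              if c.startswith(prefix) and c not in keep]
    out = df.drop(columns=others)
    # axis must be 1 or 'columns'; 'column' raises the ValueError from the question.
    out[prefix + 'others'] = df[others].any(axis='columns').astype(int)
    return out


df = pd.DataFrame({
    'attr1_A': [1, 0, 0, 1],
    'attr1_B': [0, 1, 0, 1],
    'attr1_C': [0, 1, 0, 1],
    'attr1_D': [1, 0, 0, 0],
    'attr2_A': [1, 0, 0, 1],
    'attr2_B': [0, 0, 1, 1],
    'attr2_C': [0, 1, 0, 0]})

# One call per group, each with its own list of columns to keep.
result = combine_others(df, 'attr1_', ['attr1_A', 'attr1_B'])
result = combine_others(result, 'attr2_', ['attr2_A'])
print(result)
```

Matching on the full column name (c not in keep) avoids the substring checks such as "A" not in x from the question, which would also exclude any column whose name happens to contain that letter elsewhere.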


 