
How to sum values of multiple columns against multiple columns of the same dataframe?

I have a dataframe of surgeries and their complications, stored as True and False values. I need to know how many times each complication occurs with each surgery. Each row represents a patient. The dataframe looks like this:

surgery_1  surgery_2  surgery_3  complication_1  complication_2  complication_3
True       False      True       True            True            False
False      False      False      False           False           False
True       False      False      True            False           True

I want to have a dataframe like this:

           complication_1    complication_2     complication_3
surgery_1       1                  1                   0
surgery_2       0                  0                   0
surgery_3       1                  0                   1

I tried df.pivot_table and df.groupby, but neither gave me what I need. Note that I'm not interested in how many surgeries there are. I just need to know how many times each complication occurs with each surgery.

If I understand correctly, each row represents an operation performed on a patient, and several surgeries might be performed in one operation.

Step 1 is to unpivot the DataFrame so that each row represents a single surgery, as this is going to be the key of the new DataFrame:

In [58]: df2 = pd.wide_to_long(df, 'surgery_', ['complication_1', 'complication_2', 'complication_3'], 'surgery_id').reset_index()
Out[58]: 
   complication_1  complication_2  complication_3  surgery_id  surgery_
0            True            True           False           1      True
1            True            True           False           2     False
2            True            True           False           3      True
3           False           False           False           1     False
4           False           False           False           2     False
5           False           False           False           3     False
6            True           False            True           1      True
7            True           False            True           2     False
8            True           False            True           3     False
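One caveat with the call above: pd.wide_to_long raises a ValueError when the id columns passed as i do not uniquely identify each row. The sample data happens to have three distinct complication patterns, but two patients with identical complications would break it. A minimal sketch of one way around this, assuming it is acceptable to add an explicit patient identifier (the name patient_id is hypothetical and not part of the original data):

```python
import pandas as pd

# Sample data from the question (column names normalised to lower case).
df = pd.DataFrame({
    "surgery_1": [True, False, True],
    "surgery_2": [False, False, False],
    "surgery_3": [True, False, False],
    "complication_1": [True, False, True],
    "complication_2": [True, False, False],
    "complication_3": [False, False, True],
})

# Hypothetical 'patient_id' column: turns the row index into a unique id,
# so the reshape no longer depends on complication patterns being distinct.
wide = df.rename_axis("patient_id").reset_index()

# One row per (patient, surgery); 'surgery_' holds whether it was performed.
long_df = pd.wide_to_long(
    wide, stubnames="surgery_", i="patient_id", j="surgery_id"
).reset_index()

print(long_df)
```

The complication columns are carried along unchanged, so the filtering and grouping steps work the same way on this frame.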

Now you have a row for each surgery and each patient. However, not all surgeries are performed on all patients; whether a surgery was performed is given by the remaining value column 'surgery_'. Step 2 is to filter so that we are left with only the rows where a surgery was actually performed:

In [64]: df3 = df2.query('surgery_ == True').drop('surgery_', axis=1)
Out[64]: 
   complication_1  complication_2  complication_3  surgery_id
0            True            True           False           1
2            True            True           False           3
6            True           False            True           1

Step 3 is then straightforward: groupby, sum, and reindex (the reindex is needed because there is no entry left for surgery 2):

In [67]: df3.groupby('surgery_id').sum().reindex([1,2,3], fill_value=0)
Out[67]: 
            complication_1  complication_2  complication_3
surgery_id                                                
1                        2               1               1
2                        0               0               0
3                        1               1               0
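The three steps can also be sketched end to end as a single script. This version swaps pd.wide_to_long for DataFrame.melt, which is a different reshaping call from the one used above; it keeps the full column name as the surgery label. The variable names are illustrative only.

```python
import pandas as pd

# Sample data from the question (column names normalised to lower case).
df = pd.DataFrame({
    "surgery_1": [True, False, True],
    "surgery_2": [False, False, False],
    "surgery_3": [True, False, False],
    "complication_1": [True, False, True],
    "complication_2": [True, False, False],
    "complication_3": [False, False, True],
})

surgery_cols = [c for c in df.columns if c.startswith("surgery_")]
complication_cols = [c for c in df.columns if c.startswith("complication_")]

# Step 1: unpivot so that each row is one (patient, surgery) pair.
long_df = df.melt(
    id_vars=complication_cols,
    value_vars=surgery_cols,
    var_name="surgery",
    value_name="performed",
)

# Step 2: keep only the surgeries that were actually performed.
performed = long_df[long_df["performed"]].drop(columns="performed")

# Step 3: count complications per surgery; reindex restores surgeries
# that were never performed (here surgery_2) as rows of zeros.
result = (
    performed.groupby("surgery")[complication_cols]
    .sum()
    .astype(int)
    .reindex(surgery_cols, fill_value=0)
)

print(result)
```

On the sample data this yields 2, 1, 1 for surgery_1, zeros for surgery_2, and 1, 1, 0 for surgery_3, matching the table above.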

This differs significantly from your desired output, but frankly I have no idea what you could want other than this ;)
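As a cross-check, the same co-occurrence counts can be computed without any reshaping, by treating the True/False values as 0/1 and multiplying the transposed surgery columns by the complication columns. This is a different technique from the groupby approach, offered only as a sketch; it assumes the column names share the prefixes surgery_ and complication_.

```python
import pandas as pd

# Sample data from the question (column names normalised to lower case).
df = pd.DataFrame({
    "surgery_1": [True, False, True],
    "surgery_2": [False, False, False],
    "surgery_3": [True, False, False],
    "complication_1": [True, False, True],
    "complication_2": [True, False, False],
    "complication_3": [False, False, True],
})

# Convert booleans to 0/1 so the matrix product counts co-occurrences.
surgeries = df.filter(like="surgery_").astype(int)
complications = df.filter(like="complication_").astype(int)

# Entry (s, c) is the number of patients who had surgery s and complication c.
counts = surgeries.T @ complications

print(counts)
```

Surgeries that were never performed come out as rows of zeros automatically, so no reindex is needed here.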
