
Divide several columns in a Python dataframe where both the numerator and denominator columns vary based on a picklist

I'm creating a dataframe by paring down a very large dataframe (approximately 400 columns) based on the choices an end user makes in a picklist. One of the picklist choices is the type of denominator the end user would like. Here is an example table with all the information before the final calculation is made.

                    county  _tcount  _tvote  _f_npb_18_count  _f_npb_18_vote  _f_npb_30_count  _f_npb_30_vote
countycode
35              San Benito    28194   22335             2677            1741              384             288
36          San Bernardino   912653  661838           108724           61832            76749           53013

However, I am having trouble writing code that will automatically divide every second column, starting with the 5th (not counting the index), by the column before it. I've seen examples (Divide multiple columns by another column in pandas), but they all use fixed column names, which is not feasible here. I've been able to divide variable columns (selected by position) by fixed columns, but not variable columns by other variable columns selected by position. I tried modifying the code from the above link to use column positions:

calculated_frame = [county_select_frame[county_select_frame.columns[5: : 2]].div(county_select_frame[4: :2], axis=0)]

output:

[           county  _tcount  _tvote  _f_npb_18_count  _f_npb_18_vote  \
countycode                                                         
35            NaN      NaN     NaN              NaN             NaN
36            NaN      NaN     NaN              NaN             NaN]

RuntimeWarning: invalid value encountered in greater (abs_vals > 0)).any()

Using [5: :2] does work when the divisor is a fixed field. If I can't get this to work, it's not a big deal (but it would be great to have all the options I wanted).

My preference would be to set the index and use filter to split the counts and the votes into separate dataframes, then divide one by the other and use join:

d1 = df.set_index('county', append=True)
# Raw strings keep \d from being treated as a string escape.
counts = d1.filter(regex=r'.*_\d+_count$').rename(columns=lambda x: x.replace('_count', ''))
votes = d1.filter(regex=r'.*_\d+_vote$').rename(columns=lambda x: x.replace('_vote', ''))

d1[['_tcount', '_tvote']].join(votes / counts)

                           _tcount  _tvote  _f_npb_18  _f_npb_30
countycode county                                               
35         San Benito        28194   22335   0.650355   0.750000
36         San Bernardino   912653  661838   0.568706   0.690732
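For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same idea. The frame is rebuilt by hand from the two rows shown in the question, and the suffix is stripped by slicing rather than replace; both choices are assumptions made for illustration, not part of the original answer.

```python
import pandas as pd

# Rebuild the two-row example from the question (values copied from the table).
df = pd.DataFrame(
    {
        "county": ["San Benito", "San Bernardino"],
        "_tcount": [28194, 912653],
        "_tvote": [22335, 661838],
        "_f_npb_18_count": [2677, 108724],
        "_f_npb_18_vote": [1741, 61832],
        "_f_npb_30_count": [384, 76749],
        "_f_npb_30_vote": [288, 53013],
    },
    index=pd.Index([35, 36], name="countycode"),
)

d1 = df.set_index("county", append=True)

# Strip the suffix so each count column and its vote column share one label.
counts = d1.filter(regex=r"_\d+_count$").rename(columns=lambda c: c[: -len("_count")])
votes = d1.filter(regex=r"_\d+_vote$").rename(columns=lambda c: c[: -len("_vote")])

# Division aligns on the shared labels, so column order does not matter.
rates = votes / counts
result = d1[["_tcount", "_tvote"]].join(rates)
print(result.round(6))
```

Because the pairing is done by name, this keeps working if the picklist changes how many count/vote pairs are present, as long as the naming pattern holds.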

I think you can divide the NumPy arrays obtained from values, because arrays are divided by position and the column names are not aligned. Then create a new DataFrame with the constructor:

arr = county_select_frame.values
df1 = pd.DataFrame(arr[:,5::2] / arr[:,4::2], columns = county_select_frame.columns[5::2])

Sample:

import numpy as np
import pandas as pd

np.random.seed(10)
county_select_frame = pd.DataFrame(np.random.randint(10, size=(10,10)),
                                   columns=list('abcdefghij'))
print (county_select_frame)
   a  b  c  d  e  f  g  h  i  j
0  9  4  0  1  9  0  1  8  9  0
1  8  6  4  3  0  4  6  8  1  8
2  4  1  3  6  5  3  9  6  9  1
3  9  4  2  6  7  8  8  9  2  0
4  6  7  8  1  7  1  4  0  8  5
5  4  7  8  8  2  6  2  8  8  6
6  6  5  6  0  0  6  9  1  8  9
7  1  2  8  9  9  5  0  2  7  3
8  0  4  2  0  3  3  1  2  5  9
9  0  1  0  1  9  0  9  2  1  1

arr = county_select_frame.values
df1 = pd.DataFrame(arr[:,5::2] / arr[:,4::2], columns = county_select_frame.columns[5::2])
print (df1)
          f         h         j
0  0.000000  8.000000  0.000000
1       inf  1.333333  8.000000
2  0.600000  0.666667  0.111111
3  1.142857  1.125000  0.000000
4  0.142857  0.000000  0.625000
5  3.000000  4.000000  0.750000
6       inf  0.111111  1.125000
7  0.555556       inf  0.428571
8  1.000000  2.000000  1.800000
9  0.000000  0.222222  1.000000
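To connect this back to the frame in the question, here is a small sketch of the alignment issue and the positional workaround. It assumes the layout shown in the question (county at position 0, the two totals next, then count/vote pairs starting at position 3), and the frame is rebuilt by hand for illustration. As far as I can tell, the original attempt also had a second problem: `county_select_frame[4: :2]` slices rows rather than columns, so the divisor was an empty frame.

```python
import pandas as pd

# Two-row frame rebuilt from the question; countycode is the index.
df = pd.DataFrame(
    {
        "county": ["San Benito", "San Bernardino"],
        "_tcount": [28194, 912653],
        "_tvote": [22335, 661838],
        "_f_npb_18_count": [2677, 108724],
        "_f_npb_18_vote": [1741, 61832],
        "_f_npb_30_count": [384, 76749],
        "_f_npb_30_vote": [288, 53013],
    },
    index=pd.Index([35, 36], name="countycode"),
)

# Positions: 0 county, 1 _tcount, 2 _tvote, then (count, vote) pairs from 3 on.
count_part = df.iloc[:, 3::2]  # count columns
vote_part = df.iloc[:, 4::2]   # vote columns

# Label-aligned division: the two frames share no column names,
# so the result is the union of the columns and every cell is NaN.
aligned = vote_part.div(count_part)

# Positional division: to_numpy() drops the labels from the divisor,
# while the result keeps the index and column names of vote_part.
positional = vote_part.div(count_part.to_numpy())
print(positional)
```

Dividing by the array rather than building a new DataFrame from two arrays keeps the countycode index on the result, which may or may not be what you want.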

How about something like

cols = my_df.columns
for i in range(2, 6):
    print(u'Creating new col %s' % cols[i])
    # Divide each column by the one immediately before it.
    my_df['new_{0}'.format(cols[i])] = my_df[cols[i]] / my_df[cols[i-1]]
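A variation on the loop that steps through the columns two at a time, so that each vote column is divided only by the count column just before it, might look like the sketch below. The frame is rebuilt from the question, and `first_pair` is a hypothetical name for the position where the count/vote pairs begin; adjust it to match the real layout.

```python
import pandas as pd

# Two-row frame rebuilt from the question; countycode is the index.
my_df = pd.DataFrame(
    {
        "county": ["San Benito", "San Bernardino"],
        "_tcount": [28194, 912653],
        "_tvote": [22335, 661838],
        "_f_npb_18_count": [2677, 108724],
        "_f_npb_18_vote": [1741, 61832],
        "_f_npb_30_count": [384, 76749],
        "_f_npb_30_vote": [288, 53013],
    },
    index=pd.Index([35, 36], name="countycode"),
)

# Snapshot the column names before adding any new columns.
cols = list(my_df.columns)

# Hypothetical parameter: position of the first count column (county is 0).
first_pair = 3

# Pair each count column with the vote column that follows it.
for count_col, vote_col in zip(cols[first_pair::2], cols[first_pair + 1::2]):
    my_df["new_{0}".format(vote_col)] = my_df[vote_col] / my_df[count_col]

print(my_df.filter(like="new_"))
```

Taking the snapshot of `cols` first means the columns added inside the loop are never picked up as inputs.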


 