I'm creating a dataframe by pairing down a very large dataframe (approximately 400 columns) based on a choices an enduser makes on a picklist. One of the picklist choices is the type of denominator that the enduser would like. Here is one example table with all the information before the final calculation is made.
county _tcount _tvote _f_npb_18_count _f_npb_18_vote
countycode
35 San Benito 28194 22335 2677 1741
36 San Bernardino 912653 661838 108724 61832
countycode _f_npb_30_count _f_npb_30_vote
35 384 288
36 76749 53013
However, I am trouble creating code that will automatically divide every column starting with the 5th (not including the index) by the column before it (skipping every other column). I've seen examples ( Divide multiple columns by another column in pandas ), but they all use fixed column names which is not achievable for this aspect. I've able to variable columns (based on positions) by fixed columns, but not variable columns by other variable columns based on position. I've tried modifying the code in the above link based on the column positions:
calculated_frame = [county_select_frame[county_select_frame.columns[5: : 2]].div(county_select_frame[4: :2], axis=0)]
output:
[ county _tcount _tvote _f_npb_18_count _f_npb_18_vote \
countycode
35 NaN NaN NaN NaN NaN
36 NaN NaN NaN NaN NaN]
RuntimeWarning: invalid value encountered in greater (abs_vals > 0)).any()
The use of [5: :2]
does work when the dividend is a fixed field.If I can't get this to work, it's not a big deal (But it would be great to have all options I wanted).
My preference would be to organize it by setting the index and using filter
to split out a counts and votes dataframes separately. Then use join
d1 = df.set_index('county', append=True)
counts = d1.filter(regex='.*_\d+_count$').rename(columns=lambda x: x.replace('_count', ''))
votes = d1.filter(regex='.*_\d+_vote$').rename(columns=lambda x: x.replace('_vote', ''))
d1[['_tcount', '_tvote']].join(votes / counts)
_tcount _tvote _f_npb_18 _f_npb_30
countycode county
35 San Benito 28194 22335 0.650355 0.750000
36 San Bernardino 912653 661838 0.568706 0.690732
I think you can divide by numpy array
s created by values
, because then not align columns names. Last create new DataFrame
by constructor:
arr = county_select_frame.values
df1 = pd.DataFrame(arr[:,5::2] / arr[:,4::2], columns = county_select_frame.columns[5::2])
Sample:
np.random.seed(10)
county_select_frame = pd.DataFrame(np.random.randint(10, size=(10,10)),
columns=list('abcdefghij'))
print (county_select_frame)
a b c d e f g h i j
0 9 4 0 1 9 0 1 8 9 0
1 8 6 4 3 0 4 6 8 1 8
2 4 1 3 6 5 3 9 6 9 1
3 9 4 2 6 7 8 8 9 2 0
4 6 7 8 1 7 1 4 0 8 5
5 4 7 8 8 2 6 2 8 8 6
6 6 5 6 0 0 6 9 1 8 9
7 1 2 8 9 9 5 0 2 7 3
8 0 4 2 0 3 3 1 2 5 9
9 0 1 0 1 9 0 9 2 1 1
arr = county_select_frame.values
df1 = pd.DataFrame(arr[:,5::2] / arr[:,4::2], columns = county_select_frame.columns[5::2])
print (df1)
f h j
0 0.000000 8.000000 0.000000
1 inf 1.333333 8.000000
2 0.600000 0.666667 0.111111
3 1.142857 1.125000 0.000000
4 0.142857 0.000000 0.625000
5 3.000000 4.000000 0.750000
6 inf 0.111111 1.125000
7 0.555556 inf 0.428571
8 1.000000 2.000000 1.800000
9 0.000000 0.222222 1.000000
How about something like
cols = my_df.columns
for i in range(2, 6):
print(u'Creating new col %s', cols[i])
my_df['new_{0}'.format(cols[i]) = my_df[cols[i]] / my_df[cols[i-1]
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.