I need to calculate a custom sum of a column, df['P'], for each row of a pandas dataframe. I am currently doing it in a for loop, which I realize is very inefficient, but it lays out the structure of the calculation. I am trying to come up with a more pythonic/pandas-consistent implementation to decrease the runtime. I used the solution from the post "pandas: rapidly calculating sum of column with certain values" to increase speed, but it still runs very slowly.
def weight_sum(inc_grp, taz, chosen, probs, hh_id, row_inc_grp, row_taz, row_hh_id):
    return beta_dict['RHO'] * (sum(p for i,j,k,p in zip(inc_grp, taz, chosen, probs) \
                                   if i==row_inc_grp and j in w[row_taz] and k==1)
                               + sum(p for i,j,k,p in zip(inc_grp, hh_id, chosen, probs) \
                                     if i==row_inc_grp and j!=row_hh_id and k==1))
inc_grp = df['income_grp'].values
taz = df['taz'].values
chosen = df['chosen'].values
hh_id = df['hh_id'].values
probs = df['P'].values
for row in df.itertuples():
    df.loc[row[0], 'V_comb'] = row.V_comb + weight_sum(inc_grp, taz, chosen, probs,
                                                       hh_id, row.income_grp, row.taz, row.hh_id)
Basically, for each target row the code adds two sums of df['P'], both restricted to rows where df['income_grp'] is equal to that of the target row and the df['chosen'] column equals 1. The first sum covers rows whose df['taz'] value appears in w[row_taz], where w is a dictionary with the key being a df['taz'] value and the item being the associated df['taz'] values I want to sum over. The second sum covers rows whose df['hh_id'] differs from that of the target row.
I'm sure there is a way to do this, but it has been eluding me. There are about 28,000 rows in the dataframe and this section of code is a major runtime drain. Is there a way to apply this operation to the entire dataframe column at once? I think a groupby().sum() might work.
This is a subset of the dataframe:
hh_mem_id hh_id memb_id taz_struc taz income_grp chosen V_comb P
0 11 11 0 4028.2 4028 2 1 2.0289830623 0.1420552675
1 2002 2002 0 4028.2 4028 3 0 0.1571991902 0.0109275283
2 3775.1 3775 1 4028.2 4028 3 0 1.5821643888 0.045433528
3 1099.2 1099 2 4028.2 4028 3 0 0.3537670241 0.0133011829
4 3249.1 3249 1 4028.2 4028 3 0 0.6103028388 0.017191048
5 2903 2903 0 4028.2 4028 3 0 0.3912196062 0.0276175857
6 3671 3671 0 4028.2 4028 4 0 1.1843450617 0.0203476596
7 133 133 0 4028.2 4028 3 0 0.4345199881 0.014419853
8 1563.2 1563 2 4028.2 4028 5 0 0.0036775258 0.0062482309
9 142 142 0 4028.2 4028 4 0 0.7255248979 0.0192904633
10 5097 5097 0 4028.2 4028 3 0 0.0811923744 0.0202554826
11 3489.2 3489 2 4028.2 4028 4 0 -0.2867591139 0.0046732825
12 2432.1 2432 1 4028.2 4028 2 0 0.0827980747 0.0101440165
13 4296 4296 0 4028.2 4028 3 0 0.5167749373 0.0156561042
14 5377 5377 0 4028.2 4028 2 0 -1.0837694081 0.0063183855
15 3546 3546 0 4028.2 4028 1 0 -1.1511959076 0.0059064042
16 3084 3084 0 4028.2 4028 2 0 -0.6162896774 0.0100839339
17 3506.1 3506 1 4028.2 4028 5 0 0.8353570673 0.0143532716
18 798.1 798 0 4028.2 4028 3 0 1.1557859384 0.0593243037
19 4067 4067 0 4028.2 4028 5 0 0.7786698771 0.013562257
20 786.2 786 2 4028.2 4028 5 0 0.1487080264 0.0054175668
21 4155 4155 0 4028.2 4028 5 0 0.2379145637 0.0118461215
22 3036.1 3036 1 4028.2 4028 5 0 0.9867959382 0.0125251009
23 4223.1 4223 1 4028.2 4028 5 0 0.7162872899 0.0127420574
24 3510 3510 0 4028.2 4028 2 0 -0.4016915094 0.0124976624
25 1736.1 1736 0 4028.2 4028 3 0 1.3770839318 0.0370093239
26 2336.1 2336 1 4028.2 4028 3 0 0.626406915 0.0174701352
27 2367.1 2367 1 4028.2 4028 5 0 0.2879033723 0.0124533457
28 4150.2 4150 2 4028.2 4028 5 0 -0.2505594914 0.0048455529
29 4270 4270 0 4028.2 4028 5 0 0.5620574806 0.0109208993
30 2002.1 2002 1 4028.2 4028 3 0 -0.694312505 0.0046635336
31 3775 3775 0 4028.2 4028 3 0 -0.251272972 0.0072631453
32 1099.1 1099 0 4028.2 4028 3 0 0.7689167591 0.0201459385
33 3249 3249 0 4028.2 4028 3 0 0.0015696848 0.0093526117
34 3671.2 3671 2 4028.2 4028 4 0 -0.0300530998 0.006040989
35 3671.1 3671 1 4028.2 4028 4 0 0.7186898628 0.0127727079
36 133.1 133 1 4028.2 4028 3 0 0.1183203344 0.0105108313
37 1563 1563 0 4028.2 4028 5 0 0.7554359922 0.0132507855
38 1563.3 1563 3 4028.2 4028 5 0 0.856618101 0.0146617042
39 142.1 142 1 4028.2 4028 4 0 -0.5234586083 0.0055324311
40 3489.1 3489 1 4028.2 4028 4 0 0.5136023055 0.0104043412
41 3489 3489 0 4028.2 4028 4 0 1.0174426754 0.0172198625
42 2432 2432 0 4028.2 4028 2 0 0.2873825304 0.0124468612
43 4296.1 4296 1 4028.2 4028 3 0 0.0794730632 0.0101103435
44 3506.2 3506 2 4028.2 4028 5 0 0.0184839582 0.0063414332
45 3506 3506 0 4028.2 4028 5 0 0.2625970387 0.0080947676
46 4067.2 4067 2 4028.2 4028 5 0 0.6172063558 0.0115400915
47 4067.1 4067 1 4028.2 4028 5 0 0.6173185103 0.0115413859
48 786.3 786 3 4028.2 4028 5 0 0.1487080264 0.0054175668
49 786.1 786 1 4028.2 4028 5 0 0.6050092935 0.0085501434
50 786 786 0 4028.2 4028 5 0 0.7613981637 0.0099975187
51 4155.1 4155 1 4028.2 4028 5 0 0.6072911746 0.0171393523
52 3036.2 3036 2 4028.2 4028 5 0 0.7048105533 0.0094474921
53 3036 3036 0 4028.2 4028 5 0 0.627374922 0.0087435273
54 3036.5 3036 5 4028.2 4028 5 0 0.5908809189 0.0084301932
55 4223 4223 0 4028.2 4028 5 0 0.9146967449 0.0155384498
56 4223.3 4223 3 4028.2 4028 5 0 0.9352868379 0.0158617044
57 1736.3 1736 3 4028.2 4028 3 0 0.4855928507 0.0151754471
58 2336 2336 0 4028.2 4028 3 0 0.5800003478 0.0166779301
59 2367 2367 0 4028.2 4028 5 0 0.5503894858 0.0161913222
60 4150 4150 0 4028.2 4028 5 0 0.2127295435 0.0077010015
61 4150.1 4150 1 4028.2 4028 5 0 0.4936026393 0.0101983249
62 4270.2 4270 2 4028.2 4028 5 0 0.9579755018 0.0162256989
63 4270.1 4270 1 4028.2 4028 5 0 0.6540339302 0.0119730078
64 12 12 0 3649.1 3649 5 1 0.7922317695 0.0119365752
65 1922 1922 0 3649.1 3649 2 0 -0.4376740892 0.0069786016
66 5434 5434 0 3649.1 3649 2 0 1.5455019765 0.0507050046
67 3427 3427 0 3649.1 3649 3 0 1.0252726867 0.030138256
68 1710 1710 0 3649.1 3649 3 0 1.4636873348 0.0467217584
69 215 215 0 3649.1 3649 4 0 0.8383515125 0.0083333194
70 3872.1 3872 1 3649.1 3649 5 0 0.5878580212 0.0097301906
71 4184 4184 0 3649.1 3649 3 0 1.6013392113 0.0536167678
72 2305 2305 0 3649.1 3649 2 0 0.914665738 0.0134912482
73 3928 3928 0 3649.1 3649 3 0 1.6743119993 0.0576756249
74 3653 3653 0 3649.1 3649 3 0 1.1358984857 0.0336637343
75 138 138 0 3649.1 3649 3 0 1.7493749526 0.0310857779
76 458 458 0 3649.1 3649 3 0 1.4085683914 0.0442161909
77 1469 1469 0 3649.1 3649 3 0 1.2873661026 0.0391691224
78 5625.2 5625 2 3649.1 3649 5 0 0.2433721144 0.0045964417
79 2606.1 2606 1 3649.1 3649 5 0 0.5828831254 0.0096819041
80 3931.1 3931 1 3649.1 3649 4 0 0.9396346763 0.0069161756
81 4131.2 4131 2 3649.1 3649 5 0 0.5232201888 0.0045605739
82 4302.1 4302 1 3649.1 3649 3 0 0.893931835 0.013214402
83 1754 1754 0 3649.1 3649 2 0 -0.3000669052 0.0080081177
84 2936.1 2936 0 3649.1 3649 3 0 0.6754471945 0.0212417765
85 2737.2 2737 2 3649.1 3649 3 0 -0.5740444845 0.0030444826
86 4040 4040 0 3649.1 3649 3 0 1.0270476272 0.0150958985
87 3007 3007 0 3649.1 3649 5 0 0.8287041974 0.0082533118
88 4198 4198 0 3649.1 3649 2 0 1.7898540629 0.0647398352
89 4886 4886 0 3649.1 3649 5 0 1.0735474149 0.010542954
90 2898 2898 0 3649.1 3649 2 0 1.4747234015 0.0472402386
91 507 507 0 3649.1 3649 3 0 1.0621690726 0.0312710176
92 3320 3320 0 3649.1 3649 2 0 1.8349981668 0.0677294306
93 1725.2 1725 2 3649.1 3649 3 0 0.7758190633 0.0117422626
94 215.2 215 2 3649.1 3649 4 0 0.2386153377 0.0045746294
95 215.1 215 1 3649.1 3649 4 0 1.499844627 0.0161473343
96 3872 3872 0 3649.1 3649 5 0 0.9871911231 0.0145060613
97 2305.2 2305 2 3649.1 3649 2 0 0.7395638436 0.0113241691
98 138.1 138 1 3649.1 3649 3 0 0.9743617728 0.0143211467
99 5625 5625 0 3649.1 3649 5 0 0.5903762734 0.0065031497
100 5625.1 5625 1 3649.1 3649 5 0 0.9824527912 0.0096249929
101 2606 2606 0 3649.1 3649 5 0 1.2693837925 0.0192355331
102 3931.2 3931 2 3649.1 3649 4 0 0.928477973 0.0068394427
103 3931 3931 0 3649.1 3649 4 0 0.855892031 0.0063605847
104 3931.3 3931 3 3649.1 3649 4 0 0.8567504113 0.0063660469
105 4131.3 4131 3 3649.1 3649 5 0 0.7858987531 0.0059306097
106 4131 4131 0 3649.1 3649 5 0 0.4918550313 0.0044197508
107 4131.1 4131 1 3649.1 3649 5 0 1.3324098035 0.010243446
108 4302 4302 0 3649.1 3649 3 0 1.0205806143 0.0149985882
109 2737.1 2737 0 3649.1 3649 3 0 0.7340224027 0.0112615905
110 4040.1 4040 1 3649.1 3649 3 0 0.6811995799 0.0106821598
111 3007.1 3007 1 3649.1 3649 5 0 0.825227624 0.0082246684
112 3007.2 3007 2 3649.1 3649 5 0 0.7815236308 0.007872959
113 4886.1 4886 1 3649.1 3649 5 0 0.7827331819 0.0078824876
114 4886.2 4886 2 3649.1 3649 5 0 0.7767939208 0.0078358102
115 1725.1 1725 0 3649.1 3649 3 0 0.9985947281 0.0146724295
116 12.1 12 1 3649.1 3649 5 1 1.0093720796 0.0148314146
117 40 40 0 3602.2 3602 3 1 1.4149337468 0.0496880853
118 2728 2728 0 3602.2 3602 3 0 0.2540527003 0.0155628105
119 4786.1 4786 0 3602.2 3602 3 0 1.8863507604 0.0796133813
This is a sample entry of 'w' for df['taz'] == 4028:
{3602: 1.0, 4027: 1.0, 4029: 1.0}
For row 1, I need to calculate df['P'].sum() where df['taz'] == 4028, df['income_grp'] == 2, and df['chosen'] == 1. I also need the summation where df['hh_id'] != 11, df['income_grp'] == 2, and df['chosen'] == 1. This should be added to the column df['V_comb']. I need to do this for each row of the dataframe, and the code is run multiple times because it is part of an optimization algorithm.
Based on your edited post, this should do what you want:
df['V_comb'] = df[(df['income_grp']==2) & (df['taz']==4028) & (df['chosen']==1)][['P','V_comb']].sum(axis=1)
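The groupby().sum() idea raised in the question can be generalised to every row at once. What follows is only a sketch, assuming the target quantity is exactly what weight_sum in the question computes; the names lookup, weight_sum_grouped, demo, w_demo and the rho argument are hypothetical, and the demo values are made up. The second sum is obtained as the income-group total minus the row's own household share, and the first sum by joining an edge list built from w against per-zone totals.

```python
import pandas as pd

def lookup(df, table, keys, col):
    """Left-join `table` onto df[keys]; return `col` in df's row order (0 if no match)."""
    merged = df[keys].merge(table, on=keys, how="left")
    return merged[col].fillna(0.0).to_numpy()

def weight_sum_grouped(df, w, rho):
    """rho * (neighbour-zone sum + other-household sum) of chosen P, per row."""
    chosen = df.loc[df["chosen"] == 1, ["income_grp", "taz", "hh_id", "P"]]

    # Totals of chosen P per income group, per household and per zone.
    by_grp = chosen.groupby("income_grp", as_index=False)["P"].sum()
    by_hh = chosen.groupby(["income_grp", "hh_id"], as_index=False)["P"].sum()
    by_zone = chosen.groupby(["income_grp", "taz"], as_index=False)["P"].sum()

    # Second sum: same income group, different household
    # = income-group total minus the row's own household total.
    other_hh = (lookup(df, by_grp, ["income_grp"], "P")
                - lookup(df, by_hh, ["income_grp", "hh_id"], "P"))

    # First sum: same income group, zone listed in w[row's taz].
    # Flatten w into (taz, neighbour) pairs and join them to the zone totals.
    edges = pd.DataFrame(
        [(t, n) for t, nbrs in w.items() for n in nbrs],
        columns=["taz", "nbr"],
    )
    nbr_tot = (edges.merge(by_zone.rename(columns={"taz": "nbr"}), on="nbr")
                    .groupby(["income_grp", "taz"], as_index=False)["P"].sum())
    near = lookup(df, nbr_tot, ["income_grp", "taz"], "P")

    return rho * (near + other_hh)

# Made-up demo data with the same column names as the question.
demo = pd.DataFrame({
    "income_grp": [2, 3, 2, 2, 3, 2],
    "taz": [4028, 4028, 3649, 3602, 3602, 4028],
    "hh_id": [11, 2002, 12, 40, 41, 11],
    "chosen": [1, 0, 1, 1, 1, 0],
    "P": [0.14, 0.01, 0.012, 0.05, 0.02, 0.03],
})
w_demo = {4028: {3602: 1.0, 4027: 1.0, 4029: 1.0}, 3649: {4028: 1.0}, 3602: {4028: 1.0}}
result = weight_sum_grouped(demo, w_demo, rho=0.5)
```

With the real data this would be used as df['V_comb'] += weight_sum_grouped(df, w, beta_dict['RHO']). Unlike the loop, a taz that is missing from w contributes 0 instead of raising a KeyError, and the subtraction can differ from the loop's result by floating-point rounding.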
I was able to drastically improve the runtime through a combination of changes. First, there was no reason to perform the filter on the dataframe every time the optimization ran. I now do it once, in a for loop at the beginning of the program, which I optimized by placing it in a function and compiling it with Cython. The result is a numpy array containing 0/1 for whether the conditions hold between each pair of rows. I can then obtain the sum of the probabilities by taking the dot product of this matrix with the dataframe column as a vector. According to my profiling, most of the time is now spent in the optimization itself (easily improved by updating the initial parameter values to the output of the last run). A snippet of the code:
import numpy as np
cimport numpy as np

def get_filt_mat(long[:, :] X, double[:, :] Y, M):
    cdef int N = X.shape[0]
    cdef int[:] indices, indptr
    cdef int i, j
    indices = M.indices.astype(np.int32)
    indptr = M.indptr.astype(np.int32)
    cdef int I = indptr.shape[0]
    for i in range(N):
        for j in range(N):
            if X[i,0] == X[j,0] and X[j,3] == 1:
                if N<=I:
                    if indptr[i]==X[i,2] and indices[j]==X[j,2]:
                        Y[i,j] = 1
                if X[i,1] == X[j,1] and X[j,2] != X[i,2]:
                    Y[i,j] = 1
    return Y
Function call:
N = df.shape[0]
filtArray = np.zeros((N,N))
inArray = df[['income_grp', 'taz', 'hh_id', 'chosen']].values
outArray = get_filt_mat(inArray, filtArray, ws)
outArray = outArray.base
Application to dataframe column:
vectProb = df['P'].values
df['P_w'] = outArray.dot(vectProb) * beta_dict['RHO']
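For comparison, the same mask-then-dot-product idea can be sketched in plain NumPy with broadcasting, without Cython. This is a sketch under assumptions, not the code used above: it implements the two sums from the question's weight_sum (so a pair of rows that satisfies both conditions is counted twice, as in the original loop), and the names pairwise_weight_sum, the demo arrays and w_demo are hypothetical. It builds several N x N arrays, so for about 28,000 rows it needs a few gigabytes of memory.

```python
import numpy as np

def pairwise_weight_sum(inc, taz, hh, chosen, p, w, rho):
    """rho * (filter matrix @ p), with the filter built from pairwise masks."""
    inc, taz, hh, chosen, p = (np.asarray(a) for a in (inc, taz, hh, chosen, p))

    # cand[i, j]: row j is chosen and in the same income group as row i.
    cand = (inc[:, None] == inc[None, :]) & (chosen[None, :] == 1)

    # near[i, j]: the taz of row j is a key of w[taz of row i].
    # Build a small zone-by-zone table first, then expand it to row pairs.
    zones, zone_idx = np.unique(taz, return_inverse=True)
    zone_list = zones.tolist()
    adj = np.array([[b in w.get(a, ()) for b in zone_list] for a in zone_list],
                   dtype=bool)
    near = adj[zone_idx[:, None], zone_idx[None, :]]

    # other[i, j]: row j belongs to a different household than row i.
    other = hh[:, None] != hh[None, :]

    # Add the two masks as integers so each sum keeps its own contribution.
    filt = (cand & near).astype(np.int8) + (cand & other).astype(np.int8)
    return rho * filt.dot(p)

# Made-up demo values using the question's column meanings.
inc_demo = [2, 3, 2, 2, 3, 2]
taz_demo = [4028, 4028, 3649, 3602, 3602, 4028]
hh_demo = [11, 2002, 12, 40, 41, 11]
chosen_demo = [1, 0, 1, 1, 1, 0]
p_demo = [0.14, 0.01, 0.012, 0.05, 0.02, 0.03]
w_demo = {4028: {3602: 1.0, 4027: 1.0, 4029: 1.0}, 3649: {4028: 1.0}, 3602: {4028: 1.0}}

result = pairwise_weight_sum(inc_demo, taz_demo, hh_demo, chosen_demo, p_demo,
                             w_demo, rho=0.5)
```

Because the masks depend only on income_grp, taz, hh_id and chosen, filt can be built once and reused across optimization runs as long as those columns do not change; only the dot product with the current P values has to be repeated.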
This is my first time using Cython, so the code is probably far from perfect, but it now runs in about 10 minutes, whereas my original pure Python and pandas algorithms ran for 14 hours without completing. I found these resources helpful (especially with getting cython to work with sparse matrices):