
Pythonic/fast method to create pandas column: subset sum of column values

I need to calculate a custom sum of the column df['P'] for each row of a pandas dataframe. I currently do this in a for loop, which I realize is very inefficient, but it lays out the structure of the calculation. I am looking for a more pythonic, pandas-consistent implementation that reduces the runtime. I used the solution from this post to speed it up: pandas: rapidly calculating sum of column with certain values. It still runs very slowly.

def weight_sum(inc_grp, taz, chosen, probs, hh_id, row_inc_grp, row_taz, row_hh_id):
    # beta_dict and w are globals; w[row_taz] holds the zones associated with row_taz
    return beta_dict['RHO'] * (sum(p for i, j, k, p in zip(inc_grp, taz, chosen, probs)
                                   if i == row_inc_grp and j in w[row_taz] and k == 1)
                               + sum(p for i, j, k, p in zip(inc_grp, hh_id, chosen, probs)
                                     if i == row_inc_grp and j != row_hh_id and k == 1))

inc_grp = df['income_grp'].values
taz = df['taz'].values
chosen = df['chosen'].values
hh_id = df['hh_id'].values
probs = df['P'].values
for row in df.itertuples():
    df.loc[row[0], 'V_comb'] = row.V_comb + weight_sum(inc_grp, taz, chosen, probs,
                                    hh_id, row.income_grp, row.taz, row.hh_id)

Basically, the code does the following:

  1. Get the rows where df['income_grp'] equals the target row's value and df['chosen'] equals 1.
  2. Filter those rows further with the dictionary w: the key is the target row's df['taz'] value, and the item holds the associated df['taz'] values I want to sum over.
  3. Do a similar subset extraction for rows that match the target row's column values but do not belong to the target row's household (identified by df['hh_id']).
  4. Update an existing column with the sum of all these values for each row.

I'm sure there is a way to do this, but it has been eluding me. There are about 28,000 rows in the dataframe and this section of code is a major runtime drain. Is there a way to apply this operation on the entire dataframe column at once? I think a groupby().sum() might work.
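The groupby idea can be sketched for step 3, the sum over chosen rows in the same income group but a different household. This is only a sketch, assuming the column names shown in the question. The helper name `other_household_sum` is hypothetical. It relies on the identity "sum over other households = income-group total minus own-household total", so no per-row loop is needed.

```python
import pandas as pd

def other_household_sum(df):
    """For each row, sum P over chosen rows that share its income group
    but belong to a different household (hypothetical helper)."""
    # Keep P only where the row was chosen; other rows contribute 0.
    chosen_p = df['P'].where(df['chosen'] == 1, 0.0)
    # Total chosen probability in each income group, broadcast back to rows.
    grp_total = chosen_p.groupby(df['income_grp']).transform('sum')
    # Chosen probability contributed by the row's own household.
    own_hh = chosen_p.groupby([df['income_grp'], df['hh_id']]).transform('sum')
    return grp_total - own_hh
```

The result is a Series aligned with `df.index`, so it could be multiplied by `beta_dict['RHO']` and added to `df['V_comb']` in one vectorized statement.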

This is a subset of the dataframe:

    hh_mem_id   hh_id   memb_id taz_struc   taz income_grp  chosen  V_comb  P
0   11  11  0   4028.2  4028    2   1   2.0289830623    0.1420552675
1   2002    2002    0   4028.2  4028    3   0   0.1571991902    0.0109275283
2   3775.1  3775    1   4028.2  4028    3   0   1.5821643888    0.045433528
3   1099.2  1099    2   4028.2  4028    3   0   0.3537670241    0.0133011829
4   3249.1  3249    1   4028.2  4028    3   0   0.6103028388    0.017191048
5   2903    2903    0   4028.2  4028    3   0   0.3912196062    0.0276175857
6   3671    3671    0   4028.2  4028    4   0   1.1843450617    0.0203476596
7   133 133 0   4028.2  4028    3   0   0.4345199881    0.014419853
8   1563.2  1563    2   4028.2  4028    5   0   0.0036775258    0.0062482309
9   142 142 0   4028.2  4028    4   0   0.7255248979    0.0192904633
10  5097    5097    0   4028.2  4028    3   0   0.0811923744    0.0202554826
11  3489.2  3489    2   4028.2  4028    4   0   -0.2867591139   0.0046732825
12  2432.1  2432    1   4028.2  4028    2   0   0.0827980747    0.0101440165
13  4296    4296    0   4028.2  4028    3   0   0.5167749373    0.0156561042
14  5377    5377    0   4028.2  4028    2   0   -1.0837694081   0.0063183855
15  3546    3546    0   4028.2  4028    1   0   -1.1511959076   0.0059064042
16  3084    3084    0   4028.2  4028    2   0   -0.6162896774   0.0100839339
17  3506.1  3506    1   4028.2  4028    5   0   0.8353570673    0.0143532716
18  798.1   798 0   4028.2  4028    3   0   1.1557859384    0.0593243037
19  4067    4067    0   4028.2  4028    5   0   0.7786698771    0.013562257
20  786.2   786 2   4028.2  4028    5   0   0.1487080264    0.0054175668
21  4155    4155    0   4028.2  4028    5   0   0.2379145637    0.0118461215
22  3036.1  3036    1   4028.2  4028    5   0   0.9867959382    0.0125251009
23  4223.1  4223    1   4028.2  4028    5   0   0.7162872899    0.0127420574
24  3510    3510    0   4028.2  4028    2   0   -0.4016915094   0.0124976624
25  1736.1  1736    0   4028.2  4028    3   0   1.3770839318    0.0370093239
26  2336.1  2336    1   4028.2  4028    3   0   0.626406915 0.0174701352
27  2367.1  2367    1   4028.2  4028    5   0   0.2879033723    0.0124533457
28  4150.2  4150    2   4028.2  4028    5   0   -0.2505594914   0.0048455529
29  4270    4270    0   4028.2  4028    5   0   0.5620574806    0.0109208993
30  2002.1  2002    1   4028.2  4028    3   0   -0.694312505    0.0046635336
31  3775    3775    0   4028.2  4028    3   0   -0.251272972    0.0072631453
32  1099.1  1099    0   4028.2  4028    3   0   0.7689167591    0.0201459385
33  3249    3249    0   4028.2  4028    3   0   0.0015696848    0.0093526117
34  3671.2  3671    2   4028.2  4028    4   0   -0.0300530998   0.006040989
35  3671.1  3671    1   4028.2  4028    4   0   0.7186898628    0.0127727079
36  133.1   133 1   4028.2  4028    3   0   0.1183203344    0.0105108313
37  1563    1563    0   4028.2  4028    5   0   0.7554359922    0.0132507855
38  1563.3  1563    3   4028.2  4028    5   0   0.856618101 0.0146617042
39  142.1   142 1   4028.2  4028    4   0   -0.5234586083   0.0055324311
40  3489.1  3489    1   4028.2  4028    4   0   0.5136023055    0.0104043412
41  3489    3489    0   4028.2  4028    4   0   1.0174426754    0.0172198625
42  2432    2432    0   4028.2  4028    2   0   0.2873825304    0.0124468612
43  4296.1  4296    1   4028.2  4028    3   0   0.0794730632    0.0101103435
44  3506.2  3506    2   4028.2  4028    5   0   0.0184839582    0.0063414332
45  3506    3506    0   4028.2  4028    5   0   0.2625970387    0.0080947676
46  4067.2  4067    2   4028.2  4028    5   0   0.6172063558    0.0115400915
47  4067.1  4067    1   4028.2  4028    5   0   0.6173185103    0.0115413859
48  786.3   786 3   4028.2  4028    5   0   0.1487080264    0.0054175668
49  786.1   786 1   4028.2  4028    5   0   0.6050092935    0.0085501434
50  786 786 0   4028.2  4028    5   0   0.7613981637    0.0099975187
51  4155.1  4155    1   4028.2  4028    5   0   0.6072911746    0.0171393523
52  3036.2  3036    2   4028.2  4028    5   0   0.7048105533    0.0094474921
53  3036    3036    0   4028.2  4028    5   0   0.627374922 0.0087435273
54  3036.5  3036    5   4028.2  4028    5   0   0.5908809189    0.0084301932
55  4223    4223    0   4028.2  4028    5   0   0.9146967449    0.0155384498
56  4223.3  4223    3   4028.2  4028    5   0   0.9352868379    0.0158617044
57  1736.3  1736    3   4028.2  4028    3   0   0.4855928507    0.0151754471
58  2336    2336    0   4028.2  4028    3   0   0.5800003478    0.0166779301
59  2367    2367    0   4028.2  4028    5   0   0.5503894858    0.0161913222
60  4150    4150    0   4028.2  4028    5   0   0.2127295435    0.0077010015
61  4150.1  4150    1   4028.2  4028    5   0   0.4936026393    0.0101983249
62  4270.2  4270    2   4028.2  4028    5   0   0.9579755018    0.0162256989
63  4270.1  4270    1   4028.2  4028    5   0   0.6540339302    0.0119730078
64  12  12  0   3649.1  3649    5   1   0.7922317695    0.0119365752
65  1922    1922    0   3649.1  3649    2   0   -0.4376740892   0.0069786016
66  5434    5434    0   3649.1  3649    2   0   1.5455019765    0.0507050046
67  3427    3427    0   3649.1  3649    3   0   1.0252726867    0.030138256
68  1710    1710    0   3649.1  3649    3   0   1.4636873348    0.0467217584
69  215 215 0   3649.1  3649    4   0   0.8383515125    0.0083333194
70  3872.1  3872    1   3649.1  3649    5   0   0.5878580212    0.0097301906
71  4184    4184    0   3649.1  3649    3   0   1.6013392113    0.0536167678
72  2305    2305    0   3649.1  3649    2   0   0.914665738 0.0134912482
73  3928    3928    0   3649.1  3649    3   0   1.6743119993    0.0576756249
74  3653    3653    0   3649.1  3649    3   0   1.1358984857    0.0336637343
75  138 138 0   3649.1  3649    3   0   1.7493749526    0.0310857779
76  458 458 0   3649.1  3649    3   0   1.4085683914    0.0442161909
77  1469    1469    0   3649.1  3649    3   0   1.2873661026    0.0391691224
78  5625.2  5625    2   3649.1  3649    5   0   0.2433721144    0.0045964417
79  2606.1  2606    1   3649.1  3649    5   0   0.5828831254    0.0096819041
80  3931.1  3931    1   3649.1  3649    4   0   0.9396346763    0.0069161756
81  4131.2  4131    2   3649.1  3649    5   0   0.5232201888    0.0045605739
82  4302.1  4302    1   3649.1  3649    3   0   0.893931835 0.013214402
83  1754    1754    0   3649.1  3649    2   0   -0.3000669052   0.0080081177
84  2936.1  2936    0   3649.1  3649    3   0   0.6754471945    0.0212417765
85  2737.2  2737    2   3649.1  3649    3   0   -0.5740444845   0.0030444826
86  4040    4040    0   3649.1  3649    3   0   1.0270476272    0.0150958985
87  3007    3007    0   3649.1  3649    5   0   0.8287041974    0.0082533118
88  4198    4198    0   3649.1  3649    2   0   1.7898540629    0.0647398352
89  4886    4886    0   3649.1  3649    5   0   1.0735474149    0.010542954
90  2898    2898    0   3649.1  3649    2   0   1.4747234015    0.0472402386
91  507 507 0   3649.1  3649    3   0   1.0621690726    0.0312710176
92  3320    3320    0   3649.1  3649    2   0   1.8349981668    0.0677294306
93  1725.2  1725    2   3649.1  3649    3   0   0.7758190633    0.0117422626
94  215.2   215 2   3649.1  3649    4   0   0.2386153377    0.0045746294
95  215.1   215 1   3649.1  3649    4   0   1.499844627 0.0161473343
96  3872    3872    0   3649.1  3649    5   0   0.9871911231    0.0145060613
97  2305.2  2305    2   3649.1  3649    2   0   0.7395638436    0.0113241691
98  138.1   138 1   3649.1  3649    3   0   0.9743617728    0.0143211467
99  5625    5625    0   3649.1  3649    5   0   0.5903762734    0.0065031497
100 5625.1  5625    1   3649.1  3649    5   0   0.9824527912    0.0096249929
101 2606    2606    0   3649.1  3649    5   0   1.2693837925    0.0192355331
102 3931.2  3931    2   3649.1  3649    4   0   0.928477973 0.0068394427
103 3931    3931    0   3649.1  3649    4   0   0.855892031 0.0063605847
104 3931.3  3931    3   3649.1  3649    4   0   0.8567504113    0.0063660469
105 4131.3  4131    3   3649.1  3649    5   0   0.7858987531    0.0059306097
106 4131    4131    0   3649.1  3649    5   0   0.4918550313    0.0044197508
107 4131.1  4131    1   3649.1  3649    5   0   1.3324098035    0.010243446
108 4302    4302    0   3649.1  3649    3   0   1.0205806143    0.0149985882
109 2737.1  2737    0   3649.1  3649    3   0   0.7340224027    0.0112615905
110 4040.1  4040    1   3649.1  3649    3   0   0.6811995799    0.0106821598
111 3007.1  3007    1   3649.1  3649    5   0   0.825227624 0.0082246684
112 3007.2  3007    2   3649.1  3649    5   0   0.7815236308    0.007872959
113 4886.1  4886    1   3649.1  3649    5   0   0.7827331819    0.0078824876
114 4886.2  4886    2   3649.1  3649    5   0   0.7767939208    0.0078358102
115 1725.1  1725    0   3649.1  3649    3   0   0.9985947281    0.0146724295
116 12.1    12  1   3649.1  3649    5   1   1.0093720796    0.0148314146
117 40  40  0   3602.2  3602    3   1   1.4149337468    0.0496880853
118 2728    2728    0   3602.2  3602    3   0   0.2540527003    0.0155628105
119 4786.1  4786    0   3602.2  3602    3   0   1.8863507604    0.0796133813

This is a sample entry of 'w' for df['taz'] == 4028:

{3602: 1.0, 4027: 1.0, 4029: 1.0}

For row 1, I need to calculate df['P'].sum() where df['taz'] == 4028, df['income_grp'] == 2, and df['chosen'] == 1. I also need the same summation where df['hh_id'] != 11, df['income_grp'] == 2, and df['chosen'] == 1. The result should be added to the column df['V_comb']. I need to do this for each row of the dataframe, and the code runs many times because it is part of an optimization algorithm.
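The zone-based term in weight_sum (the `j in w[row_taz]` test) can also be computed without a per-row scan. The sketch below assumes `w` maps each taz to a dict keyed by associated zones, as in the sample entry. The helper name `neighbor_zone_sum` is hypothetical, and treating a taz missing from `w` as having no associated zones is my assumption (the original code would raise a KeyError). It sums the chosen probabilities once per (income_grp, taz) pair, then looks up each row's pair.

```python
import pandas as pd

def neighbor_zone_sum(df, w):
    """For each row, sum P over chosen rows in the same income group whose
    taz is a key of w[row taz] (hypothetical helper)."""
    chosen = df[df['chosen'] == 1]
    # Chosen probability per (income_grp, taz), as a dict keyed by tuples.
    totals = chosen.groupby(['income_grp', 'taz'])['P'].sum().to_dict()
    lookup = {}
    pairs = df[['income_grp', 'taz']].drop_duplicates()
    for inc, taz in pairs.itertuples(index=False, name=None):
        # Assumption: a taz absent from w has no associated zones.
        zones = w.get(taz, {})
        lookup[(inc, taz)] = sum(totals.get((inc, z), 0.0) for z in zones)
    keys = zip(df['income_grp'], df['taz'])
    return pd.Series([lookup[k] for k in keys], index=df.index)
```

The loop runs once per distinct (income_grp, taz) pair rather than once per row, which is typically far fewer iterations than the number of rows.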

Based on your edited post, this should do what you want:

df['V_comb'] = df[(df['income_grp']==2) & (df['taz']==4028) & (df['chosen']==1)][['P','V_comb']].sum(axis=1)

I was able to drastically improve the runtime through a combination of changes. First, there was no reason to filter the dataframe every time the optimization ran. I now do it once, in a for loop at the beginning of the program, which I sped up by moving it into a function and compiling it with Cython. The result is a numpy array of 0/1 values indicating whether the conditions hold for each pair of rows. I can then obtain the sum of the probabilities by taking the dot product of this matrix with the dataframe column as a vector. According to my profiling, most of the time is now spent in the optimization itself (easily improved by initializing the parameters to the output of the last run). A snippet of the code:

import numpy as np
cimport numpy as np

def get_filt_mat(long[:, :] X, double[:, :] Y, M):
    cdef int N = X.shape[0]
    cdef int[:] indices, indptr
    cdef int i, j

    indices = M.indices.astype(np.int32)
    indptr = M.indptr.astype(np.int32)
    cdef int I = indptr.shape[0]

    for i in range(N):
        for j in range(N):
            if X[i,0] == X[j,0] and X[j,3] == 1:
                if N<=I:
                    if indptr[i]==X[i,2] and indices[j]==X[j,2]:
                        Y[i,j] = 1
                if X[i,1] == X[j,1] and X[j,2] != X[i,2]:
                    Y[i,j] = 1
    return Y

Function call:

N = df.shape[0]
filtArray = np.zeros((N,N))

inArray = df[['income_grp', 'taz', 'hh_id', 'chosen']].values
outArray = get_filt_mat(inArray, filtArray, ws)
outArray = outArray.base

Application to dataframe column:

vectProb = df['P'].values
df['P_w'] = outArray.dot(vectProb) * beta_dict['RHO']
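For comparison, the pairwise matrix can also be sketched in plain NumPy with broadcasting. This is not the Cython function above. It follows the two sums in the question's weight_sum, so a pair that satisfies both conditions gets weight 2. The name `pair_weights` is hypothetical, and `w` is assumed to map each taz to a dict keyed by associated zones. An N x N float matrix for about 28,000 rows needs several gigabytes, so this is only practical if memory allows.

```python
import numpy as np

def pair_weights(inc, taz, hh, chosen, w):
    """Return W where W[i, j] counts how many of the two weight_sum
    conditions row j satisfies for target row i (hypothetical helper)."""
    inc, taz, hh, chosen = map(np.asarray, (inc, taz, hh, chosen))
    # Row j must share row i's income group and must have been chosen.
    base = (inc[:, None] == inc[None, :]) & (chosen[None, :] == 1)
    # Zone adjacency table: adj[a, b] is True when zone b is a key of w[zone a].
    zones, zone_idx = np.unique(taz, return_inverse=True)
    pos = {z: k for k, z in enumerate(zones.tolist())}
    adj = np.zeros((zones.size, zones.size), dtype=bool)
    for z, nbrs in w.items():
        for nb in nbrs:
            if z in pos and nb in pos:
                adj[pos[z], pos[nb]] = True
    neighbor = adj[zone_idx[:, None], zone_idx[None, :]]
    # Row j must belong to a different household than row i.
    other_hh = hh[:, None] != hh[None, :]
    return (base & neighbor).astype(float) + (base & other_hh).astype(float)
```

As with the Cython version, the matrix only needs to be built once, and each optimization pass then reduces to `pair_weights(...).dot(df['P'].values)`.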

This is my first time using cython, and this is probably less than perfect code, but it now runs in about 10 minutes vs. 14 hours without completion using my original algorithms in pure python and pandas. I found these resources helpful (especially with getting cython to work with sparse matrices):

http://jakevdp.github.io/blog/2012/08/24/numba-vs-cython/

https://stackoverflow.com/questions/25295159/how-to-properly-pass-a-scipy-sparse-csr-matrix-to-a-cython-function
