
pandas - using a for loop to append multiple columns to a dataframe

I want to populate the empty columns 'web', 'mob' and 'app' in df1 by summing the values in df2 over the relevant dates for each id.

df1:

id      start       end         web mob app
12345   2018-01-17  2018-01-20
12346   2018-01-19  2018-01-22
12347   2018-01-20  2018-01-23
12348   2018-01-20  2018-01-23
12349   2018-01-21  2018-01-24

df2:

id      date        web mob app
12345   2018-01-17  7   17  10
12345   2018-01-18  9   18  7
12345   2018-01-19  3   19  15
12345   2018-01-20  6   17  8
12345   2018-01-21  8   9   13
12345   2018-01-22  4   15  12
12345   2018-01-23  8   11  13
12345   2018-01-24  9   16  14
12346   2018-01-17  3   17  12
12346   2018-01-18  4   19  4
12346   2018-01-19  6   13  10
12346   2018-01-20  1   15  6
12346   2018-01-21  4   12  11
12346   2018-01-22  5   20  12
12346   2018-01-23  8   13  14
12346   2018-01-24  6   18  8

This for loop will populate the 'web' column:

column = []

for i in df1.index:
    column.append(df2[(df2['date'] >= df1['start'].iloc[i]) 
        & (df2['date'] <= df1['end'].iloc[i]) 
        & (df2['id'] == df1['id'].iloc[i])].sum()['web'])

df1['web'] = column

I want to be able to populate all 3 columns with one for loop, rather than doing 3 separate loops.

I have a feeling that appending something like this

.agg({'web':'sum', 'mob':'sum', 'app':'sum'})

to a 2-dimensional list could be the answer.
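A minimal sketch of that single-loop idea, assuming the date columns are already datetime and using a small made-up subset of the data above. The names `cols`, `rows`, `totals` and `result` are illustrative only, and a column-wise `.sum()` stands in for the `.agg({...})` call:

```python
import pandas as pd

# Illustrative sample data (a subset of df1/df2 from the question).
df1 = pd.DataFrame({
    "id": [12345, 12346, 12347],
    "start": pd.to_datetime(["2018-01-17", "2018-01-19", "2018-01-20"]),
    "end": pd.to_datetime(["2018-01-20", "2018-01-22", "2018-01-23"]),
})
df2 = pd.DataFrame({
    "id": [12345] * 4 + [12346] * 4,
    "date": pd.to_datetime([
        "2018-01-17", "2018-01-18", "2018-01-19", "2018-01-20",
        "2018-01-19", "2018-01-20", "2018-01-21", "2018-01-22",
    ]),
    "web": [7, 9, 3, 6, 6, 1, 4, 5],
    "mob": [17, 18, 19, 17, 13, 15, 12, 20],
    "app": [10, 7, 15, 8, 10, 6, 11, 12],
})

cols = ["web", "mob", "app"]
rows = []  # the "2-dimensional list": one [web, mob, app] entry per df1 row

for r in df1.itertuples(index=False):
    # Rows of df2 with the same id and a date inside [start, end].
    mask = (df2["id"] == r.id) & (df2["date"] >= r.start) & (df2["date"] <= r.end)
    # Sum all three columns at once; an empty selection sums to 0.
    rows.append(df2.loc[mask, cols].sum().tolist())

# Attach the three totals to df1 in a single step.
totals = pd.DataFrame(rows, columns=cols, index=df1.index)
result = pd.concat([df1, totals], axis=1)
print(result)
```

This still scans df2 once per row of df1, so it removes the repeated loops but not the underlying slowness on large data.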

Also... is there a more efficient way to do this than using for loops? Maybe by using numpy.where? I'm finding that running multiple for loops over large data sets can be very, very slow.

IIUC, merge first, then filter by the date range and group by id:

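s = df1.merge(df2, on='id', how='left')
# Select the value columns explicitly so the datetime columns are not summed.
output = s[(s.start <= s.date) & (s.end >= s.date)].groupby('id')[['web', 'mob', 'app']].sum()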
output
Out[991]: 
        web   mob   app
id                     
12345  25.0  71.0  40.0
12346  16.0  60.0  39.0

Then merge again:

df1.merge(output.reset_index(),how='left').fillna(0)
Out[995]: 
      id      start        end   web   mob   app
0  12345 2018-01-17 2018-01-20  25.0  71.0  40.0
1  12346 2018-01-19 2018-01-22  16.0  60.0  39.0
2  12347 2018-01-20 2018-01-23   0.0   0.0   0.0
3  12348 2018-01-20 2018-01-23   0.0   0.0   0.0
4  12349 2018-01-21 2018-01-24   0.0   0.0   0.0
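For anyone who wants to run the merge approach end to end, here is a self-contained sketch under a few assumptions: df1 does not yet contain the 'web', 'mob' and 'app' columns, the date columns are already datetime, and the sample data is a made-up subset of the question's tables. The names `cols`, `merged`, `in_range`, `totals` and `result` are illustrative only:

```python
import pandas as pd

# Illustrative sample data (a subset of df1/df2 from the question).
df1 = pd.DataFrame({
    "id": [12345, 12346, 12347],
    "start": pd.to_datetime(["2018-01-17", "2018-01-19", "2018-01-20"]),
    "end": pd.to_datetime(["2018-01-20", "2018-01-22", "2018-01-23"]),
})
df2 = pd.DataFrame({
    "id": [12345] * 4 + [12346] * 4,
    "date": pd.to_datetime([
        "2018-01-17", "2018-01-18", "2018-01-19", "2018-01-20",
        "2018-01-19", "2018-01-20", "2018-01-21", "2018-01-22",
    ]),
    "web": [7, 9, 3, 6, 6, 1, 4, 5],
    "mob": [17, 18, 19, 17, 13, 15, 12, 20],
    "app": [10, 7, 15, 8, 10, 6, 11, 12],
})

cols = ["web", "mob", "app"]

# 1. Pair every df1 row with all df2 rows that share its id.
merged = df1.merge(df2, on="id", how="left")

# 2. Keep only the pairs whose date falls inside [start, end].
in_range = merged[(merged["start"] <= merged["date"]) & (merged["date"] <= merged["end"])]

# 3. Sum the three value columns per id.
totals = in_range.groupby("id")[cols].sum()

# 4. Bring the totals back onto df1; ids with no matching rows get 0.
result = df1.merge(totals.reset_index(), on="id", how="left")
result[cols] = result[cols].fillna(0).astype(int)
print(result)
```

Grouping by 'id' assumes each id appears only once in df1. If the same id could occur with several date ranges, grouping on df1's row index instead (after a `reset_index()`) would keep the ranges separate.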

This is one way, but it is not "pandonic". It assumes your date columns are already converted to datetime. You should prefer @Wen's vectorised solution, though.

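def filtersum(row):
    # Collect (web, mob, app) for every df2 row with the same id
    # and a date inside [start, end].
    result = [(w, m, a) for i, w, m, a, d in
              zip(df2.id, df2.web, df2.mob, df2.app, df2.date)
              if i == row['id'] and (row['start'] <= d <= row['end'])]

    # Sum column-wise; fall back to zeros when nothing matches.
    return [sum(i) for i in zip(*result)] if result else [0, 0, 0]

# result_type='expand' turns each returned list into three columns.
df1[['web', 'mob', 'app']] = df1.apply(filtersum, axis=1, result_type='expand')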

#       id      start        end  web  mob  app
# 0  12345 2018-01-17 2018-01-20   25   71   40
# 1  12346 2018-01-19 2018-01-22   16   60   39
# 2  12347 2018-01-20 2018-01-23    0    0    0
# 3  12348 2018-01-20 2018-01-23    0    0    0
# 4  12349 2018-01-21 2018-01-24    0    0    0
