pandas - using a for loop to append multiple columns to a dataframe
I want to populate the empty columns 'web', 'mob' and 'app' in df1 by summing, for each id, the rows of df2 whose date falls between start and end.
df1:
id start end web mob app
12345 2018-01-17 2018-01-20
12346 2018-01-19 2018-01-22
12347 2018-01-20 2018-01-23
12348 2018-01-20 2018-01-23
12349 2018-01-21 2018-01-24
df2:
id date web mob app
12345 2018-01-17 7 17 10
12345 2018-01-18 9 18 7
12345 2018-01-19 3 19 15
12345 2018-01-20 6 17 8
12345 2018-01-21 8 9 13
12345 2018-01-22 4 15 12
12345 2018-01-23 8 11 13
12345 2018-01-24 9 16 14
12346 2018-01-17 3 17 12
12346 2018-01-18 4 19 4
12346 2018-01-19 6 13 10
12346 2018-01-20 1 15 6
12346 2018-01-21 4 12 11
12346 2018-01-22 5 20 12
12346 2018-01-23 8 13 14
12346 2018-01-24 6 18 8
This for loop will populate the 'web' column:
column = []
for i in df1.index:
    # Sum 'web' over df2 rows for this id whose date lies in [start, end]
    column.append(df2[(df2['date'] >= df1['start'].iloc[i])
                      & (df2['date'] <= df1['end'].iloc[i])
                      & (df2['id'] == df1['id'].iloc[i])]['web'].sum())
df1['web'] = column
I want to be able to populate all 3 columns with one for loop, rather than doing 3 separate loops.
I have a feeling that appending something like this
.agg({'web':'sum', 'mob':'sum', 'app':'sum'})
to a 2-dimensional list could be the answer.
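One way to read the idea above, as a minimal sketch: build the row filter once per df1 row, sum all three columns from it, and collect the sums in a list of lists. The sample frames are a subset of the question's data, the dates are assumed to be real datetimes, and the names `cols`, `rows`, `totals` and `result` are illustrative, not from the original post.

```python
import pandas as pd

# Subset of the question's data; dates are assumed to be datetimes.
df1 = pd.DataFrame({
    "id": [12345, 12346, 12347],
    "start": pd.to_datetime(["2018-01-17", "2018-01-19", "2018-01-20"]),
    "end": pd.to_datetime(["2018-01-20", "2018-01-22", "2018-01-23"]),
})
df2 = pd.DataFrame({
    "id": [12345] * 4 + [12346] * 4,
    "date": pd.to_datetime([
        "2018-01-17", "2018-01-18", "2018-01-19", "2018-01-20",
        "2018-01-19", "2018-01-20", "2018-01-21", "2018-01-22",
    ]),
    "web": [7, 9, 3, 6, 6, 1, 4, 5],
    "mob": [17, 18, 19, 17, 13, 15, 12, 20],
    "app": [10, 7, 15, 8, 10, 6, 11, 12],
})

cols = ["web", "mob", "app"]
rows = []  # one [web, mob, app] list of sums per df1 row
for row in df1.itertuples(index=False):
    # Same filter as the single-column loop, built once per row
    mask = (
        (df2["id"] == row.id)
        & (df2["date"] >= row.start)
        & (df2["date"] <= row.end)
    )
    rows.append(df2.loc[mask, cols].sum().tolist())

# Turn the list of lists into three columns aligned with df1
totals = pd.DataFrame(rows, columns=cols, index=df1.index)
result = df1.join(totals)
print(result)
```

This still does one pass over df2 per row of df1, so it removes the repeated loops but not the underlying cost on large data.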
Also... is there a more efficient way to do this than using for loops? Maybe by using numpy.where? I'm finding that running multiple for loops over large data sets can be very, very slow.
IIUC:
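# Use only df1's key columns so its empty web/mob/app columns do not clash with df2's
s = df1[['id', 'start', 'end']].merge(df2, on='id', how='left')
# Keep rows dated inside [start, end], then sum the three metrics per id
output = s[(s.start <= s.date) & (s.end >= s.date)].groupby('id')[['web', 'mob', 'app']].sum()
output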
Out[991]:
web mob app
id
12345 25.0 71.0 40.0
12346 16.0 60.0 39.0
Then we use merge again:
df1[['id', 'start', 'end']].merge(output.reset_index(), on='id', how='left').fillna(0)
Out[995]:
id start end web mob app
0 12345 2018-01-17 2018-01-20 25.0 71.0 40.0
1 12346 2018-01-19 2018-01-22 16.0 60.0 39.0
2 12347 2018-01-20 2018-01-23 0.0 0.0 0.0
3 12348 2018-01-20 2018-01-23 0.0 0.0 0.0
4 12349 2018-01-21 2018-01-24 0.0 0.0 0.0
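For anyone who wants to run the merge approach end to end, here is a self-contained sketch using the question's data. The helper name `sum_in_window` is hypothetical, and the code assumes each id appears only once in df1 (as in the sample), since the sums are grouped by id.

```python
import pandas as pd

def sum_in_window(df1, df2, cols):
    """Hypothetical helper: per-id sums of `cols` over df2 rows dated in [start, end]."""
    keys = df1[["id", "start", "end"]]
    merged = keys.merge(df2, on="id", how="inner")
    # between() is inclusive on both ends by default
    in_window = merged["date"].between(merged["start"], merged["end"])
    sums = merged.loc[in_window].groupby("id")[cols].sum()
    out = keys.merge(sums.reset_index(), on="id", how="left")
    # ids with no matching rows come back as NaN; report them as 0
    out[cols] = out[cols].fillna(0).astype(int)
    return out

df1 = pd.DataFrame({
    "id": [12345, 12346, 12347, 12348, 12349],
    "start": pd.to_datetime(["2018-01-17", "2018-01-19", "2018-01-20",
                             "2018-01-20", "2018-01-21"]),
    "end": pd.to_datetime(["2018-01-20", "2018-01-22", "2018-01-23",
                           "2018-01-23", "2018-01-24"]),
})
df2 = pd.DataFrame({
    "id": [12345] * 8 + [12346] * 8,
    "date": list(pd.date_range("2018-01-17", periods=8)) * 2,
    "web": [7, 9, 3, 6, 8, 4, 8, 9, 3, 4, 6, 1, 4, 5, 8, 6],
    "mob": [17, 18, 19, 17, 9, 15, 11, 16, 17, 19, 13, 15, 12, 20, 13, 18],
    "app": [10, 7, 15, 8, 13, 12, 13, 14, 12, 4, 10, 6, 11, 12, 14, 8],
})

out = sum_in_window(df1, df2, ["web", "mob", "app"])
print(out)
```

The merge pairs every df1 row with every df2 row for the same id before filtering, so memory use grows with the number of rows per id. That is usually still much faster than a Python-level loop.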
This is one way, but it is not "pandonic". It assumes your date columns are already converted to datetime. That said, use @Wen's vectorised solution instead.
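def filtersum(row):
    # Collect (web, mob, app) from df2 rows matching this id and date window
    result = [(w, m, a) for i, w, m, a, d in
              zip(df2.id, df2.web, df2.mob, df2.app, df2.date)
              if i == row['id'] and (row['start'] <= d <= row['end'])]
    # Column-wise sums, or zeros when nothing matched
    return [sum(i) for i in zip(*result)] if result else [0, 0, 0]

# result_type='expand' turns each returned list into three columns
df1[['web', 'mob', 'app']] = df1.apply(filtersum, axis=1, result_type='expand')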
# id start end web mob app
# 0 12345 2018-01-17 2018-01-20 25 71 40
# 1 12346 2018-01-19 2018-01-22 16 60 39
# 2 12347 2018-01-20 2018-01-23 0 0 0
# 3 12348 2018-01-20 2018-01-23 0 0 0
# 4 12349 2018-01-21 2018-01-24 0 0 0
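The datetime assumption mentioned in this answer can be met with `pd.to_datetime`. A minimal sketch, assuming the dates arrive as ISO-formatted strings like those shown in the question (the frame `df` here is illustrative):

```python
import pandas as pd

# Illustrative frame with dates stored as strings
df = pd.DataFrame({
    "id": [12345, 12346],
    "start": ["2018-01-17", "2018-01-19"],
    "end": ["2018-01-20", "2018-01-22"],
})

# Convert each date column to datetime64 so comparisons are chronological
for col in ["start", "end"]:
    df[col] = pd.to_datetime(df[col], format="%Y-%m-%d")

print(df.dtypes)
```

The same conversion would apply to the `date` column of df2 before comparing it against `start` and `end`.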