Python pandas make new column from data in existing column and from another dataframe
I have a DataFrame called 'mydata', and if I do
len(mydata.loc['2015-9-2'])
it counts the number of rows in mydata that have that date, and returns a number like
1067
I have another DataFrame called 'yourdata', which looks something like
timestamp
51 2015-06-22
52 2015-06-23
53 2015-06-24
54 2015-06-25
43 2015-07-13
Now I want to use each date in yourdata, so that instead of typing in each date by hand
len(mydata.loc['2015-9-2'])
I can iterate through 'yourdata' and use its dates, like
len(mydata.loc[yourdata['timestamp']])
and produce a new DataFrame with the results, or just add a new column to yourdata with the result for each date. I'm lost as to how to do this.
The following does not work:
yourdata['result'] = len(mydata.loc[yourdata['timestamp']])
Neither does this:
yourdata['result'] = len(mydata.loc[yourdata.iloc[:,-3]])
This does work:
yourdata['result'] = len(mydata.loc['2015-9-2'])
but that's no good, as I want to use the date in each row, not some fixed date.
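To illustrate the difference between the fixed-date assignment and a per-row count, here is a minimal sketch. The sample values are made up, and it assumes mydata has a DatetimeIndex (which the `.loc['2015-9-2']` lookup suggests); the column names `fixed` and `result` are only for illustration. `len()` returns a single integer, so assigning it fills every row with the same number, whereas `apply` evaluates the count once per date.

```python
import pandas as pd

# Hypothetical sample data; the real mydata is assumed to have a DatetimeIndex.
mydata = pd.DataFrame(
    {"BPM": [65, 65, 66, 70]},
    index=pd.to_datetime([
        "2015-06-22 16:48:00", "2015-06-22 16:48:10",
        "2015-06-23 16:48:00", "2015-06-25 16:48:00",
    ]),
)
yourdata = pd.DataFrame(
    {"timestamp": pd.to_datetime(["2015-06-22", "2015-06-23", "2015-06-24"])}
)

# len() yields one integer, so this assignment repeats it in every row.
yourdata["fixed"] = len(mydata.loc["2015-06-22"])

# Strip the time of day from the index so it can be compared with plain dates.
days = mydata.index.normalize()

# Count the matching rows separately for the date in each row of yourdata.
yourdata["result"] = yourdata["timestamp"].apply(
    lambda day: int((days == day).sum())
)
print(yourdata)
```

This row-by-row version is easy to read but scans mydata once per date, so for large frames a single grouped count that is then looked up per date is likely to be faster.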
Edit: first few rows of mydata
timestamp BPM
0 2015-08-30 16:48:00 65
1 2015-08-30 16:48:10 65
2 2015-08-30 16:48:15 66
3 2015-08-30 16:48:20 67
4 2015-08-30 16:48:30 70
import numpy as np
import pandas as pd
mydata = pd.DataFrame({'timestamp': ['2015-06-22 16:48:00']*3 +
['2015-06-23 16:48:00']*2 +
['2015-06-24 16:48:00'] +
['2015-06-25 16:48:00']*4 +
['2015-07-13 16:48:00',
'2015-08-13 16:48:00'],
'BPM': [65]*8 + [70]*4})
mydata['timestamp'] = pd.to_datetime(mydata['timestamp'])
print(mydata)
# BPM timestamp
# 0 65 2015-06-22 16:48:00
# 1 65 2015-06-22 16:48:00
# 2 65 2015-06-22 16:48:00
# 3 65 2015-06-23 16:48:00
# 4 65 2015-06-23 16:48:00
# 5 65 2015-06-24 16:48:00
# 6 65 2015-06-25 16:48:00
# 7 65 2015-06-25 16:48:00
# 8 70 2015-06-25 16:48:00
# 9 70 2015-06-25 16:48:00
# 10 70 2015-07-13 16:48:00
# 11 70 2015-08-13 16:48:00
yourdata = pd.Series(['2015-06-22', '2015-06-23', '2015-06-24',
'2015-06-25', '2015-07-13'], name='timestamp')
yourdata = pd.to_datetime(yourdata).to_frame()
print(yourdata)
# timestamp
# 0 2015-06-22
# 1 2015-06-23
# 2 2015-06-24
# 3 2015-06-25
# 4 2015-07-13
result = (mydata.set_index('timestamp').resample('D')
.size().loc[yourdata['timestamp']]
.reset_index())
result.columns = ['timestamp', 'result']
print(result)
# timestamp result
# 0 2015-06-22 3
# 1 2015-06-23 2
# 2 2015-06-24 1
# 3 2015-06-25 4
# 4 2015-07-13 1
I think you need value_counts, but first convert the timestamps to dates with dt.date, convert those back to datetimes with to_datetime, and finally use join:
print (yourdata.join(pd.to_datetime(mydata.timestamp.dt.date)
.value_counts()
.rename('len'), on='timestamp'))
Sample:
print (mydata)
timestamp BPM
0 2015-06-23 16:48:00 65
1 2015-06-23 16:48:10 65
2 2015-06-23 16:48:15 66
3 2015-06-23 16:48:20 67
4 2015-06-22 16:48:30 70
print (yourdata)
timestamp
51 2015-06-22
52 2015-06-23
53 2015-06-24
54 2015-06-25
43 2015-07-13
# only needed if the columns are not already datetime dtype
mydata['timestamp'] = pd.to_datetime(mydata['timestamp'])
yourdata['timestamp'] = pd.to_datetime(yourdata['timestamp'])
print (yourdata.join(pd.to_datetime(mydata.timestamp.dt.date)
.value_counts()
.rename('len'), on='timestamp'))
timestamp len
51 2015-06-22 1.0
52 2015-06-23 4.0
53 2015-06-24 NaN
54 2015-06-25 NaN
43 2015-07-13 NaN
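The NaN values appear because those dates have no rows in mydata, and their presence turns the whole column into floats. If zeros and integer counts are preferred, one possible follow-up is to fill the gaps and cast. This is only a sketch on the sample data above; it uses `dt.normalize()` as an assumed shortcut for the `dt.date` plus `to_datetime` round trip, and the variable names `counts` and `out` are illustrative.

```python
import pandas as pd

# Sample data reproduced from the example above.
mydata = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2015-06-23 16:48:00", "2015-06-23 16:48:10", "2015-06-23 16:48:15",
        "2015-06-23 16:48:20", "2015-06-22 16:48:30",
    ]),
    "BPM": [65, 65, 66, 67, 70],
})
yourdata = pd.DataFrame(
    {"timestamp": pd.to_datetime(
        ["2015-06-22", "2015-06-23", "2015-06-24", "2015-06-25", "2015-07-13"]
    )},
    index=[51, 52, 53, 54, 43],
)

# Rows per calendar day; normalize() sets the time of day to midnight.
counts = mydata["timestamp"].dt.normalize().value_counts().rename("len")

# Left join on the date column, then replace missing counts with 0.
out = yourdata.join(counts, on="timestamp")
out["len"] = out["len"].fillna(0).astype(int)
print(out)
```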