Transferring values from multiple columns in a dataframe to a new column in another dataframe, based on a time criterion
I'm pretty new to Python and am starting to realize its potential for serious number-crunching.
Presently, I wish to create a new column in a pandas dataframe (df1) and populate it with the numeric value from one of 24 columns (named 0.0 - 23.0) in another dataframe (df2). Each of the columns 0.0 - 23.0 represents one hour (00.00-00.59, 01.00-01.59 and so on). I want to perform my operation based on a time criterion.
There is a column 'time' in df1, with datetime values in the format YYYY-mm-dd HH:MM:SS. This column is not the index column of df1, so several rows may have the same value of 'time'. df1 contains a total of 300,000 rows.
The index of df2 is a column 'date' which contains values in the form YYYY-mm-dd. df2 covers 3 years and hence contains a total of about 1,200 rows.
For example, if the value of 'time' in df1 is 2011-01-01 12:01:20, I want to populate the new column in df1 with the numeric value from column 12.0 in df2, taken from the row with index 2011-01-01.
I have tried merging the two dataframes, which gave me a new dataframe containing df1 plus the columns 0.0 - 23.0 matched to the correct date. I did this by converting 'time' to the YYYY-mm-dd format and applying .merge. However, the resulting dataframe is a bit too messy: every row carries all 24 hourly columns, when I only need the one value for the matching hour.
Furthermore, I would like to write a function that evaluates the new column in df1, so that I can check afterwards that the values imported from df2 are correct.
df1
KEY time
252752 2011-01-01 04:20:00
281789 2011-01-02 01:18:00
242674 2011-01-03 03:08:00
189497 2011-01-04 00:17:00
189498 2011-01-05 05:31:00
... ...
df2
date 0.0 1.0 2.0 3.0 4.0 5.0 ... 23.0
2011-01-01 0.919355 0.925806 0.929032 0.932258 0.938710 0.953947 ... 1.037975
2011-01-02 1.026144 1.019608 1.022876 1.032680 1.035948 1.035948 ... 0.919355
2011-01-03 1.025316 1.034810 1.037975 1.034810 1.044304 1.044304 ... 1.018987
2011-01-04 1.018987 1.025316 1.031646 1.044304 1.047468 1.050633 ... 0.932258
2011-01-05 1.018987 1.018987 1.018987 1.022152 1.031646 1.037975 ... 0.953947
... ... ... ... ... ... ... ... ...
desired result
KEY time value
252752 2011-01-01 04:20:00 0.938710
281789 2011-01-02 01:18:00 1.019608
242674 2011-01-03 03:08:00 1.034810
189497 2011-01-04 00:17:00 1.018987
189498 2011-01-05 05:31:00 1.037975
... ... ...
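The backward check described above could be sketched roughly as follows. This is only a minimal sketch, not a definitive implementation: the function name check_values and the tol parameter are hypothetical, the sample frames are copied from the tables above (hours 0.0 - 5.0 only), and it assumes that 'time' has already been parsed to datetimes, that df2 has a unique datetime index, and that every date in df1 exists in df2 (a missing date would raise a KeyError here).

```python
import pandas as pd

dates = ["2011-01-01", "2011-01-02", "2011-01-03", "2011-01-04", "2011-01-05"]

# df1 as it should look once the new column has been filled (desired result).
df1 = pd.DataFrame({
    "KEY": [252752, 281789, 242674, 189497, 189498],
    "time": pd.to_datetime([
        "2011-01-01 04:20:00", "2011-01-02 01:18:00", "2011-01-03 03:08:00",
        "2011-01-04 00:17:00", "2011-01-05 05:31:00",
    ]),
    "value": [0.938710, 1.019608, 1.034810, 1.018987, 1.037975],
})

# df2 restricted to the hour columns shown in the question.
df2 = pd.DataFrame(
    [
        [0.919355, 0.925806, 0.929032, 0.932258, 0.938710, 0.953947],
        [1.026144, 1.019608, 1.022876, 1.032680, 1.035948, 1.035948],
        [1.025316, 1.034810, 1.037975, 1.034810, 1.044304, 1.044304],
        [1.018987, 1.025316, 1.031646, 1.044304, 1.047468, 1.050633],
        [1.018987, 1.018987, 1.018987, 1.022152, 1.031646, 1.037975],
    ],
    index=pd.to_datetime(dates).rename("date"),
    columns=[0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
)


def check_values(df1, df2, tol=1e-9):
    """Return the rows of df1 whose 'value' differs from the matching df2 cell.

    Hypothetical helper: it repeats the lookup one row at a time, on purpose
    independently of however the 'value' column was originally filled.
    """
    def expected(ts):
        # Row label: the date at midnight; column label: the hour as a float.
        return df2.at[ts.normalize(), float(ts.hour)]

    expected_values = df1["time"].map(expected)
    mismatch = (df1["value"] - expected_values).abs() > tol
    return df1[mismatch]


# An empty frame means every imported value matches df2.
print(check_values(df1, df2))
```

Looking up one row at a time is slow compared with a vectorized lookup, but for a one-off check on 300,000 rows that is probably acceptable, and it keeps the check independent of the code being checked.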
I'm not sure if this helps, but this is how I would write it (with plain dicts rather than pandas):
### just to have your test data
df1_val = ("252752 2011-01-01 04:20:00",
           "281789 2011-01-02 01:18:00",
           "242674 2011-01-03 03:08:00",
           "189497 2011-01-04 00:17:00",
           "189498 2011-01-05 05:31:00")
df1 = {}
for row in df1_val:
    # key is the full 6-digit KEY; value is (date, time)
    # (slicing only 5 characters would make 189497 and 189498 collide)
    df1[row[0:6]] = (row[7:17], row[18:])
df2_val = ("2011-01-01 0.919355 0.925806 0.929032 0.932258 0.938710 0.953947",
           "2011-01-02 1.026144 1.019608 1.022876 1.032680 1.035948 1.035948",
           "2011-01-03 1.025316 1.034810 1.037975 1.034810 1.044304 1.044304",
           "2011-01-04 1.018987 1.025316 1.031646 1.044304 1.047468 1.050633",
           "2011-01-05 1.018987 1.018987 1.018987 1.022152 1.031646 1.037975")
df2 = {}
for row in df2_val:
    # key is the date; value is the tuple of values for hours 0 to 5
    date, zero, one, two, three, four, five = row.split(" ")
    df2[date] = (zero, one, two, three, four, five)
#### build the result dict
result = {}
for key in df1:
    # the first two characters of the time string are the hour
    hour = int(df1[key][1][:2])
    date = df1[key][0]
    result[key] = (df1[key][0] + " " + df1[key][1], df2[date][hour])
    print(key)
    print(result[key])
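For the pandas dataframes in the question, the same date-plus-hour lookup could be sketched without a Python-level loop, which matters more at 300,000 rows. This is a sketch under stated assumptions rather than a definitive solution: the function name lookup_hourly_value is hypothetical, the sample data covers only hours 0.0 - 5.0, 'time' is assumed to be parsed to datetimes, and df2 is assumed to have a unique datetime index and float column labels. Rows whose date or hour is not found in df2 get NaN.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question (only hours 0.0 - 5.0 are shown there).
df1 = pd.DataFrame({
    "KEY": [252752, 281789, 242674, 189497, 189498],
    "time": pd.to_datetime([
        "2011-01-01 04:20:00", "2011-01-02 01:18:00", "2011-01-03 03:08:00",
        "2011-01-04 00:17:00", "2011-01-05 05:31:00",
    ]),
})
dates = ["2011-01-01", "2011-01-02", "2011-01-03", "2011-01-04", "2011-01-05"]
df2 = pd.DataFrame(
    [
        [0.919355, 0.925806, 0.929032, 0.932258, 0.938710, 0.953947],
        [1.026144, 1.019608, 1.022876, 1.032680, 1.035948, 1.035948],
        [1.025316, 1.034810, 1.037975, 1.034810, 1.044304, 1.044304],
        [1.018987, 1.025316, 1.031646, 1.044304, 1.047468, 1.050633],
        [1.018987, 1.018987, 1.018987, 1.022152, 1.031646, 1.037975],
    ],
    index=pd.to_datetime(dates).rename("date"),
    columns=[0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
)


def lookup_hourly_value(df1, df2):
    """For each row of df1, pick the df2 cell at (date of 'time', hour of 'time')."""
    # Positional row index in df2 for each timestamp's date (-1 if absent).
    row_pos = df2.index.get_indexer(df1["time"].dt.normalize())
    # Positional column index in df2 for each timestamp's hour (-1 if absent).
    col_pos = df2.columns.get_indexer(df1["time"].dt.hour.astype(float))
    found = (row_pos >= 0) & (col_pos >= 0)
    values = np.full(len(df1), np.nan)
    # Element-wise (row, column) selection via NumPy integer indexing.
    values[found] = df2.to_numpy()[row_pos[found], col_pos[found]]
    return pd.Series(values, index=df1.index, name="value")


df1["value"] = lookup_hourly_value(df1, df2)
print(df1)
```

Unlike the merge described in the question, this never attaches all 24 hourly columns to df1; it only computes one position pair per row and reads the matching cells from df2.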