Merging DataFrames with pandas

I have multiple files containing dates and measured values. Their setup is identical:

YYYY  MM  DD  val1
YYYY  MM  DD  val2
YYYY  MM  DD  val3

I use the following to read each of these files into a DataFrame:

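import os
import pandas as pd

for cur_file in file_list:
    # Columns are whitespace-separated: year, month, day, value.
    # Note: nested-list parse_dates is deprecated in recent pandas releases.
    cur_df = pd.read_table(os.path.join(data_path, cur_file)
                           , header=None
                           , sep=r'\s+'
                           , parse_dates=[[0, 1, 2]]
                           , names=['year', 'month', 'day', cur_file[:-4]]
                           , index_col=[0]
                           )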

The dates are not identical in all files. There is sometimes some overlap, but not always.

I could plot each cur_df individually via

cur_df.plot()

in the loop.

It seems like it would be a good idea to have all the cur_df in one "big" DataFrame, both for plotting and for statistics later on. How would this ideally be done, considering they do not all have the same dates? Is there a way to "merge" multiple DataFrames so that dates occurring in only one of the underlying DataFrames are kept?

I guess I am looking for a DataFrame that looks like this:

YYYY MM DD  val1(from1)  NaN
YYYY MM DD  val2(from1)  val2(from2)
YYYY MM DD  NaN          val3(from2)

In the first line it would take the date stamp from the date of val1; in line two the dates of val1 and val2 are identical; and in line three it would take the date from val2.

I looked into cur_df.add(cur_df2), which as far as I can tell appends the two DataFrames. I am not sure what cur_df.combine(cur_df2, ...) would do, especially since I am not sure which function should be passed as the second argument.

Thanks for your help, Cheers, Claus

From your code snippet it looks like the parsed date value should be the index and each DataFrame will have its values under a different column name, right? In that case I think an iterative call to DataFrame.combine_first should do the trick.
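For illustration, here is a minimal sketch of what such an iterative combine_first call could look like. The two small frames (df1, df2), their column names and their dates are made up for the example; in the question they would be the cur_df objects collected in a list inside the loop.

```python
from functools import reduce

import pandas as pd

# Hypothetical stand-ins for the per-file DataFrames: each has a date index
# and a single value column named after its file.
df1 = pd.DataFrame({"from1": [1.0, 2.0]},
                   index=pd.to_datetime(["2012-01-01", "2012-01-02"]))
df2 = pd.DataFrame({"from2": [20.0, 30.0]},
                   index=pd.to_datetime(["2012-01-02", "2012-01-03"]))
frames = [df1, df2]

# combine_first aligns on the union of both indexes and columns, keeps the
# caller's values, and fills its missing entries from the argument.
big_df = reduce(lambda left, right: left.combine_first(right), frames)
print(big_df)
```

Dates present in only one frame end up with NaN in the other frame's column. When every frame has its own column name, pd.concat(frames, axis=1) should give a similar outer-joined result in a single call.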

Also, are you passing in "keep_date_col=True" as well? By default the parser should throw away the component date columns when parsing multiple date components into one (if it does not, that is a bug, so please let me know).
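As a side note, recent pandas releases deprecate both the nested-list form of parse_dates and keep_date_col, so on a current install it may be simpler to assemble the date index after reading. The sketch below assumes the question's YYYY MM DD value layout; the in-memory sample data and the column name val are invented for the example.

```python
import io

import pandas as pd

# Hypothetical file contents in the question's layout: YYYY MM DD value.
raw = io.StringIO("2012 01 01 1.5\n2012 01 02 2.5\n")

df = pd.read_csv(raw, sep=r"\s+", header=None,
                 names=["year", "month", "day", "val"])

# to_datetime can assemble dates from columns named year, month and day.
df.index = pd.to_datetime(df[["year", "month", "day"]])

# Drop the component columns so only the value column remains,
# which is what the parser is described as doing by default.
df = df.drop(columns=["year", "month", "day"])
print(df)
```

The resulting frame has a DatetimeIndex and a single value column, so it can go straight into the combine_first approach.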

Best,

Chang


 