
How can I copy values from one dataframe column to another based on the difference between the values

I have two mirrored CSV files generated by two different servers. Both files have the same number of lines and should have exactly the same Unix timestamp column. However, due to some clock issues, some records in one file may differ by one nanosecond from their counterpart in the other CSV file. See the example below; the difference is always 1:

dataframe_A                                          dataframe_B

|          | ts_ns              |            |          | ts_ns              |
| -------- | ------------------ |            | -------- | ------------------ |
| 1        | 1661773636777407794|            | 1        | 1661773636777407793|
| 2        | 1661773636786474677|            | 2        | 1661773636786474677|
| 3        | 1661773636787956823|            | 3        | 1661773636787956823|
| 4        | 1661773636794333099|            | 4        | 1661773636794333100|

Since these are huge files with millions of lines, I use pandas and dask to process them, but before processing I need to ensure they have the same timestamp column. I need to check the difference between column ts_ns in A and B, and if the difference is 1 or -1, replace the value in B with the corresponding ts_ns value in A, so that both files end up with the same ts_ns value for corresponding records.

How can I do this in a decent way using pandas/dask?

If you're sure that the timestamps should be identical, why don't you simply take the timestamp column from dataframe A and overwrite the timestamp column in dataframe B with it?

Why even check whether the difference is there or not?
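A minimal sketch of that overwrite, assuming both frames have the same number of rows in the same order (as the question states). The sanity check on the largest difference is my own addition and not part of the suggestion; it only guards against the assumption being wrong before anything is overwritten.

```python
import numpy as np
import pandas as pd

df_a = pd.DataFrame({"ts_ns": [1661773636777407794, 1661773636786474677,
                               1661773636787956823, 1661773636794333099]})
df_b = pd.DataFrame({"ts_ns": [1661773636777407793, 1661773636786474677,
                               1661773636787956823, 1661773636794333100]})

# The overwrite is positional, so the row counts must match.
if len(df_a) != len(df_b):
    raise ValueError("dataframes have different lengths")

a_vals = df_a["ts_ns"].to_numpy()
b_vals = df_b["ts_ns"].to_numpy()

# Optional guard (an assumption, not from the thread): refuse to overwrite
# if any pair of rows differs by more than 1 ns.
max_abs_diff = int(np.abs(a_vals - b_vals).max())
if max_abs_diff > 1:
    raise ValueError(f"unexpected difference of {max_abs_diff} ns")

# Overwrite B's timestamps with A's, row by row.
df_b["ts_ns"] = a_vals
```

Using `to_numpy()` keeps the assignment positional, so it does not depend on the two frames sharing index labels.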

You can use the pandas merge_asof function for this, see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge_asof.html . The tolerance parameter accepts an int or a Timedelta; for your example it should be set to 1, with direction set to 'nearest'.
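Here is one way the merge_asof suggestion might look on the sample data. This is a sketch, not necessarily what the answerer had in mind: the helper column `ts_ns_a` and the use of the nullable `Int64` dtype are my assumptions. The nullable dtype is there so that unmatched rows become `<NA>` instead of forcing a conversion to float, which would lose precision on nanosecond timestamps. Note that merge_asof requires both frames to be sorted by the key.

```python
import pandas as pd

df_a = pd.DataFrame({"ts_ns": [1661773636777407794, 1661773636786474677,
                               1661773636787956823, 1661773636794333099]})
df_b = pd.DataFrame({"ts_ns": [1661773636777407793, 1661773636786474677,
                               1661773636787956823, 1661773636794333100]})

# Carry a copy of A's timestamp in a helper column (hypothetical name),
# because the merged key column keeps the value from the left frame (B).
lookup = df_a.assign(ts_ns_a=df_a["ts_ns"].astype("Int64"))

# For each row of B, find the nearest timestamp in A within 1 ns.
matched = pd.merge_asof(df_b, lookup, on="ts_ns",
                        tolerance=1, direction="nearest")

# Use A's value where a match was found, otherwise keep B's own value.
df_b_fixed = df_b.copy()
df_b_fixed["ts_ns"] = (matched["ts_ns_a"]
                       .fillna(matched["ts_ns"])
                       .astype("int64")
                       .to_numpy())
```

Unlike an index-based approach, this matches on the timestamp values themselves, so it does not rely on the two files having their rows in the same positions.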

Assuming your files are identical except for the ts_ns column, you can perform a .merge on the indices.

import numpy as np
import pandas as pd

df_a = pd.DataFrame({'ts_ns': [1661773636777407794, 1661773636786474677, 1661773636787956823, 1661773636794333099]})
df_b = pd.DataFrame({'ts_ns': [1661773636777407793, 1661773636786474677, 1661773636787956823, 1661773636794333100]})

# Join A onto B by row index, then take A's timestamp wherever the two differ by at most 1 ns.
df_b = (df_b
    .merge(df_a, how='left', left_index=True, right_index=True, suffixes=('', '_a'))
    .assign(
        ts_ns = lambda df_: np.where(abs(df_.ts_ns - df_.ts_ns_a) <= 1, df_.ts_ns_a, df_.ts_ns)
    )
    .loc[:, ['ts_ns']]  # drop the helper column ts_ns_a
)

But I agree with @ManEngel, just overwrite all the values if you know they are identical.
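If an unconditional overwrite feels too risky, the conditional replacement can probably also be done without the merge step, provided the rows of both frames line up positionally. The sketch below assumes exactly that; the function name `align_timestamps` and the extra fifth row (which differs by 5 ns) are made up for illustration.

```python
import numpy as np
import pandas as pd

def align_timestamps(a: pd.Series, b: pd.Series, max_diff: int = 1) -> pd.Series:
    """Return b with values replaced by a's where |a - b| <= max_diff.

    Hypothetical helper: rows are compared by position, not by index label.
    """
    if len(a) != len(b):
        raise ValueError("series have different lengths")
    a_vals = a.to_numpy()
    b_vals = b.to_numpy()
    # Integer arithmetic keeps the nanosecond values exact.
    close = np.abs(a_vals - b_vals) <= max_diff
    return pd.Series(np.where(close, a_vals, b_vals), index=b.index, name=b.name)

df_a = pd.DataFrame({"ts_ns": [1661773636777407794, 1661773636786474677,
                               1661773636787956823, 1661773636794333099,
                               1661773636800000000]})
df_b = pd.DataFrame({"ts_ns": [1661773636777407793, 1661773636786474677,
                               1661773636787956823, 1661773636794333100,
                               1661773636800000005]})  # last row is off by 5 ns

df_b["ts_ns"] = align_timestamps(df_a["ts_ns"], df_b["ts_ns"])
```

Rows that differ by more than `max_diff` are left untouched, so they remain visible for later inspection rather than being silently overwritten.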


