How can I copy values from one dataframe column to another based on the difference between the values?
I have two mirror csv files generated by two different servers. Both files have the same number of lines and should have exactly the same unix timestamp column. However, due to some clock issues, some records in one file might differ by one nanosecond from their counterpart records in the other csv file. See the example below; the difference is always 1:
dataframe_A dataframe_B
| | ts_ns | | | ts_ns |
| -------- | ------------------ | | -------- | ------------------ |
| 1 | 1661773636777407794| | 1 | 1661773636777407793|
| 2 | 1661773636786474677| | 2 | 1661773636786474677|
| 3 | 1661773636787956823| | 3 | 1661773636787956823|
| 4 | 1661773636794333099| | 4 | 1661773636794333100|
Since these are huge files with millions of lines, I use pandas and dask to process them, but before processing I need to ensure they have the same timestamp column. I need to check the difference between column ts_ns in A and B, and if there is a difference of 1 or -1, I need to replace the value in B with the corresponding ts_ns value in A, so that both files end up with the same ts_ns value for corresponding records.
How can I do this in a decent way using pandas/dask?
If you're sure that the timestamps should be identical, why don't you simply use the timestamp column from dataframe A and overwrite the timestamp column in dataframe B with it?
Why even check whether the difference is there or not?
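A minimal sketch of that suggestion, assuming both frames are already row-aligned (same order, same length). The length and maximum-difference checks are an added safeguard, not part of the comment above, and the sample values are the ones from the question.

```python
import pandas as pd

df_a = pd.DataFrame({"ts_ns": [1661773636777407794, 1661773636786474677,
                               1661773636787956823, 1661773636794333099]})
df_b = pd.DataFrame({"ts_ns": [1661773636777407793, 1661773636786474677,
                               1661773636787956823, 1661773636794333100]})

# Added safeguard (an assumption): make sure the rows really correspond
if len(df_a) != len(df_b):
    raise ValueError("row counts differ, the files are not mirrors")
max_diff = (df_a["ts_ns"] - df_b["ts_ns"]).abs().max()
if max_diff > 1:
    raise ValueError(f"unexpected timestamp difference: {max_diff}")

# Overwrite positionally; to_numpy() sidesteps index alignment
df_b["ts_ns"] = df_a["ts_ns"].to_numpy()
```

If the check passes, overwriting the whole column gives the same result as replacing only the rows that differ by 1.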
You can use the pandas merge_asof function for this, see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge_asof.html . The tolerance parameter accepts an int or a timedelta and should be set to 1 for your example, with direction set to nearest.
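A rough sketch of how that could look with the sample values from the question. A few assumptions: both frames are sorted by ts_ns (merge_asof requires this), matching is done by nearest value rather than by row position, and ts_ns_a is a helper column name chosen here for illustration.

```python
import pandas as pd

df_a = pd.DataFrame({"ts_ns": [1661773636777407794, 1661773636786474677,
                               1661773636787956823, 1661773636794333099]})
df_b = pd.DataFrame({"ts_ns": [1661773636777407793, 1661773636786474677,
                               1661773636787956823, 1661773636794333100]})

# Copy A's key into a helper column so it survives the merge.
# The nullable Int64 dtype keeps unmatched rows as <NA> instead of
# upcasting to float, which would lose nanosecond precision.
right = df_a.assign(ts_ns_a=df_a["ts_ns"].astype("Int64"))

# For each row of B, find the nearest ts_ns in A within +/- 1
merged = pd.merge_asof(df_b, right, on="ts_ns", direction="nearest", tolerance=1)

# Use A's value where a match was found, otherwise keep B's own value
df_b_fixed = merged[["ts_ns"]].copy()
df_b_fixed["ts_ns"] = merged["ts_ns_a"].fillna(merged["ts_ns"]).astype("int64")
```

Because the match is by value, this also works when the rows are not in the same position in both files, as long as each timestamp in B has a counterpart in A within the tolerance.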
Assuming your files are identical except for the ts_ns column, you can perform a .merge on indices.
import numpy as np
import pandas as pd

df_a = pd.DataFrame({'ts_ns': [1661773636777407794, 1661773636786474677, 1661773636787956823, 1661773636794333099]})
df_b = pd.DataFrame({'ts_ns': [1661773636777407793, 1661773636786474677, 1661773636787956823, 1661773636794333100]})

df_b = (
    df_b
    # bring A's timestamps alongside B's, aligned on the row index
    .merge(df_a, how='left', left_index=True, right_index=True, suffixes=('', '_a'))
    # take A's value where the two differ by at most 1 ns, otherwise keep B's
    .assign(ts_ns=lambda df_: np.where((df_.ts_ns - df_.ts_ns_a).abs() <= 1, df_.ts_ns_a, df_.ts_ns))
    # drop the helper column again
    .loc[:, ['ts_ns']]
)
But I agree with @ManEngel: just overwrite all the values if you know they are identical.
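Since the files have millions of lines, the same comparison could also be applied chunk by chunk while reading the CSVs. This is only a sketch with plain pandas (not dask), assuming both files have the same row order and row count. The io.StringIO objects stand in for the real file paths, and align_chunks and its chunksize value are names made up for this illustration.

```python
import io

import numpy as np
import pandas as pd

# Stand-ins for the two CSV files; real code would pass file paths instead
csv_a = io.StringIO(
    "ts_ns\n1661773636777407794\n1661773636786474677\n"
    "1661773636787956823\n1661773636794333099\n"
)
csv_b = io.StringIO(
    "ts_ns\n1661773636777407793\n1661773636786474677\n"
    "1661773636787956823\n1661773636794333100\n"
)


def align_chunks(file_a, file_b, chunksize=2):
    """Yield chunks of B with ts_ns snapped to A where they differ by at most 1."""
    reader_a = pd.read_csv(file_a, chunksize=chunksize)
    reader_b = pd.read_csv(file_b, chunksize=chunksize)
    # zip stops at the shorter reader, so equal row counts are assumed
    for chunk_a, chunk_b in zip(reader_a, reader_b):
        a = chunk_a["ts_ns"].to_numpy()
        b = chunk_b["ts_ns"].to_numpy()
        chunk_b["ts_ns"] = np.where(np.abs(a - b) <= 1, a, b)
        yield chunk_b


result = pd.concat(list(align_chunks(csv_a, csv_b)), ignore_index=True)
```

In practice each corrected chunk would more likely be appended to an output file than concatenated in memory.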