Applying corrections to a subsampled copy of a dataframe back to the original dataframe?

I'm a Pandas newbie, so please bear with me.

Overview: I started with a free-form text file created by a data harvesting script that remotely accessed dozens of different kinds of devices, and multiple instances of each. I used OpenRefine (a truly wonderful tool) to munge that into a CSV, which I then loaded into the dataframe df using Pandas in a JupyterLab notebook.

My first inspection of the data showed that the 'Timestamp' column was not monotonic. I accessed individual data sources as follows, in this case the 'T-meter' data source. (The technique was taken from a search result. I don't really understand it, but it worked.)

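# Boolean mask that is True for the rows coming from the 'T-meter' source
cond = df['Source'] == 'T-meter'
# DataFrame.append() was removed in pandas 2.0; selecting with the mask and
# renumbering the rows gives the same result as append(..., ignore_index=True)
df_tmeter = df.loc[cond, :].reset_index(drop=True)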

I then checked each one as follows:

df_tmeter['Timestamp'].is_monotonic_increasing   # is_monotonic was removed in pandas 2.0
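
The same check can also be run for every source in one pass instead of one subset at a time. This is only a minimal sketch: the sample data is invented, and 'P-meter' is a made-up second source name used for illustration.

```python
import pandas as pd

# Invented sample data standing in for the real df
df = pd.DataFrame({
    'Source': ['T-meter', 'P-meter', 'T-meter', 'P-meter', 'T-meter'],
    'Timestamp': [100, 50, 5, 60, 110],
})

# One boolean per source: True if that source's timestamps never decrease
monotonic_by_source = df.groupby('Source')['Timestamp'].apply(
    lambda s: s.is_monotonic_increasing
)
print(monotonic_by_source)
```

Sources that report False are the ones worth pulling out for a closer look.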

Fortunately, the problem was easy to identify and fix: some sensors were resetting, then sending bad (but still monotonic) timestamps until their clocks were updated. I wrote the function healing() to cleanly patch such errors, and it worked a treat:

df_tmeter['healed'] = df_tmeter['Timestamp'].apply(healing)

Now for my questions:

  1. How do I get the 'healed' values back into the original df['Timestamp'] column for only the 'T-meter' items in df['Source']?

  2. Given the function healing(), is there a clean way to do this directly on df?
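
One possible approach to both questions, sketched under assumptions: using the same boolean mask with df.loc on the left-hand side of an assignment writes values into only the 'T-meter' rows. The healing function below is a made-up stand-in (it clamps each value to the highest timestamp seen so far), not the real healing(), and the sample data and the 'P-meter' source are invented.

```python
import pandas as pd

def make_healing():
    # Hypothetical stand-in for the real healing(): returns a stateful function
    # that clamps each value to the highest timestamp seen so far.
    highest = None
    def healing(ts):
        nonlocal highest
        if highest is None or ts > highest:
            highest = ts
        return highest
    return healing

# Invented sample data
df = pd.DataFrame({
    'Source': ['T-meter', 'P-meter', 'T-meter', 'T-meter', 'P-meter', 'T-meter'],
    'Timestamp': [100, 7, 5, 15, 8, 130],
})
cond = df['Source'] == 'T-meter'

# Question 1: copy healed values from a re-indexed subset back into df.
df_tmeter = df.loc[cond, :].reset_index(drop=True)
df_tmeter['healed'] = df_tmeter['Timestamp'].apply(make_healing())
df_q1 = df.copy()
# to_numpy() assigns by position, since the subset's index no longer matches df
df_q1.loc[cond, 'Timestamp'] = df_tmeter['healed'].to_numpy()

# Question 2: heal directly on the frame, touching only the masked rows.
df_q2 = df.copy()
df_q2.loc[cond, 'Timestamp'] = df_q2.loc[cond, 'Timestamp'].apply(make_healing())

print(df_q2)
```

If the subset keeps the original index (that is, without reset_index), the to_numpy() call is unnecessary because pandas aligns the assigned values on the index.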

Thanks!

Edit: I first thought I should be using 'views' into df, but other operations on the data would either generate errors or silently turn the views into copies.

I wrote a wrapper function heal_row() for healing():

def heal_row( row ):
    if row['Source'] == 'T-meter':   # Redundant check, but safe!
        row['Timestamp'] = healing(row['Timestamp'])
    return row

Then I did the following:

df = df.apply(lambda row: row if row['Source'] != 'T-meter' else heal_row(row), axis=1)

The ordering of the conditional is important: healing() is stateful, depending on the prior row(s) it has seen, so it can't be the default operation applied to every row.
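
Because the state lives inside the healing function, one way to keep it from leaking between sources is to create a fresh function per source. This is only a sketch under assumptions: make_healing() and its clamping rule are a hypothetical stand-in for the real healing(), and the data and the 'P-meter' source are invented.

```python
import pandas as pd

def make_healing():
    # Hypothetical stand-in for the real healing(): each call returns a new
    # stateful function with its own "highest timestamp seen so far".
    highest = None
    def healing(ts):
        nonlocal highest
        if highest is None or ts > highest:
            highest = ts
        return highest
    return healing

# Invented sample data
df = pd.DataFrame({
    'Source': ['T-meter', 'P-meter', 'T-meter', 'P-meter', 'T-meter'],
    'Timestamp': [100, 50, 5, 40, 130],
})

# transform() hands each source's timestamps to the lambda as a Series, and
# make_healing() gives every source its own independent state.
df['Timestamp'] = df.groupby('Source')['Timestamp'].transform(
    lambda s: s.apply(make_healing())
)
print(df)
```

This variant heals every source, so it only fits if the same correction is wanted everywhere; to heal 'T-meter' alone, a boolean mask on 'Source' is the narrower tool.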

