
Drop duplicates in a DataFrame where a column is identical and the timestamps are close

Currently I have the following DataFrame:

    index         timestamp      | id_a | id_b | id_pair
   --------------------------------------------------------
     0       2020-01-01 00:00:00 | 1    | A    |   1A
     1       2020-01-01 00:01:30 | 1    | A    |   1A
     2       2020-01-01 00:02:30 | 1    | A    |   1A
     3       2020-01-01 00:07:30 | 1    | A    |   1A
     4       2020-01-01 00:00:00 | 2    | B    |   2B
     5       2000-01-01 00:00:00 | 3    | C    |   3C
     6       2000-01-01 00:00:00 | 4    | D    |   4D

From this DataFrame, I want to drop the rows that have the same id_pair and whose timestamps fall within X minutes of each other, say 5 minutes. The expected result is therefore:

    index         timestamp      | id_a | id_b | id_pair
   --------------------------------------------------------
     0       2020-01-01 00:00:00 | 1    | A    |   1A
     3       2020-01-01 00:07:30 | 1    | A    |   1A
     4       2020-01-01 00:00:00 | 2    | B    |   2B
     5       2000-01-01 00:00:00 | 3    | C    |   3C
     6       2000-01-01 00:00:00 | 4    | D    |   4D

After searching Stack Overflow, I came across this question, which describes a problem similar to mine:
Drop Duplicates in a DataFrame if Timestamps are Close, but not Identical



I adapted the code from that question to fit my needs (it is pretty much the same), and it looks like this:

mask1 = df.groupby('id_pair').timestamp.apply(lambda x: x.diff().dt.seconds < 300)
mask2 = df.unique_contact.duplicated(keep=False) & (mask1 | mask1.shift(-1))
df[~mask2]

But when I run the code, I get this error:

TypeError: unsupported operand type(s) for -: 'str' and 'str'

Any help or advice would be appreciated.
Thanks in advance.



Python version: 3.6.12
Pandas version: 0.25.3

First convert the timestamp column to datetimes, since the error comes from subtracting strings. Then, to get the expected output, remove | mask1.shift(-1) from mask2:

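import pandas as pd

# Parse the string timestamps so that diff() can subtract them
df['timestamp'] = pd.to_datetime(df['timestamp'])
# True where a row follows the previous row of the same id_pair by less than 300 seconds
mask1 = df.groupby('id_pair').timestamp.apply(lambda x: x.diff().dt.seconds < 300)
# Only rows whose id_pair occurs more than once can be duplicates
mask2 = df.id_pair.duplicated(keep=False) & mask1
df = df[~mask2]
print(df)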
   index  timestamp  id_a id_b id_pair
0      0 2020-01-01     1    A      1A
2      2 2020-01-01     2    B      2B
3      3 2000-01-01     3    C      3C
4      4 2000-01-01     4    D      4D
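
For reference, here is a minimal self-contained sketch of the same idea, run on the sample data from the question. It makes a few assumptions that are not in the original answer: the rows are already sorted by timestamp within each id_pair, the name window is chosen here for illustration, and the gap is compared against a pd.Timedelta instead of using .dt.seconds. The last choice is deliberate, because .dt.seconds holds only the seconds component of each difference and ignores whole days, so a gap of one day and one minute would look like 60 seconds.

```python
import pandas as pd

# Sample data copied from the question; timestamps start out as strings.
df = pd.DataFrame({
    "timestamp": [
        "2020-01-01 00:00:00", "2020-01-01 00:01:30", "2020-01-01 00:02:30",
        "2020-01-01 00:07:30", "2020-01-01 00:00:00", "2000-01-01 00:00:00",
        "2000-01-01 00:00:00",
    ],
    "id_a": [1, 1, 1, 1, 2, 3, 4],
    "id_b": ["A", "A", "A", "A", "B", "C", "D"],
})
df["id_pair"] = df["id_a"].astype(str) + df["id_b"]

# Strings cannot be subtracted, so convert before calling diff().
df["timestamp"] = pd.to_datetime(df["timestamp"])

# Assumed threshold: 5 minutes, as in the question.
window = pd.Timedelta(minutes=5)

# Gap to the previous row within each id_pair (NaT for the first row of a group).
gap = df.groupby("id_pair")["timestamp"].diff()

# NaT < window evaluates to False, so the first row of each group is always kept.
too_close = gap < window

result = df[~too_close]
print(result)
```

On this data the rows with index 0, 3, 4, 5 and 6 remain, which matches the expected result in the question. Row 3 survives because its gap to row 2 is exactly 5 minutes, which is not strictly less than the threshold. Note that each row is compared with the previous row, not with the last row that was kept, so a long chain of closely spaced rows collapses to its first row.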
