Update pandas column values if they are equal to another column's values

Question

I have a pandas dataframe containing account IDs and home, work, and mobile phone numbers. All of these values are strings. My goal is to update the values of each row such that all duplicate numbers, both within the same row and across different rows, are set to NaN, leaving one 'original' number. How can I accomplish this in an efficient way?

When updating values in the same row, priority is given to the home phone first and the work phone second. So if home == work == mobile, both work and mobile are updated to NaN. If home != work == mobile, then mobile is updated to NaN. When updating values in different rows, it does not matter which duplicate phone number is kept as the 'original' number. For example, if A['home'] == B['mobile'] == C['work'], two of those values should be set to NaN and the remaining one left unchanged. In the dataframes shown below, I have chosen to keep the first number and set the other duplicate numbers to NaN.

I've figured out how to update values within the same row using df.loc, but I've been unsuccessful in figuring out how to also set duplicate values to NaN across different rows and columns. How can I achieve this?

Below is further information on what I'm trying to do and where I'm getting stuck:

My initial dataframe looks something like this:

acct_id        home        work      mobile
      A  1111111111  1111111111  1111111111
      B  2222222222  2222222222  2222222222
      C  3333333333  3333333333  3333333333
      D  4444444444  5555555555  5555555555
      E  6666666666  7777777777  8888888888
      F  9999999999  9999999999  8888888888
      G  7777777777  6666666666  5555555555
      H  4444444444  3333333333  2222222222
      I         NaN         NaN         NaN

and my goal is to update the dataframe so that it looks like this:

acct_id        home        work      mobile
      A  1111111111         NaN         NaN
      B  2222222222         NaN         NaN
      C  3333333333         NaN         NaN
      D  4444444444  5555555555         NaN
      E  6666666666  7777777777  8888888888
      F  9999999999         NaN         NaN
      G         NaN         NaN         NaN
      H         NaN         NaN         NaN
      I         NaN         NaN         NaN

I'm currently approaching this as a two-step problem. Step 1 is removing duplicate numbers in the same row. Step 2 is removing duplicate numbers across different rows and different columns. I have figured out step 1, using the df.loc command:

df.loc[df['home'] == df['work'], ['work']] = np.nan
df.loc[df['home'] == df['mobile'], ['mobile']] = np.nan
df.loc[df['work'] == df['mobile'], ['mobile']] = np.nan

This is what my dataframe looks like after running the above commands:

acct_id        home        work      mobile
      A  1111111111         NaN         NaN
      B  2222222222         NaN         NaN
      C  3333333333         NaN         NaN
      D  4444444444  5555555555         NaN
      E  6666666666  7777777777  8888888888
      F  9999999999         NaN  8888888888
      G  7777777777  6666666666  5555555555
      H  4444444444  3333333333  2222222222
      I         NaN         NaN         NaN

However, I can't wrap my head around step 2. As a brute-force method, I have found that I can sort the dataframe on home and then loop through each row, checking whether the previous row's home value is the same as the current row's and setting the current row's value to NaN if it is. Lastly, I would have to repeat that process for the work and mobile columns. This is what the code for checking the home field would look like:

df.sort_values(by='home', inplace=True)
prev_row = {'home': None, 'work': None, 'mobile': None}
for cur_idx, cur_row in df.iterrows():
    if prev_row['home'] == cur_row['home']:
        # iterrows() yields copies, so write the change back through df
        df.at[cur_idx, 'home'] = np.nan
    prev_row = cur_row

After running the above code, which only checks and updates the home column, my dataframe will look like this:

acct_id        home        work      mobile
      A  1111111111         NaN         NaN
      B  2222222222         NaN         NaN
      C  3333333333         NaN         NaN
      D  4444444444  5555555555         NaN
      H         NaN  3333333333  2222222222
      E  6666666666  7777777777  8888888888
      G  7777777777  6666666666  5555555555
      F  9999999999         NaN  8888888888
      I         NaN         NaN         NaN

This solution is pretty hacky and not efficient for larger datasets, so how can I achieve this in a more efficient manner?

Any help is greatly appreciated -- thank you in advance!

Answer 1

This might address your step 2 needs. If not, feel free to go with another approach.

import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
        dict(acct_no="D", home="4444444444", work="5555555555"),
        dict(acct_no="E", home=np.nan, work="3333333333", mobile="2222222222"),
        dict(acct_no="J", home=np.nan, work=np.nan, mobile="8888888888"),
        dict(acct_no="K", home=np.nan, work="8888888888"),
        dict(acct_no="L", home=np.nan, work=np.nan, mobile="8888888888"),
    ]
)
df["phone"] = (df.home
               .combine_first(df.work)
               .combine_first(df.mobile))
df = (df.sort_values(by="phone")
      .drop_duplicates(subset="phone")
      .set_index("acct_no"))
print(df)

Output:

               home        work      mobile       phone
acct_no                                                
E               NaN  3333333333  2222222222  3333333333
D        4444444444  5555555555         NaN  4444444444
J               NaN         NaN  8888888888  8888888888

In this implementation we are only looking at the phone column, which is the preferred number for an account. That might be a bit more draconian than desired. Notice, for example, that accounts "K" and "L" were nuked entirely, on the basis of sharing a phone number with "J". If multiple customers share a Home land line, that might not be the desired business logic. Notice also that if "K" were to add a Home number of 7878787878, that account would survive, despite the duplicate 8888888888. If Mobile is "more unique" than Home, perhaps we should prefer that number.

Now that we have used the phone column to good advantage, feel free to .drop() it.

The sort costs O(N log N), and everything else is linear, so this should be a performant solution, even for large datasets.
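The two tweaks mentioned above (preferring Mobile over Home, and dropping the helper column afterwards) could be sketched roughly as follows. This is only an illustration on the same sample data: the Mobile, Work, Home preference order is an assumption about the business logic, and the names preferred and result are made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
        dict(acct_no="D", home="4444444444", work="5555555555"),
        dict(acct_no="E", home=np.nan, work="3333333333", mobile="2222222222"),
        dict(acct_no="J", home=np.nan, work=np.nan, mobile="8888888888"),
        dict(acct_no="K", home=np.nan, work="8888888888"),
        dict(acct_no="L", home=np.nan, work=np.nan, mobile="8888888888"),
    ]
)

# Assumed preference order: Mobile first, then Work, then Home.
preferred = df["mobile"].combine_first(df["work"]).combine_first(df["home"])

result = (
    df.assign(phone=preferred)
    # A stable sort keeps the input order among accounts sharing a number,
    # so which duplicate survives is deterministic.
    .sort_values(by="phone", kind="stable")
    .drop_duplicates(subset="phone")
    .drop(columns="phone")  # the helper column is no longer needed
    .set_index("acct_no")
)
print(result)
```

As before, accounts "K" and "L" are removed entirely because their preferred number matches that of "J"; only the choice of preferred number has changed.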

Answer 2

In my opinion, the simplest would be to stack the numbers into a Series, apply mask or drop_duplicates, and then restore the original shape:

out = (df.set_index('acct_id').stack()
         # the magic happens here
         .mask(lambda d: d.duplicated())
         # restore original format
         .unstack().reset_index()
       )

Alternative:

out = (df.set_index('acct_id')
         .stack().drop_duplicates().unstack()
         .reindex(df['acct_id']).reset_index().reindex(columns=df.columns)
      )

Output:

  acct_id        home        work      mobile
0       A  1111111111         NaN         NaN
1       B  2222222222         NaN         NaN
2       C  3333333333         NaN         NaN
3       D  4444444444  5555555555         NaN
4       E  6666666666  7777777777  8888888888
5       F  9999999999         NaN         NaN
6       G         NaN         NaN         NaN
7       H         NaN         NaN         NaN
8       I         NaN         NaN         NaN
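For anyone who wants to try the stack/mask idea end to end, here is a self-contained sketch that rebuilds the question's dataframe. It assumes the missing values are real NaN rather than the string 'NaN'. The extra reindex step is a defensive addition of mine, not part of the snippet above: depending on the pandas version, stack() may drop NaN entries, in which case an all-NaN account such as I would otherwise disappear from the result.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "acct_id": list("ABCDEFGHI"),
    "home": ["1111111111", "2222222222", "3333333333", "4444444444",
             "6666666666", "9999999999", "7777777777", "4444444444", np.nan],
    "work": ["1111111111", "2222222222", "3333333333", "5555555555",
             "7777777777", "9999999999", "6666666666", "3333333333", np.nan],
    "mobile": ["1111111111", "2222222222", "3333333333", "5555555555",
               "8888888888", "8888888888", "5555555555", "2222222222", np.nan],
})

phones = df.set_index("acct_id")

# Long Series ordered row by row: (A, home), (A, work), (A, mobile), (B, home), ...
stacked = phones.stack()

# Keep the first occurrence of each number and blank out every later one.
deduped = stacked.mask(stacked.duplicated())

out = (
    deduped.unstack()
    # Restore any account or column that may have been dropped along the way.
    .reindex(index=phones.index, columns=phones.columns)
    .reset_index()
)
print(out)
```

Because the stacked Series is ordered row by row, the first occurrence of a number is the one in the earliest row, and within a row home takes priority over work, which takes priority over mobile.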

Order: columns first

If you want, you can easily tweak the above to give a column-first preference when choosing which duplicates to keep:

out = (df.set_index('acct_id').unstack()
         .mask(lambda d: d.duplicated())
         .swaplevel().unstack().reset_index() # or: .unstack().T.reset_index()
      )

Output:

  acct_id        home        work      mobile
0       A  1111111111         NaN         NaN
1       B  2222222222         NaN         NaN
2       C  3333333333         NaN         NaN
3       D  4444444444  5555555555         NaN
4       E  6666666666         NaN  8888888888
5       F  9999999999         NaN         NaN
6       G  7777777777         NaN         NaN
7       H         NaN         NaN         NaN
8       I         NaN         NaN         NaN
