Update pandas column values if they are equal to another column's value
I have a pandas dataframe containing account IDs and home, work, and mobile phone numbers. All of these values are strings. My goal is to update each row so that every duplicate number, both within the same row and across different rows, is set to NaN, leaving one 'original' number. How can I accomplish this efficiently?
When updating values in the same row, priority is given to the home phone first and the work phone second. So if home == work == mobile, both work and mobile are updated to NaN. If home != work == mobile, then mobile is updated to NaN. When updating values in different rows, it does not matter which duplicate phone number is kept as the 'original' number. For example, if A['home'] == B['mobile'] == C['work'], two of those values should be set to NaN and the remaining one left unchanged. When displaying the dataframe, I have chosen to keep the first number and set the other duplicate numbers to NaN.
I've figured out how to update values within the same row using df.loc, but I haven't been able to figure out how to also set duplicate values to NaN across different rows and columns. How can I achieve this?
Below is further information on what I'm trying to do and where I'm getting stuck.
My initial dataframe looks something like this:
acct_id home work mobile
A 1111111111 1111111111 1111111111
B 2222222222 2222222222 2222222222
C 3333333333 3333333333 3333333333
D 4444444444 5555555555 5555555555
E 6666666666 7777777777 8888888888
F 9999999999 9999999999 8888888888
G 7777777777 6666666666 5555555555
H 4444444444 3333333333 2222222222
I NaN NaN NaN
and my goal is to update the dataframe so that it looks like this:
acct_id home work mobile
A 1111111111 NaN NaN
B 2222222222 NaN NaN
C 3333333333 NaN NaN
D 4444444444 5555555555 NaN
E 6666666666 7777777777 8888888888
F 9999999999 NaN NaN
G NaN NaN NaN
H NaN NaN NaN
I NaN NaN NaN
I'm currently approaching this as a two-step problem. Step 1 is removing duplicate numbers in the same row. Step 2 is removing duplicate numbers across different rows and different columns. I have figured out step 1 using df.loc:
df.loc[df['home'] == df['work'], ['work']] = np.nan
df.loc[df['home'] == df['mobile'], ['mobile']] = np.nan
df.loc[df['work'] == df['mobile'], ['mobile']] = np.nan
This is what my dataframe looks like after running the above commands:
acct_no home work mobile
A 1111111111 NaN NaN
B 2222222222 NaN NaN
C 3333333333 NaN NaN
D 4444444444 5555555555 NaN
E 6666666666 7777777777 8888888888
F 9999999999 NaN 8888888888
G 7777777777 6666666666 5555555555
H 4444444444 3333333333 2222222222
I NaN NaN NaN
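For anyone who wants to reproduce step 1 end to end, here is a minimal sketch. It assumes the sample values from the first table, that missing numbers are stored as np.nan, and that acct_id is an ordinary column; the DataFrame construction is an assumption and not part of the original post.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question's first table (assumed layout:
# acct_id is a regular column, phone numbers are strings, missing = np.nan).
df = pd.DataFrame(
    {
        "acct_id": list("ABCDEFGHI"),
        "home": ["1111111111", "2222222222", "3333333333", "4444444444",
                 "6666666666", "9999999999", "7777777777", "4444444444", np.nan],
        "work": ["1111111111", "2222222222", "3333333333", "5555555555",
                 "7777777777", "9999999999", "6666666666", "3333333333", np.nan],
        "mobile": ["1111111111", "2222222222", "3333333333", "5555555555",
                   "8888888888", "8888888888", "5555555555", "2222222222", np.nan],
    }
)

# Step 1: within a row, home takes priority over work, and work over mobile.
df.loc[df["home"] == df["work"], "work"] = np.nan
df.loc[df["home"] == df["mobile"], "mobile"] = np.nan
df.loc[df["work"] == df["mobile"], "mobile"] = np.nan

print(df)
```

Because NaN never compares equal to NaN, rows that are already empty (such as I) are left untouched by these three assignments.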
However, I can't wrap my head around step 2. As a brute-force method, I have found that I can sort the dataframe on home and then loop through each row, checking whether the previous row's home value is the same as the current row's, and setting the current row's value to NaN if it is. Lastly, I would have to repeat that process for the work and mobile columns. This is what the code for checking the home field would look like:
df.sort_values(by='home', inplace=True)
prev_row = {'home': None, 'work': None, 'mobile': None}
for cur_idx, cur_row in df.iterrows():
    if prev_row['home'] == cur_row['home']:
        # iterrows() yields copies, so write the change back through the dataframe
        df.at[cur_idx, 'home'] = np.nan
    prev_row = cur_row
After running the above code to check and update only the home column, my dataframe looks like this:
acct_no home work mobile
A 1111111111 NaN NaN
B 2222222222 NaN NaN
C 3333333333 NaN NaN
D 4444444444 5555555555 NaN
E NaN 3333333333 2222222222
F 6666666666 7777777777 8888888888
G 7777777777 6666666666 5555555555
H 9999999999 NaN 8888888888
I NaN NaN NaN
This solution is pretty hacky and inefficient for larger datasets, so how can I achieve this more efficiently?
Any help is greatly appreciated. Thank you in advance!
This might address your step 2 needs. If not, feel free to go with another approach.
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
        dict(acct_no="D", home="4444444444", work="5555555555"),
        dict(acct_no="E", home=np.nan, work="3333333333", mobile="2222222222"),
        dict(acct_no="J", home=np.nan, work=np.nan, mobile="8888888888"),
        dict(acct_no="K", home=np.nan, work="8888888888"),
        dict(acct_no="L", home=np.nan, work=np.nan, mobile="8888888888"),
    ]
)
df["phone"] = (df.home
                 .combine_first(df.work)
                 .combine_first(df.mobile))
df = (df.sort_values(by="phone")
        .drop_duplicates(subset="phone")
        .set_index("acct_no"))
print(df)
Output:
home work mobile phone
acct_no
E NaN 3333333333 2222222222 3333333333
D 4444444444 5555555555 NaN 4444444444
J NaN NaN 8888888888 8888888888
In this implementation we only look at the phone column, which is the preferred number for an account. That might be a bit more draconian than desired. Notice, for example, that accounts "K" and "L" were nuked entirely because they share a phone number with "J". If multiple customers share a home land line, that might not be the desired business logic. Notice also that if "K" were to add a home number of 7878787878, that account would survive despite the duplicate 8888888888. If mobile is "more unique" than home, perhaps we should prefer that number.
Now that we have used the phone column to good advantage, feel free to .drop() it.
The sort costs O(N log N) and everything else is linear, so this should perform well even on large datasets.
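As an illustration of the last two remarks, here is a sketch of the same idea with mobile preferred over home and with the helper column dropped at the end. The preference order (mobile, then home, then work), the stable sort, and the name deduped are assumptions for this example, not part of the answer above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
        dict(acct_no="D", home="4444444444", work="5555555555"),
        dict(acct_no="E", home=np.nan, work="3333333333", mobile="2222222222"),
        dict(acct_no="J", home=np.nan, work=np.nan, mobile="8888888888"),
        dict(acct_no="K", home=np.nan, work="8888888888"),
        dict(acct_no="L", home=np.nan, work=np.nan, mobile="8888888888"),
    ]
)

# Assumed preference order for this sketch: mobile, then home, then work.
phone = df["mobile"].combine_first(df["home"]).combine_first(df["work"])

deduped = (
    df.assign(phone=phone)
    # A stable sort keeps rows with equal numbers in their original order,
    # so the earliest account with a given number is the one that survives.
    .sort_values(by="phone", kind="mergesort")
    .drop_duplicates(subset="phone")
    .drop(columns="phone")  # the helper column is no longer needed
    .set_index("acct_no")
)
print(deduped)
```

With this sample the surviving accounts are still E, D, and J; only the number used to compare accounts changes.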
In my opinion, the simplest approach is to stack the numbers into a Series, mask the duplicates (or use drop_duplicates), and then restore the original shape:
out = (df.set_index('acct_id').stack()
         # set every repeat of an earlier value to NaN
         .mask(lambda d: d.duplicated())
         # restore original format
         .unstack().reset_index()
      )
Alternative:
out = (df.set_index('acct_id')
         .stack().drop_duplicates().unstack()
         .reindex(df['acct_id']).reset_index().reindex(columns=df.columns)
      )
Output:
acct_id home work mobile
0 A 1111111111 NaN NaN
1 B 2222222222 NaN NaN
2 C 3333333333 NaN NaN
3 D 4444444444 5555555555 NaN
4 E 6666666666 7777777777 8888888888
5 F 9999999999 NaN NaN
6 G NaN NaN NaN
7 H NaN NaN NaN
8 I NaN NaN NaN
If you want, you can easily tweak the above so that the duplicate to keep is chosen column by column (all home numbers first, then work, then mobile) instead of row by row:
out = (df.set_index('acct_id').unstack()
         .mask(lambda d: d.duplicated())
         .swaplevel().unstack().reset_index()  # or: .unstack().T.reset_index()
      )
Output:
acct_id home work mobile
0 A 1111111111 NaN NaN
1 B 2222222222 NaN NaN
2 C 3333333333 NaN NaN
3 D 4444444444 5555555555 NaN
4 E 6666666666 NaN 8888888888
5 F 9999999999 NaN NaN
6 G 7777777777 NaN NaN
7 H NaN NaN NaN
8 I NaN NaN NaN
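Since stack() has handled missing values differently across pandas versions, here is a sketch of the same row-first idea that flattens the values with NumPy instead of stack/unstack. It is a substitute technique rather than the code above, and the sample DataFrame, the cols list, and the variable names are assumptions for this example.

```python
import numpy as np
import pandas as pd

# Sample data from the question (assumed: strings, missing values as np.nan).
df = pd.DataFrame(
    {
        "acct_id": list("ABCDEFGHI"),
        "home": ["1111111111", "2222222222", "3333333333", "4444444444",
                 "6666666666", "9999999999", "7777777777", "4444444444", np.nan],
        "work": ["1111111111", "2222222222", "3333333333", "5555555555",
                 "7777777777", "9999999999", "6666666666", "3333333333", np.nan],
        "mobile": ["1111111111", "2222222222", "3333333333", "5555555555",
                   "8888888888", "8888888888", "5555555555", "2222222222", np.nan],
    }
)

cols = ["home", "work", "mobile"]
values = df[cols].to_numpy(dtype=object)

# Flatten row by row: A's home, work, mobile, then B's, and so on. The first
# occurrence of a number is therefore the earliest row and, within a row,
# home before work before mobile.
flat = pd.Series(values.ravel())

# duplicated() flags every repeat of an earlier value; repeated NaNs are
# flagged too, which is harmless because they are already NaN.
flat = flat.mask(flat.duplicated())

out = df.copy()
out[cols] = flat.to_numpy().reshape(values.shape)
print(out)
```

Passing order="F" to ravel() and reshape() would give the column-first variant instead.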