
How to modify duplicated rows in Python pandas

Let's say I have a DataFrame (sorted by some priority criterion) with a "name" column. A few names are duplicated, and I want to append a simple indicator to the duplicates.

E.g.,

'jones a'
... 
'jones a'    # this should become 'jones a2'

To get the subset of duplicates, I could do

# keep='last' is the current spelling of the removed take_last=True flag
df.loc[df.duplicated(subset=['name'], keep='last'), 'name']

However, I think the apply function does not allow for in-place modification, right? So what I basically ended up doing is:

df.loc[df.duplicated(subset=['name'], keep='last'), 'name'] = \
df.loc[df.duplicated(subset=['name'], keep='last'), 'name'].apply(lambda x: x+'2')

But my feeling is that there might be a better way. Any ideas or tips? I would really appreciate your feedback!
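
For reference, a minimal sketch of the same assignment without apply, assuming a recent pandas version in which the `keep` argument has replaced `take_last`. The sample frame is hypothetical and not taken from the question. Note that `keep='first'` flags every occurrence after the first, which matches the 'jones a2' example above, whereas `keep='last'` (the equivalent of `take_last=True`) flags the earlier occurrences instead.

```python
import pandas as pd

# Hypothetical sample data, not from the original question.
df = pd.DataFrame({'name': ['jones a', 'smith b', 'jones a']})

# keep='first' leaves the first occurrence unflagged and marks later repeats.
mask = df.duplicated(subset=['name'], keep='first')

# Series + str concatenates element-wise, so no apply is needed.
df.loc[mask, 'name'] = df.loc[mask, 'name'] + '2'

print(df['name'].tolist())
```

Computing the mask once also avoids calling `duplicated` twice on the same column.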

Here is one way:

import numpy as np
import pandas

# sample data
d = pandas.DataFrame(
    {'Name': ['bob', 'bob', 'bob', 'bill', 'fred', 'fred', 'joe', 'larry'],
     'ShoeSize': [8, 9, 10, 12, 14, 11, 10, 12]
    }
)

>>> d.groupby('Name').Name.apply(lambda n: n + (np.arange(len(n))+1).astype(str))
0      bob1
1      bob2
2      bob3
3     bill1
4     fred1
5     fred2
6      joe1
7    larry1

This appends an indicator to every name. If you want to append the indicator only to the occurrences after the first, you can do it with a little special-casing:

>>> d.groupby('Name').Name.apply(lambda n: n + np.concatenate(([''], (np.arange(len(n))+1).astype(str)[1:])))
0      bob
1     bob2
2     bob3
3     bill
4     fred
5    fred2
6      joe
7    larry
dtype: object
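
An alternative sketch, not part of the approach above: `groupby(...).cumcount()` numbers the rows within each group and returns a Series on the original index, so the suffix can be built without `apply`. This may be useful on newer pandas releases, where `groupby(...).apply` may return the group key as an extra index level. The names `pos`, `suffix`, and `renamed` are chosen here for illustration only.

```python
import pandas as pd

d = pd.DataFrame(
    {'Name': ['bob', 'bob', 'bob', 'bill', 'fred', 'fred', 'joe', 'larry']}
)

# 0-based position of each row within its Name group, aligned to d's index.
pos = d.groupby('Name').cumcount()

# Suffix is '' for the first occurrence and '2', '3', ... for later ones.
suffix = (pos + 1).astype(str).where(pos > 0, '')

renamed = d['Name'] + suffix
print(renamed.tolist())
```

Assigning `renamed` back to `d['Name']` would replace the original names in the same way as the expression above.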

If you want to use this to replace the original names, just do d.Name = ... where ... is the expression shown above.

You should think about why you're doing this. It is usually better to keep this sort of information in a separate column than to smash it into a string.
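
A small sketch of the separate-column idea, assuming the same sample names. The column name 'Occurrence' is hypothetical and chosen only for this illustration.

```python
import pandas as pd

d = pd.DataFrame(
    {'Name': ['bob', 'bob', 'bob', 'bill', 'fred', 'fred', 'joe', 'larry']}
)

# 1-based occurrence number of each name, stored next to the untouched Name.
d['Occurrence'] = d.groupby('Name').cumcount() + 1

# The repeats can then be selected without parsing any strings.
repeats = d[d['Occurrence'] > 1]
print(repeats)
```

Keeping the names unchanged means later grouping or merging on Name still works, and the counter stays numeric.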
