I want to compare word pairs in a pandas DataFrame
Names
['abc aa','bdc sc','abc aa','bdc sp','bdc sc','pp sc','bdc sc',]
['lp aa','bd sc','bdc sc','bd sc','lp aa','bd sc']
['nn aa','bb sc','bb sc','nn aa','bd sc']
I tried the following:
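def drop_dupli(text):
    seen = set()   # words already encountered
    result = []    # words kept, in first-seen order
    # note: split() iterates over whitespace-separated words, not whole pairs
    for item in text.split():
        if item not in seen:
            seen.add(item)
            result.append(item)
    return " ".join(result)

df['newame'] = df['Names'].apply(lambda x: drop_dupli(x))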
The result came out as follows:
Names
['abc aa','bdc sc','abc ','bdc sp','bdc ','pp sc','bdc ',]
['lp aa','bd sc','bdc sc','bd ','lp ','bd ']
['nn aa','bb sc','bb ','nn ','bd ']
But I want the result to be as follows:
Names
['abc aa','bdc sc','bdc sp','pp sc']
['lp aa','bd sc','bdc sc']
['nn aa','bb sc','bd sc']
Use the dict.fromkeys trick to remove duplicates while keeping the original order:
df['newame'] = df['Names'].apply(lambda x: list(dict.fromkeys(x)))
print (df)
Names \
0 [abc aa, bdc sc, abc aa, bdc sp, bdc sc, pp sc...
1 [lp aa, bd sc, bdc sc, bd sc, lp aa, bd sc]
2 [nn aa, bb sc, bb sc, nn aa, bd sc]
newame
0 [abc aa, bdc sc, bdc sp, pp sc]
1 [lp aa, bd sc, bdc sc]
2 [nn aa, bb sc, bd sc]
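The same idea can be checked without pandas. The following is only a minimal sketch: it assumes each cell of Names already holds a Python list of strings, and the helper name dedupe_keep_order is made up for illustration.

```python
def dedupe_keep_order(items):
    """Return items with duplicates removed, keeping the first occurrence of each."""
    # dict keys are unique, and dicts keep insertion order (guaranteed since Python 3.7)
    return list(dict.fromkeys(items))

# The three rows from the question, written as plain lists
rows = [
    ['abc aa', 'bdc sc', 'abc aa', 'bdc sp', 'bdc sc', 'pp sc', 'bdc sc'],
    ['lp aa', 'bd sc', 'bdc sc', 'bd sc', 'lp aa', 'bd sc'],
    ['nn aa', 'bb sc', 'bb sc', 'nn aa', 'bd sc'],
]

deduped = [dedupe_keep_order(row) for row in rows]
for row in deduped:
    print(row)
```

With a DataFrame, a function like this could be passed directly to df['Names'].apply in place of the lambda.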
This is preferable to set, because converting to a set does not preserve the original order:
df['newame'] = df['Names'].apply(lambda x: list(set(x)))
print (df)
Names \
0 [abc aa, bdc sc, abc aa, bdc sp, bdc sc, pp sc...
1 [lp aa, bd sc, bdc sc, bd sc, lp aa, bd sc]
2 [nn aa, bb sc, bb sc, nn aa, bd sc]
newame
0 [pp sc, bdc sp, bdc sc, abc aa]
1 [lp aa, bd sc, bdc sc]
2 [bb sc, nn aa, bd sc]
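For comparison, the loop from the question can be made to work on whole pairs by iterating over the list items instead of calling split(). This is a sketch under the assumption that each cell is a list of strings; drop_dupli_pairs is a hypothetical name, not part of the original post.

```python
def drop_dupli_pairs(pairs):
    """Remove duplicate pairs from a list, keeping first occurrences in order."""
    seen = set()   # pairs already encountered
    result = []    # pairs kept, in first-seen order
    for pair in pairs:   # iterate over whole 'word word' strings, no split()
        if pair not in seen:
            seen.add(pair)
            result.append(pair)
    return result

names = ['lp aa', 'bd sc', 'bdc sc', 'bd sc', 'lp aa', 'bd sc']

by_loop = drop_dupli_pairs(names)
by_dict = list(dict.fromkeys(names))   # same result, shorter
by_set = list(set(names))              # same elements, order not guaranteed

print(by_loop)
print(by_loop == by_dict)
print(sorted(by_set) == sorted(by_loop))
```

The loop and dict.fromkeys give the same ordered result; set contains the same pairs, but its iteration order should not be relied on.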