How do I check a Pandas DataFrame column against itself?

Question

I have a Pandas DataFrame with two relevant columns. I need to check column a (a list of names) against itself, and if two or more values are similar enough to each other, sum the values in column b for those rows. To check similarity, I'm using the FuzzyWuzzy package, which accepts two strings and returns a score.

Data:

a            b   
apple        3 
orang        4 
aple         1  
orange       10  
banana       5

I want to be left with:

a       b
apple   4
orang   14
banana  5

I have tried the following line, but I keep getting a KeyError:

    df['b']=df.apply(lambda x: df.loc[fuzz.ratio(df.a,x.a)>=70,'b'].sum(), axis=1)

I also need to remove every row whose b value was added into another row.

Any thoughts on how to accomplish this?

Answer 1

Some parts here are best done with pandas, and some parts (e.g., a function applied to a Cartesian product) can be done without it.

Overall, you can do this with:

    import itertools
    from fuzzywuzzy import fuzz

    # Map each name to a similar name that sorts after it.
    alias = {l: r for l, r in itertools.product(df.a, df.a)
             if l < r and fuzz.ratio(l, r) > 70}

    >>> df.b.groupby(df.a.replace(alias)).sum()
    apple      4
    banana     5
    orange    14
    Name: b, dtype: int64

The line

    alias = {l: r for l, r in itertools.product(df.a, df.a)
             if l < r and fuzz.ratio(l, r) > 70}

creates a dict, alias, that maps each name in a to a similar name that sorts after it (for this data, aple to apple and orang to orange).
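To see what alias holds for the sample data, here is a minimal sketch that runs without FuzzyWuzzy installed. The ratio helper is a hypothetical stand-in built on difflib.SequenceMatcher from the standard library. It is assumed to score these short names closely enough to fuzz.ratio for illustration, and it is not the library's own function.

```python
import itertools
from difflib import SequenceMatcher


def ratio(s1, s2):
    # Hypothetical stand-in for fuzz.ratio: similarity on a 0-100 scale.
    return int(round(100 * SequenceMatcher(None, s1, s2).ratio()))


# The names from column a of the question's data.
names = ["apple", "orang", "aple", "orange", "banana"]

# Keep each similar pair once (l < r), mapping the earlier-sorting
# name to the later-sorting one.
alias = {l: r for l, r in itertools.product(names, names)
         if l < r and ratio(l, r) > 70}

print(alias)
```

Because of the l < r filter, each pair is recorded once and the surviving name is whichever sorts later alphabetically, which is why orang becomes orange rather than the other way round. If one name is similar to several later-sorting names, the dict keeps only the last pair seen.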

The line

    df.b.groupby(df.a.replace(alias)).sum()

groups b by the names in a after applying alias, and then sums each group.
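The grouping step can be tried on its own with the question's data. In this sketch the alias dict is written out by hand, as an assumption about what the previous step produces for these five names, and the variable names canonical, totals and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "a": ["apple", "orang", "aple", "orange", "banana"],
    "b": [3, 4, 1, 10, 5],
})

# Assumed output of the alias step for this data.
alias = {"aple": "apple", "orang": "orange"}

# Replace each name with its alias, then sum b within each resulting name.
canonical = df["a"].replace(alias)
totals = df["b"].groupby(canonical).sum()

# Turn the grouped Series back into a two-column DataFrame.
result = totals.reset_index()
print(result)
```

Since groupby returns one row per group, the rows that were merged into another name no longer appear, so no separate deletion step is needed. Note that the surviving label here is orange, not orang as in the question's desired output.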

Answer 2

I would map and groupby:

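    import numpy as np
    from fuzzywuzzy import fuzz

    def get_similarity(df, ind, col):
        # Score every value in the column against the value at row `ind`.
        mapped = [fuzz.ratio(x, df[col].loc[ind]) for x in df[col]]
        cond = np.array(mapped) >= 70
        # Label the row with the first value that is similar enough.
        return df[col][cond].iloc[0]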

And use it like this:

    df.groupby(lambda x: get_similarity(df, x, 'a'))['b'].sum()
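For a version that runs end to end on the question's data, the sketch below swaps fuzz.ratio for a hypothetical ratio helper based on difflib.SequenceMatcher, so FuzzyWuzzy does not need to be installed. The helper is an assumption made for illustration only; the rest follows the function above.

```python
from difflib import SequenceMatcher

import numpy as np
import pandas as pd


def ratio(s1, s2):
    # Hypothetical stand-in for fuzz.ratio: similarity on a 0-100 scale.
    return int(round(100 * SequenceMatcher(None, s1, s2).ratio()))


def get_similarity(df, ind, col):
    # Score every value in the column against the value at row `ind`.
    mapped = [ratio(x, df[col].loc[ind]) for x in df[col]]
    cond = np.array(mapped) >= 70
    # Label the row with the first value that is similar enough.
    return df[col][cond].iloc[0]


df = pd.DataFrame({
    "a": ["apple", "orang", "aple", "orange", "banana"],
    "b": [3, 4, 1, 10, 5],
})

# groupby calls the function once per index label (0 to 4 here).
totals = df.groupby(lambda x: get_similarity(df, x, "a"))["b"].sum()
print(totals)
```

Each row takes the label of the first similar name in row order, so orang (row 1) is kept rather than orange (row 3), which matches the output the question asks for. Every row is compared with every other row, so the cost grows quadratically with the number of names.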
