
Standardize values in a data-frame column

I have a dataframe df which looks like:

id colour  response
 1   blue    curent 
 2    red   loaning
 3 yellow   current
 4  green      loan 
 5    red   currret
 6  green      loan

You can see that the values in the response column are not uniform, and I would like to snap them to a standardized set of responses.

I also have a validation list, validate, which looks like:

validate
 current
    loan
transfer

I would like to standardise the response column in df by matching the first three characters of each entry against the validate list.

So the eventual output would look like:

id colour  response
 1   blue   current
 2    red      loan
 3 yellow   current
 4  green      loan 
 5    red   current
 6  green      loan

I have tried to use fnmatch:

pattern = 'cur*'
fnmatch.filter(df, pattern) = 'current'

but I can't change the values in df.

If anyone could offer assistance, it would be appreciated.

Thanks

You could use map:

In [3664]: mapping = dict(zip(s.str[:3], s))

In [3665]: df.response.str[:3].map(mapping)
Out[3665]:
0    current
1       loan
2    current
3       loan
4    current
5       loan
Name: response, dtype: object

In [3666]: df['response2'] = df.response.str[:3].map(mapping)

In [3667]: df
Out[3667]:
   id  colour response response2
0   1    blue   curent   current
1   2     red  loaning      loan
2   3  yellow  current   current
3   4   green     loan      loan
4   5     red  currret   current
5   6   green     loan      loan

where s is a Series of validation values:

In [3650]: s
Out[3650]:
0     current
1        loan
2    transfer
Name: validate, dtype: object

Details

In [3652]: mapping
Out[3652]: {'cur': 'current', 'loa': 'loan', 'tra': 'transfer'}
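Putting the pieces together, here is a minimal self-contained sketch of the same prefix-mapping idea. The data is reconstructed from the question. Overwriting response in place and falling back to the original value when a prefix has no match are assumptions added here, not part of the answer above.

```python
import pandas as pd

# Sample data reconstructed from the question
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "colour": ["blue", "red", "yellow", "green", "red", "green"],
    "response": ["curent", "loaning", "current", "loan", "currret", "loan"],
})
s = pd.Series(["current", "loan", "transfer"], name="validate")

# Key each validation value by its first three characters
mapping = dict(zip(s.str[:3], s))

# Look up each response by its prefix. Prefixes missing from the mapping
# come back as NaN, so fall back to the original value for those rows.
original = df["response"]
df["response"] = original.str[:3].map(mapping).fillna(original)

print(df)
```

Note that this only works while the three-character prefixes in the validation list are unique; two entries sharing a prefix would silently overwrite each other in the dict.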

mapping can be a Series too, indexed by the three-character prefix:

In [3678]: pd.Series(s.values, index=s.str[:3].values)
Out[3678]:
cur     current
loa        loan
tra    transfer
dtype: object

Fuzzy match?

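from fuzzywuzzy import process

# val is presumably the DataFrame holding the validation list (column 'validate')
a = []
for x in df.response:
    # process.extract returns a list of matches, best first; keep the top match string
    a.append(process.extract(x, val.validate, limit=1)[0][0])
df['response2'] = a
df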
Out[867]: 
   id  colour response response2
0   1    blue   curent   current
1   2     red  loaning      loan
2   3  yellow  current   current
3   4   green     loan      loan
4   5     red  currret   current
5   6   green     loan      loan
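If installing fuzzywuzzy is not an option, a similar effect can be sketched with the standard library's difflib. This is a different technique from the answer above (it uses difflib's similarity ratio, not fuzzywuzzy's scorer). The helper name standardize, the 0.6 cutoff, and the fall-back-to-original behaviour are illustrative assumptions.

```python
import difflib

VALID = ["current", "loan", "transfer"]

def standardize(value, choices=VALID, cutoff=0.6):
    """Return the closest entry in choices, or value unchanged if none is close enough."""
    matches = difflib.get_close_matches(value.strip(), choices, n=1, cutoff=cutoff)
    return matches[0] if matches else value

responses = ["curent", "loaning", "current", "loan", "currret", "loan"]
cleaned = [standardize(r) for r in responses]
print(cleaned)
```

With a DataFrame, the same helper could be applied as df['response'].map(standardize). Raising the cutoff makes matching stricter, at the cost of leaving more values unchanged.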
