Standardize values in a data-frame column
I have a dataframe df which looks like:
id colour response
1 blue curent
2 red loaning
3 yellow current
4 green loan
5 red currret
6 green loan
You can see the values in the response column are not uniform, and I would like to get them to snap to a standardized set of responses.
I also have a validation list, validate, which looks like:
validate
current
loan
transfer
I would like to standardise the response column in the df by matching the first three characters of each entry against the validate list.
So the eventual output would look like:
id colour response
1 blue current
2 red loan
3 yellow current
4 green loan
5 red current
6 green loan
I have tried to use fnmatch:
pattern = 'cur*'
fnmatch.filter(df, pattern) = 'current'
but I can't change the values in the df.
If anyone could offer assistance, it would be appreciated.
Thanks
You could use map:
In [3664]: mapping = dict(zip(s.str[:3], s))
In [3665]: df.response.str[:3].map(mapping)
Out[3665]:
0 current
1 loan
2 current
3 loan
4 current
5 loan
Name: response, dtype: object
In [3666]: df['response2'] = df.response.str[:3].map(mapping)
In [3667]: df
Out[3667]:
id colour response response2
0 1 blue curent current
1 2 red loaning loan
2 3 yellow current current
3 4 green loan loan
4 5 red currret current
5 6 green loan loan
Where s is a Series of validation values:
In [3650]: s
Out[3650]:
0 current
1 loan
2 transfer
Name: validate, dtype: object
Details
In [3652]: mapping
Out[3652]: {'cur': 'current', 'loa': 'loan', 'tra': 'transfer'}
mapping can be a Series too:
In [3678]: pd.Series(s.values, index=s.str[:3].values)
Out[3678]:
cur current
loa loan
tra transfer
dtype: object
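For reference, here is a self-contained sketch of the prefix-mapping approach, built from the sample data in the question. The fillna fallback is an addition not present in the answer: it assumes that rows whose prefix has no match in the validation list should keep their original value rather than become NaN.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "colour": ["blue", "red", "yellow", "green", "red", "green"],
    "response": ["curent", "loaning", "current", "loan", "currret", "loan"],
})
s = pd.Series(["current", "loan", "transfer"], name="validate")

# Key each validation value by its first three characters.
mapping = dict(zip(s.str[:3], s))

# Look up each response by its prefix. Prefixes missing from the mapping
# come back as NaN, so fall back to the original value for those rows
# (an assumption; the answer above does not handle this case).
df["response"] = df["response"].str[:3].map(mapping).fillna(df["response"])
```

Note that if two validation values share the same first three characters, the dict keeps only the later one, so this approach relies on the prefixes being unique.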
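Fuzzy match?
from fuzzywuzzy import process

# val is assumed to be the dataframe that holds the validate column
a = []
for x in df.response:
    # process.extract returns a list of matches, best first; keep the text of the best one
    a.append(process.extract(x, val.validate, limit=1)[0][0])
df['response2'] = a
df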
Out[867]:
id colour response response2
0 1 blue curent current
1 2 red loaning loan
2 3 yellow current current
3 4 green loan loan
4 5 red currret current
5 6 green loan loan
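If installing fuzzywuzzy is not an option, a similar effect can probably be had with difflib from the standard library. The sketch below swaps in difflib.get_close_matches, which is a different similarity measure from the one fuzzywuzzy uses. The helper name standardize and the 0.6 cutoff (difflib's default) are illustrative choices, not part of the answer above.

```python
from difflib import get_close_matches

# Validation values from the question.
validate = ["current", "loan", "transfer"]

def standardize(value, choices=validate, cutoff=0.6):
    """Return the closest entry in choices, or value unchanged if none is close enough."""
    matches = get_close_matches(value, choices, n=1, cutoff=cutoff)
    return matches[0] if matches else value

responses = ["curent", "loaning", "current", "loan", "currret", "loan"]
standardized = [standardize(r) for r in responses]

# With a dataframe, the same function could be applied column-wise:
# df["response2"] = df["response"].map(standardize)
```

Unlike the three-character prefix lookup, this also tolerates typos within the first three characters, at the cost of occasionally leaving a value unchanged when nothing scores above the cutoff.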