How do I check if a dataframe column contains a string from another dataframe column and return the adjacent cell in Python pandas?
I have 2 dataframes: one containing a column of strings (df = data) which I need to categorise, and the other containing possible categories and search terms (df = categories). I would like to add a column to the "data" dataframe which returns a category based on the search terms. For example:
data:
**RepairName**
A/C is not cold
flat tyre is c
the tyre needs a repair on left side
the aircon is not cold
categories:
**Category** **SearchTerm**
A/C aircon
A/C A/C
Tyre repair
Tyre flat
DESIRED RESULT data:
**RepairName** **Category**
A/C is not cold A/C
flat tyre is c Tyre
the tyre needs a repair on left side Tyre
the aircon is not cold A/C
I have tried the following lambda function with apply, and a list-comprehension variant. I am not sure if my column references are in the correct place:
data['Category'] = data['RepairName'].apply(lambda x: categories['Category'] if categories['SearchTerm'] in x else "")
data['Category'] = [categories['Category'] if categories['SearchTerm'] in data['RepairName'] else 0]
but I keep getting the error message:
TypeError: 'in <string>' requires string as left operand, not Series
The following returns True/False depending on whether any SearchTerm matches, but I have not been able to return the category associated with the search term:
data['containName']=data['RepairName'].str.contains('|'.join(categories['SearchTerm']),case=False)
Both of the following sometimes work, but not always (perhaps because some of my search terms are more than one word?):
data['Category'] = [
next((c for c, k in categories.values if k in s), None) for s in data['RepairName']]
d = dict(zip(categories['SearchTerm'], categories['Category']))
data['CategoryCheck'] = [next((d[y] for y in x.split() if y in d), None) for x in data['RepairName']]
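On the multi-word suspicion: the dictionary version splits each RepairName on whitespace and looks up single tokens, so a search term containing a space can never match, whereas the substring version can. A minimal sketch, assuming a hypothetical multi-word term "not cold" that is not in the original categories table:

```python
# Hypothetical lookup with one multi-word search term ("not cold").
d = {"not cold": "A/C", "flat": "Tyre"}
text = "the aircon is not cold"

# Token lookup: "not" and "cold" are checked separately, so nothing matches.
by_token = next((d[y] for y in text.split() if y in d), None)

# Substring lookup: the whole term is searched for inside the text.
by_substring = next((c for k, c in d.items() if k in text), None)

print(by_token)      # None
print(by_substring)  # A/C
```

Both versions are also case-sensitive, so "a/c" in the text would not match a search term written as "A/C".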
We can use str.findall, then map (here df is your data dataframe and cat is your categories dataframe):
s=df.RepairName.str.findall('|'.join(cat.SearchTerm.tolist())).str[0].\
map(cat.set_index('SearchTerm').Category)
0 A/C
1 Tyre
2 Tyre
3 A/C
Name: RepairName, dtype: object
df['Category']=s
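For reference, the same findall-then-map idea can be sketched as a self-contained script. This is only an illustration built on the sample data from the question; the `re.escape` call, the `re.IGNORECASE` flag and the lower-cased `lookup` Series are additions of this sketch (assumptions, not part of the snippet above), meant to guard against regex metacharacters and case differences in the search terms.

```python
import re
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({"RepairName": [
    "A/C is not cold",
    "flat tyre is c",
    "the tyre needs a repair on left side",
    "the aircon is not cold",
]})
cat = pd.DataFrame({
    "Category": ["A/C", "A/C", "Tyre", "Tyre"],
    "SearchTerm": ["aircon", "A/C", "repair", "flat"],
})

# Escape each term so characters such as "(" or "." are matched literally.
pattern = "|".join(re.escape(term) for term in cat["SearchTerm"])

# Lookup Series keyed by lower-cased search term (terms must be unique).
lookup = (cat.assign(SearchTerm=cat["SearchTerm"].str.lower())
             .set_index("SearchTerm")["Category"])

# Take the first matching term per row, ignoring case, then map it to its category.
first_hit = df["RepairName"].str.findall(pattern, flags=re.IGNORECASE).str[0]
df["Category"] = first_hit.str.lower().map(lookup)

print(df)
```

Rows with no matching term end up with a missing value in Category, because `.str[0]` on an empty list yields NaN.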
This worked once I had ensured all my columns were lower case (I also removed hyphens and brackets for good measure):
print("All lowercase")
data = data.apply(lambda x: x.astype(str).str.lower())
categories = categories.apply(lambda x: x.astype(str).str.lower())
print("Remove double spacing")
data = data.replace(r'\s+', ' ', regex=True)
print('Remove hyphens')
data["RepairName"] = data["RepairName"].str.replace('-', '', regex=False)
print('Remove brackets')
data["RepairName"] = data["RepairName"].str.replace('(', '', regex=False)
data["RepairName"] = data["RepairName"].str.replace(')', '', regex=False)
data['Category'] = [
next((c for c, k in categories.values if k in s), None) for s in data['RepairName']]
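The clean-up steps above can be condensed into one helper, sketched below. The `normalise` function and the messy sample rows are hypothetical, made up for illustration; unlike the code above, this version only lower-cases the search terms and leaves the Category labels in their original case.

```python
import pandas as pd

# Invented sample rows with mixed case, double spaces, hyphens and brackets.
data = pd.DataFrame({"RepairName": [
    "A/C is  NOT cold",
    "Flat tyre (left)",
    "the air-con is not cold",
]})
categories = pd.DataFrame({
    "Category": ["A/C", "A/C", "Tyre", "Tyre"],
    "SearchTerm": ["aircon", "A/C", "repair", "flat"],
})

def normalise(s):
    """Lower-case, collapse whitespace runs, and drop hyphens and brackets."""
    return (s.astype(str).str.lower()
             .str.replace(r"\s+", " ", regex=True)
             .str.replace(r"[-()]", "", regex=True))

clean = normalise(data["RepairName"])

# Pair each category label with its lower-cased search term.
pairs = list(zip(categories["Category"], categories["SearchTerm"].str.lower()))

# First search term found as a substring wins; None if nothing matches.
data["Category"] = [next((c for c, k in pairs if k in s), None) for s in clean]

print(data)
```

Because the first matching term wins, the order of rows in categories decides ties when a RepairName contains terms from more than one category.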