Searching multiple substrings in a column of strings and return substring category
I have two DataFrames as follows:
import pandas as pd

df1 = pd.DataFrame({"id": ["01", "02", "03", "04", "05", "06"],
                    "string": ["This is a cat",
                               "That is a dog",
                               "Those are birds",
                               "These are bats",
                               "I drink coffee",
                               "I bought tea"]})
df2 = pd.DataFrame({"category": [1, 1, 2, 2, 3, 3],
                    "keywords": ["cat", "dog", "birds", "bats", "coffee", "tea"]})
My DataFrames look like this:
DF1:
id string
01 This is a cat
02 That is a dog
03 Those are birds
04 These are bats
05 I drink coffee
06 I bought tea
DF2:
category keywords
1 cat
1 dog
2 birds
2 bats
3 coffee
3 tea
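I want an output column on df1 that holds the category when at least one keyword from df2 is detected in the string, and None otherwise. The expected output is as follows.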
id string category
01 This is a cat 1
02 That is a dog 1
03 Those are birds 2
04 These are bats 2
05 I drink coffee 3
06 I bought tea 3
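I could loop over the keywords one by one and scan every string for each of them, but that is inefficient once the data grows. Could you suggest an improvement? Thanks.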
# Modified your data a bit.
df1 = pd.DataFrame({"id": ["01", "02", "03", "04", "05", "06", "07"],
                    "string": ["This is a cat",
                               "That is a dog",
                               "Those are birds",
                               "These are bats",
                               "I drink coffee",
                               "I bought tea",
                               "This won't match squat"]})
You can use a list comprehension that calls next with a default argument.
df1['category'] = [
next((c for c, k in df2.values if k in s), None) for s in df1['string']]
df1
id string category
0 01 This is a cat 1.0
1 02 That is a dog 1.0
2 03 Those are birds 2.0
3 04 These are bats 2.0
4 05 I drink coffee 3.0
5 06 I bought tea 3.0
6 07 This won't match squat NaN
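You cannot avoid the O(N^2) complexity, but this should be quite efficient, because the inner loop stops at the first matching keyword and so does not always have to go through every keyword (except in the worst case, when nothing matches).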
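To illustrate why the inner loop often stops early, here is a small sketch in plain Python. The helper `contains` and the list `checked` are hypothetical names, added only to record which keywords get examined; they are not part of the answer's code.

```python
# (category, keyword) pairs, in the same order as the rows of df2.
pairs = [(1, "cat"), (1, "dog"), (2, "birds"), (2, "bats")]

checked = []  # hypothetical log of the keywords that were actually tested


def contains(keyword, text):
    """Record the keyword, then do a plain substring test."""
    checked.append(keyword)
    return keyword in text


# next() pulls items from the generator only until the first match,
# and falls back to None if the generator is exhausted.
result = next((c for c, k in pairs if contains(k, "That is a dog")), None)

print(result)   # category of the first matching keyword
print(checked)  # keywords examined before the search stopped
```

Here only "cat" and "dog" are tested; "birds" and "bats" are never examined because the generator is abandoned after the first hit.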
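Note that the list comprehension above currently supports only plain substring matching (not regex-based matching, although that is possible with some modification).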
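One possible regex-based variant is sketched below. It is an assumption about what such a modification could look like, not the answerer's own code: the names `kw_to_cat`, `pattern` and `matched` are made up for illustration, and the data is a shortened version of the question's frames.

```python
import re

import pandas as pd

df1 = pd.DataFrame({"id": ["01", "02", "03"],
                    "string": ["This is a cat",
                               "I bought tea",
                               "This won't match squat"]})
df2 = pd.DataFrame({"category": [1, 3], "keywords": ["cat", "tea"]})

# Map each keyword to its category.
kw_to_cat = dict(zip(df2["keywords"], df2["category"]))

# One alternation pattern with a single capture group.
# re.escape keeps keywords containing regex metacharacters literal.
pattern = "(" + "|".join(re.escape(k) for k in kw_to_cat) + ")"

# str.extract returns the first match in each row, NaN when nothing matches.
matched = df1["string"].str.extract(pattern, expand=False)

# Translate the matched keyword into its category; NaN stays NaN.
df1["category"] = matched.map(kw_to_cat)
print(df1)
```

One difference in behavior: this picks the keyword that appears earliest in the string, whereas the list comprehension picks the first keyword in df2 order that occurs anywhere in the string. The two agree whenever a string contains at most one keyword.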
Use a list comprehension with split, matching each word against a dictionary created from df2:
d = dict(zip(df2['keywords'], df2['category']))
df1['cat'] = [next((d[y] for y in x.split() if y in d), None) for x in df1['string']]
print(df1)
id string cat
0 01 This is a cat 1.0
1 02 That is a dog 1.0
2 03 Those are birds 2.0
3 04 These are bats 2.0
4 05 I drink coffee 3.0
5 06 I bought thea NaN
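In the output above, the last row is NaN because the word "thea" is not a key in the dictionary. More generally, matching on whole words after split behaves differently from substring matching. The sketch below contrasts the two; the function names `by_token` and `by_substring` and the sample strings are hypothetical, chosen only to show the difference.

```python
import pandas as pd

df2 = pd.DataFrame({"category": [1, 3], "keywords": ["cat", "tea"]})
d = dict(zip(df2["keywords"], df2["category"]))


def by_token(s):
    """Whole-word lookup: one dictionary lookup per word in the string."""
    return next((d[w] for w in s.split() if w in d), None)


def by_substring(s):
    """Substring test: one scan of the string per keyword."""
    return next((c for k, c in d.items() if k in s), None)


samples = ["This is a cat", "Two cats", "A teapot"]
token_result = [by_token(s) for s in samples]
substring_result = [by_substring(s) for s in samples]

print(token_result)      # "cats" and "teapot" are not dictionary keys
print(substring_result)  # "cat" is inside "cats", "tea" inside "teapot"
```

The whole-word version does a constant-time dictionary lookup per word, so it scales better with many keywords, but it misses plurals, punctuation-attached words and multi-word keywords. Which behavior is wanted depends on the data.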
Another easy-to-understand solution is to map a function over df1['string']:
# create a dictionary with keyword->category pairs
cats = dict(zip(df2.keywords, df2.category))

def categorize(s):
    for cat in cats.keys():
        if cat in s:
            return cats[cat]
    # return 0 in case nothing is found
    return 0

df1['category'] = df1['string'].map(categorize)
print(df1)
id string category
0 01 This is a cat 1
1 02 That is a dog 1
2 03 Those are birds 2
3 04 These are bats 2
4 05 I drink coffee 3
5 06 I bought tea 3