Python: String matching on a pandas column of lists
What is the best way to do string matching on a column of lists?
For example, I have a dataset:
import numpy as np
import pandas as pd
list_items = ['apple', 'grapple', 'tackle', 'satchel', 'snapple']
df = pd.DataFrame({'id': range(3), 'L': [np.random.choice(list_items, 3).tolist() for _ in range(3)]})
df
L id
0 [tackle, apple, grapple] 0
1 [tackle, snapple, satchel] 1
2 [satchel, satchel, tackle] 2
And I want to return the rows where any item in L matches a string, e.g. 'grap' should return row 0, and 'sat' should return rows 1 and 2.
Let's use this:
np.random.seed(123)
list_items = ['apple', 'grapple', 'tackle', 'satchel', 'snapple']
df = pd.DataFrame({'id':range(3), 'L':[np.random.choice(list_items, 3).tolist() for _ in range(3)]})
df
L id
0 [tackle, snapple, tackle] 0
1 [grapple, satchel, tackle] 1
2 [satchel, grapple, grapple] 2
Use any and apply:
df[df.L.apply(lambda x: any('grap' in s for s in x))]
Output:
L id
1 [grapple, satchel, tackle] 1
2 [satchel, grapple, grapple] 2
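For a check that does not depend on random draws, the same any/apply filter can be sketched on a hand-written frame. The helper name rows_matching and the fixed sample data are assumptions for illustration and are not part of the original answer.

```python
import pandas as pd

# Fixed sample data (hypothetical), so the result does not depend on a random seed.
df = pd.DataFrame({
    'id': range(3),
    'L': [['tackle', 'apple', 'grapple'],
          ['tackle', 'snapple', 'satchel'],
          ['satchel', 'satchel', 'tackle']],
})

def rows_matching(frame, substring, column='L'):
    """Return the rows where any string in the list column contains `substring`."""
    mask = frame[column].apply(lambda items: any(substring in s for s in items))
    return frame[mask]

grap_rows = rows_matching(df, 'grap')
sat_rows = rows_matching(df, 'sat')
print(grap_rows)
print(sat_rows)
```

With this data, 'grap' selects row 0 and 'sat' selects rows 1 and 2, which is the behaviour asked for in the question.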
%timeit df.L.apply(lambda x: any('grap' in s for s in x))
10000 loops, best of 3: 194 µs per loop
%timeit df.L.apply(lambda i: ','.join(i)).str.contains('grap')
1000 loops, best of 3: 481 µs per loop
%timeit df.L.str.join(', ').str.contains('grap')
1000 loops, best of 3: 529 µs per loop
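%timeit is an IPython magic, so the timings above cannot be reproduced in a plain script as written. A rough equivalent using the standard-library timeit module might look like the sketch below. The fixed sample frame and the candidates dictionary are assumptions for illustration, and absolute numbers will differ by machine and pandas version.

```python
import timeit
import pandas as pd

# Hypothetical fixed data standing in for the randomly generated frame.
df = pd.DataFrame({
    'id': range(3),
    'L': [['tackle', 'apple', 'grapple'],
          ['tackle', 'snapple', 'satchel'],
          ['satchel', 'satchel', 'tackle']],
})

# The three expressions timed above, wrapped as zero-argument callables.
candidates = {
    'apply + any': lambda: df.L.apply(lambda x: any('grap' in s for s in x)),
    'apply + join': lambda: df.L.apply(lambda i: ','.join(i)).str.contains('grap'),
    'str.join': lambda: df.L.str.join(', ').str.contains('grap'),
}

# Evaluate each once so the resulting masks can be compared.
masks = {name: fn().tolist() for name, fn in candidates.items()}

loops = 200
for name, fn in candidates.items():
    seconds = timeit.timeit(fn, number=loops)
    print(f'{name}: {seconds / loops * 1e6:.1f} microseconds per loop')
```

All three expressions should produce the same boolean mask on this data, so the comparison is purely about speed.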
df[df.L.apply(lambda i: ','.join(i)).str.contains('yourstring')]
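One caveat with the join approach: str.contains interprets its pattern as a regular expression by default, so a search string containing characters such as '.' or '(' may match more than intended. A minimal sketch of the difference, assuming a literal substring match is what is wanted (the sample data here is hypothetical):

```python
import pandas as pd

# Hypothetical data chosen so that the regex and literal interpretations differ.
df = pd.DataFrame({
    'id': range(3),
    'L': [['a.b', 'apple'], ['axb', 'tackle'], ['satchel', 'a.b']],
})

joined = df.L.str.join(',')

# Default (regex=True): '.' matches any character, so 'axb' also counts as a hit.
regex_mask = joined.str.contains('a.b')

# regex=False: 'a.b' is treated as a literal substring.
literal_mask = joined.str.contains('a.b', regex=False)

print(df[regex_mask])
print(df[literal_mask])
```

It may also be worth choosing a separator that cannot occur in the search string, since a pattern containing the separator could otherwise match across two adjacent list items.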