Multiple column pandas vectorized string function?
Is there a way of querying a DataFrame for rows that contain a certain string in any column? Something like Series.str, except for a DataFrame? Here's what I have so far:
In [2]: s = "Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est"
In [3]: df = pd.DataFrame(np.array(s.split(' ')).reshape((-1, 4)), columns=['one', 'two', 'three', 'four'])
In [4]: df
Out[4]:
           one            two         three        four
0        Lorem          ipsum         dolor         sit
1        amet,    consectetur   adipisicing       elit,
2          sed             do       eiusmod      tempor
3   incididunt             ut        labore          et
4       dolore          magna       aliqua.          Ut
5         enim             ad         minim     veniam,
6         quis        nostrud  exercitation     ullamco
7      laboris           nisi            ut     aliquip
8           ex             ea       commodo  consequat.
9         Duis           aute         irure       dolor
10          in  reprehenderit            in   voluptate
11       velit           esse        cillum      dolore
12          eu         fugiat         nulla   pariatur.
13   Excepteur           sint      occaecat   cupidatat
14         non      proident,          sunt          in
15       culpa            qui       officia    deserunt
16      mollit           anim            id         est
[17 rows x 4 columns]
In [5]: mask = df['one'].str.contains('dolor') | df['two'].str.contains('dolor') | df['three'].str.contains('dolor') | df['four'].str.contains('dolor')
In [6]: df[mask]
Out[6]:
       one    two    three    four
0    Lorem  ipsum    dolor     sit
4   dolore  magna  aliqua.      Ut
9     Duis   aute    irure   dolor
11   velit   esse   cillum  dolore
[4 rows x 4 columns]
Ideally, I would like to replace the last two lines with something similar to this:
df[df.loc[:, 'one':'four'].str.contains('dolor')]
Is this possible?
You can use the vectorized operations of a np.char.array():
import numpy as np
a = np.char.array(df.values)
mask = a.find('dolor') != -1
df2 = df.iloc[np.any(mask, axis=1)]
and the content of df2 will be:
       one    two    three    four
0    Lorem  ipsum    dolor     sit
4   dolore  magna  aliqua.      Ut
9     Duis   aute    irure   dolor
11   velit   esse   cillum  dolore
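For readers who prefer the function form, here is a minimal sketch of the same idea using np.char.find. The three-row frame is made up for illustration, and the astype(str) cast is an assumption added so the call accepts the object-dtype values; note that the cast would also turn missing values into the literal text 'nan'.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame standing in for the question's df.
df = pd.DataFrame(
    [["Lorem", "ipsum", "dolor", "sit"],
     ["sed", "do", "eiusmod", "tempor"],
     ["velit", "esse", "cillum", "dolore"]],
    columns=["one", "two", "three", "four"],
)

# Cast to a fixed-width unicode array so the np.char functions accept it.
values = df.to_numpy().astype(str)

# np.char.find gives the match position for each cell, or -1 when absent.
mask = np.char.find(values, "dolor") != -1

# Keep the rows in which at least one cell matched.
df2 = df[mask.any(axis=1)]
```

Here mask has one boolean per cell, and any(axis=1) collapses it to one boolean per row.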
Pandas does not have DataFrame.str methods (at least not yet). However, you could use
import numpy as np
mask = np.logical_or.reduce(
[df[col].str.contains('dolor')
for col in df.loc[:, 'one':'four'].columns])
This is a little less writing, and a bit quicker than
mask = df['one'].str.contains('dolor') | df['two'].str.contains('dolor') | df['three'].str.contains('dolor') | df['four'].str.contains('dolor')
In [29]: %timeit mask = np.logical_or.reduce([df[col].str.contains('dolor') for col in df.loc[:, 'one':'four'].columns]); df[mask]
1000 loops, best of 3: 761 µs per loop
In [30]: %timeit mask = df['one'].str.contains('dolor') | df['two'].str.contains('dolor') | df['three'].str.contains('dolor') | df['four'].str.contains('dolor'); df[mask]
1000 loops, best of 3: 1.13 ms per loop
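The list-comprehension approach can be wrapped in a small helper. This is only a sketch: rows_containing is a hypothetical name, the sample frame is made up, and regex=False and na=False are assumptions added here so that the search text is treated literally and missing values count as non-matches.

```python
import numpy as np
import pandas as pd

def rows_containing(df, text, columns=None):
    """Return the rows of df where any chosen column contains text."""
    cols = df.columns if columns is None else columns
    # One boolean Series per column; na=False treats missing cells as no match.
    masks = [df[col].str.contains(text, regex=False, na=False) for col in cols]
    # OR the per-column masks together into a single row mask.
    return df[np.logical_or.reduce(masks)]

# Small hypothetical frame standing in for the question's df.
df = pd.DataFrame(
    [["Lorem", "ipsum", "dolor", "sit"],
     ["sed", "do", "eiusmod", "tempor"],
     ["velit", "esse", "cillum", "dolore"]],
    columns=["one", "two", "three", "four"],
)

hits = rows_containing(df, "dolor")
```

Passing columns=['one', 'two'] would restrict the search to those columns, which in this sample frame matches nothing.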
This tells you whether there is a 'dolor' in any of the columns:
df.loc[:, 'one':'four'].apply(lambda x: x.str.contains('dolor'), axis=1)
It gives a True/False value for each column in every row.
If you combine this with another apply, you get a single True/False value per row:
df.loc[:, 'one':'four'].apply(lambda x: x.str.contains('dolor'), axis=1).apply(lambda x: True in x.values, axis=1)
Using this as the row mask gives your result:
df[df.loc[:, 'one':'four'].apply(lambda x: x.str.contains('dolor'), axis=1).apply(lambda x: True in x.values, axis=1)]
However, this is about 3-4 times slower :( than unutbu's solution.
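A possible alternative, not taken from the answers above, is to call str.contains once per column (apply with its default axis=0) rather than once per row, and then collapse the result with any(axis=1). The sketch below uses a made-up three-row frame; regex=False and na=False are assumptions for literal matching and for treating missing values as non-matches. No timing is claimed for it.

```python
import pandas as pd

# Small hypothetical frame standing in for the question's df.
df = pd.DataFrame(
    [["Lorem", "ipsum", "dolor", "sit"],
     ["sed", "do", "eiusmod", "tempor"],
     ["velit", "esse", "cillum", "dolore"]],
    columns=["one", "two", "three", "four"],
)

subset = df.loc[:, "one":"four"]

# apply() with the default axis=0 hands each column to the lambda as a Series.
cell_hits = subset.apply(
    lambda col: col.str.contains("dolor", regex=False, na=False)
)

# True for rows where at least one cell matched.
mask = cell_hits.any(axis=1)
result = df[mask]
```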