
Find value in one column in another column with regex in pandas

I have a pandas DataFrame with two columns of strings. I want to identify all rows where the string in the first column (s1) appears within the string in the second column (s2).

So if my columns were:

abc    abcd*ef_gh
z1y    xxyyzz

I want to keep the first row, but not the second.

The only approach I can think of is to:

  1. iterate through dataframe rows
  2. apply df.str.contains() to s2 using the contents of s1 as the matching pattern

Is there a way to accomplish this that doesn't require iterating over the rows?

For simple (literal substring) matching only, this is probably doable in a vectorised way with NumPy's np.char string functions:

In [326]:

print(df)
    s1          s2
0  abc  abcd*ef_gh
1  z1y      xxyyzz
2  aaa   aaabbbsss
In [327]:

# df.ix was removed in pandas 1.0; df.loc accepts the same boolean mask
print(df.loc[np.char.find(df.s2.values.astype(str),
                          df.s1.values.astype(str)) >= 0,
             's1'])
0    abc
2    aaa
Name: s1, dtype: object
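
Note that np.char.find does plain substring matching. If s1 really should be treated as a regular expression (as the title suggests), one option is a list comprehension over the paired values with re.search. The following is only a sketch: the third row ("a+b" / "xaaab") is made up here to show where regex and literal matching differ, and the variable names are illustrative.

```python
import re
import pandas as pd

df = pd.DataFrame({
    "s1": ["abc", "z1y", "a+b"],          # third row is a hypothetical example
    "s2": ["abcd*ef_gh", "xxyyzz", "xaaab"],
})

# Treat each s1 value as a regex pattern and search for it in the paired s2 value.
regex_mask = [bool(re.search(pat, text)) for pat, text in zip(df["s1"], df["s2"])]

# Escape s1 first to fall back to literal substring matching.
literal_mask = [bool(re.search(re.escape(pat), text))
                for pat, text in zip(df["s1"], df["s2"])]

regex_rows = df[regex_mask]      # rows where s1, read as a regex, matches s2
literal_rows = df[literal_mask]  # rows where s1 appears verbatim in s2
```

With this data, "a+b" matches "xaaab" as a regex (one or more "a" followed by "b") but not as a literal string, so the two masks differ on the last row.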

The best I could come up with is to use apply instead of manual iterations:

>> df = pd.DataFrame({'x': ['abc', 'xyz'], 'y': ['1234', '12xyz34']})
>> df
     x        y
0  abc     1234
1  xyz  12xyz34

>> df.x[df.apply(lambda row: row.y.find(row.x) != -1, axis=1)]
1    xyz
Name: x, dtype: object
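
A variant of the same idea, sketched with a list comprehension and Python's `in` operator instead of apply. It still loops in Python, but it avoids building a row object for every row, which tends to be cheaper on larger frames (not benchmarked here). The names mask and matches are just illustrative.

```python
import pandas as pd

df = pd.DataFrame({"x": ["abc", "xyz"], "y": ["1234", "12xyz34"]})

# Pair up the two columns and test substring membership row by row.
mask = [x in y for x, y in zip(df["x"], df["y"])]

# Use the boolean list to select the matching values of column x.
matches = df.loc[mask, "x"]
print(matches)
```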


 