
Finding exact word in a column of strings from a list of words having spaces in between

I want to create a new column containing 1 if any of the entries in a list is matched exactly within the dataframe's string column, and 0 otherwise.

Each list entry can consist of several words separated by spaces, so I cannot use str.split() for an exact match.

list_provided=["mul the","a b c"]
#how my dataframe looks
id  text
a    simultaneous there the
b    simultaneous there
c    mul why the
d    mul the
e    simul a b c
f    a c b

Expected Output

id  text                      found
a    simultaneous there the    0
b    simultaneous there        0
c    mul why the               0
d    mul the                   1
e    simul a b c               1 
f    a c b                     0

The order of the words within each list element also matters.

Code tried so far

import numpy as np
import pandas as pd

data=pd.DataFrame({"id":("a","b","c","d","e","f"), "text":("simultaneous there the","simultaneous there","mul why the","mul the","simul a b c","a c b")})
list_of_word=["mul the","a b c"]
pattern = '|'.join(list_of_word)
# x.split() yields single words, so a multi-word entry such as "mul the" never equals any of them
data['found'] = data['text'].apply(lambda x: sum(i in list_of_word for i in x.split()))
data['found']=np.where(data['found']>0,1,0)
data
###Output generated###
id  text                   found
a   simultaneous there the  0
b   simultaneous there      0
c   mul why the             0
d   mul the                 0
e   simul a b c             0
f   a c b                   0

How can I obtain the expected output, searching for an exact match of the list entries against a dataframe string column when those entries contain spaces?

You were nearly there. You have done all the groundwork; all that's left is to call the right function, which in this case is str.contains.

data['found'] = data.text.str.contains(pattern).astype(int)
data

  id                    text  found
0  a  simultaneous there the      0
1  b      simultaneous there      0
2  c             mul why the      0
3  d                 mul the      1
4  e             simul a b c      1
5  f                   a c b      0

If your patterns themselves contain regex metacharacters such as the OR pipe (|), escape them first:

import re
pattern = '|'.join([re.escape(i) for i in list_of_word])
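To illustrate why escaping can matter, here is a minimal sketch. The phrase list and the sample frame are hypothetical and not taken from the question; the second phrase was chosen because it contains regex metacharacters (+ and parentheses).

```python
import re
import pandas as pd

# Hypothetical phrases: the second one contains regex metacharacters.
phrases = ["mul the", "a+b (c)"]
df = pd.DataFrame({"text": ["mul the", "simul a+b (c)", "aab c", "mul why the"]})

# Unescaped, "a+b (c)" is read as a regex and also matches "aab c".
unescaped_hit = bool(re.search(phrases[1], "aab c"))

# re.escape makes every character of a phrase match literally.
pattern = "|".join(re.escape(p) for p in phrases)
df["found"] = df["text"].str.contains(pattern, regex=True).astype(int)
found = df["found"].tolist()
```

With escaping, only the rows that contain a phrase verbatim are flagged, so "aab c" is no longer a false positive.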

You can achieve this with str.contains, which also accepts a regular expression:

data['found'] = np.where(data['text'].str.contains(pattern),1,0)
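If the text column may contain missing values, which the question does not show and is assumed here only for illustration, str.contains returns a missing value for those rows. A small sketch using its na parameter to treat them as non-matches:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a missing entry in the text column.
df = pd.DataFrame({"text": ["mul the", None, "a c b"]})
pattern = "mul the|a b c"

# na=False makes rows with missing text count as "not found".
mask = df["text"].str.contains(pattern, na=False)
df["found"] = np.where(mask, 1, 0)
found = df["found"].tolist()
```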
