
Python: Check if a keyword occurs split across words in a string

I have two dataframes: one contains free-flowing text descriptions and the other is a master dictionary. I am trying to check whether the keywords in the master dictionary occur in the text description in any format. For example, if the master keyword is 123456789, it can appear in the user text as 12345 6789 or 123 456 789. A keyword can be numeric or alphanumeric.

I tried removing the spaces from the text description and checking with the in operator, but this approach also matches noise. For example, it will also match b123 4567 89klx. I want a match only when the whole keyword is split across multiple words, not when it is embedded inside other words.

Code I have now:

def matcher(x,word_dict):
    match=""
    for i in list(dict.fromkeys(word_dict)):
        if i.replace(" ", "").lower() in x.replace(" ", "").lower():
            if(match==""):
                match=i
            else:
                match=match+"_"+i
    return match


import pandas as pd
df = pd.DataFrame({'ID' : ['1', '2', '3', '4','5'], 
        'Text' : ['sample 123 45 678 text','sample as123456 text','sample As123 456','sample bas123456 text','sample bas123 456ts text']}, 
                  columns = ['ID','Text'])

master_dict= pd.DataFrame({'Keyword' : ['12345678','as123456']}, 
                  columns = ['Keyword'])

df['Match']=df['Text'].apply(lambda x: matcher(x,master_dict.Keyword))


Expected Output
    ID  Text                     Match
0   1   sample 123 45 678 text   12345678
1   2   sample as123456 text     as123456
2   3   sample As123 456         as123456
3   4   sample bas123456 text    NA
4   5   sample bas123 456ts text NA

Any leads will be helpful. Thanks in advance.

You can use a Pandas adaptation of my previous solution:

import pandas as pd
import numpy as np
import re

df = pd.DataFrame({'ID' : ['1', '2', '3', '4','5'], 
        'Text' : ['sample 123 45 678 text','sample as123456 text','sample As123 456','sample bas123456 text','sample bas123 456ts text']}, 
        columns = ['ID','Text'])
master_dict= pd.DataFrame({'Keyword' : ['12345678','as123456']}, 
                  columns = ['Keyword'])

words = master_dict['Keyword'].to_list()
words_dict = { f'g{i}':item for i,item in enumerate(words) } 
# Allow any run of non-alphanumeric chars between the (escaped) characters of each keyword
rx = re.compile(r"(?i)\b(?:" + '|'.join([ r'(?P<g{}>{})'.format(i, r"[\W_]*".join(map(re.escape, item))) for i,item in enumerate(words)]) + r")\b")
print(rx.pattern)

def findvalues(x):
    m = rx.search(x)
    if m:
        return [words_dict.get(key) for key,value in m.groupdict().items() if value][0]
    else:
        return np.nan

df['Match'] = df['Text'].apply(findvalues)

The pattern is

(?i)\b(?:(?P<g0>1[\W_]*2[\W_]*3[\W_]*4[\W_]*5[\W_]*6[\W_]*7[\W_]*8)|(?P<g1>a[\W_]*s[\W_]*1[\W_]*2[\W_]*3[\W_]*4[\W_]*5[\W_]*6))\b

See the regex demo. Basically, it is a \b(?:keyword1|keyword2|...|keywordN)\b regex with [\W_]* (which matches zero or more non-alphanumeric characters) between every pair of adjacent characters. Because of the \b word boundaries, the keywords are only matched as whole words. This works for your keywords because you confirmed they are numeric or alphanumeric, so each one starts and ends with a word character.
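To make the effect of [\W_]* and \b easier to see, here is a minimal sketch that builds the same kind of pattern for a single keyword. The helper name build_pattern and the sample strings are made up for illustration and are not part of the answer's code; re.IGNORECASE is used in place of the inline (?i) flag.

```python
import re

def build_pattern(keyword):
    # Escape each character, then allow any run of non-alphanumeric
    # characters (spaces, punctuation, underscores) between them.
    body = r"[\W_]*".join(re.escape(ch) for ch in keyword)
    # \b on both sides rejects keywords embedded inside longer words.
    return re.compile(r"\b" + body + r"\b", flags=re.IGNORECASE)

rx_single = build_pattern("as123456")

for text in ["sample As123 456",
             "sample as-123_456 text",
             "sample bas123 456ts text"]:
    print(text, "->", bool(rx_single.search(text)))
```

The first two strings match because only separators sit between the keyword's characters. The third does not, because the keyword is preceded by "b" and followed by "ts", so neither word boundary holds.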

Demo output:

>>> df
  ID                      Text     Match
0  1    sample 123 45 678 text  12345678
1  2      sample as123456 text  as123456
2  3          sample As123 456  as123456
3  4     sample bas123456 text       NaN
4  5  sample bas123 456ts text       NaN
>>> 
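Note that findvalues returns only the first keyword that matches, while the matcher function in the question joins every matching keyword with "_". If that behaviour is needed, one possible variant is sketched below. It compiles one pattern per keyword instead of a single alternation; the names keyword_regex and find_all_keywords are hypothetical, and it returns None rather than np.nan so that it runs without pandas or NumPy.

```python
import re

keywords = ["12345678", "as123456"]

def keyword_regex(keyword):
    # Same idea as above: optional separators between escaped characters,
    # word boundaries on both ends, case-insensitive.
    body = r"[\W_]*".join(re.escape(ch) for ch in keyword)
    return re.compile(r"\b" + body + r"\b", flags=re.IGNORECASE)

compiled = [(kw, keyword_regex(kw)) for kw in keywords]

def find_all_keywords(text):
    # Collect every keyword whose pattern occurs in the text,
    # in master-dictionary order.
    found = [kw for kw, rx in compiled if rx.search(text)]
    return "_".join(found) if found else None

print(find_all_keywords("sample 123 45 678 text and As123 456"))
print(find_all_keywords("sample bas123456 text"))
```

With a dataframe like the one above, this could be applied as df['Match'] = df['Text'].apply(find_all_keywords).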

Checking with the in operator returns True whenever the keyword is merely a part of the other string. I think checking with:

if string == keyword:

will give you what you want once you have dealt with the spaces: if the result is not exactly equal to the keyword, it should return False.

Let me know if I understood your question correctly and whether this helps.
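One way to read the equality idea is to join consecutive whitespace-separated words of the text and compare each joined run to the keyword with ==. The sketch below is an interpretation under that assumption, not code from the answer; match_by_tokens is a hypothetical name, and unlike the regex approach it treats only whitespace as a separator.

```python
def match_by_tokens(text, keyword):
    tokens = text.lower().split()
    target = keyword.replace(" ", "").lower()
    for start in range(len(tokens)):
        joined = ""
        # Grow a run of consecutive words starting at this position.
        for token in tokens[start:]:
            joined += token
            if joined == target:
                return True
            # Stop early once the run can no longer become the keyword.
            if not target.startswith(joined):
                break
    return False

print(match_by_tokens("sample 123 45 678 text", "12345678"))
print(match_by_tokens("sample bas123 456ts text", "as123456"))
```

Because whole words are compared, a keyword embedded in a longer word such as bas123456 is never equal to the joined run and is rejected.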
