
Regex that recognises certain words in a sentence and only the first two words

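I have a problem. I want to use a regex to recognise certain text modules in a text, for example beach vibe some. The problem is that some text modules are three words long (or even longer). However, most people only write the first two words, and sometimes only an abbreviation of the second word.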

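Is there a way to tell the regex that it should count as a hit when only the first two words are recognised? And that it should only look at the first three letters of the second word?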

   customerId                          text          element  code
0           1    please use beach vibe some  beach vibe some     0
1           1     you should use beach vibe  beach vibe some     0
2           1           right use beach vib  beach vibe some     0
3           3              use floating pow   floating power     1
4           3  use floating stuff right now   floating stuff     2

import pandas as pd
import re
d = {
    "customerId": [1, 1, 1, 3, 3],
    "text": ["please use beach vibe some",
             "you should use beach vibe",
             "right use beach vib",
             'use floating pow',
             'use floating stuff right now'],
    "element": ['beach vibe some', 'beach vibe some', 'beach vibe some', 'floating power', 'floating stuff']
}
df = pd.DataFrame(data=d)
df['code'] = df['element'].astype('category').cat.codes
print(df)

def f(x):
    # 999 marks a row for which no element matched
    match = 999
    for element in df['element'].unique():
        if re.search(element, x['text'], re.IGNORECASE):
            # 1) the full element occurs in the text
            match = df['code'].loc[df['element'] == element].iloc[0]
            break
        elif re.search(' '.join(element.split()[:2]), x['text'], re.IGNORECASE):
            # 2) only the first two words occur in the text
            match = df['code'].loc[df['element'] == element].iloc[0]
            break
        else:
            # 3) first word plus the first three letters of the second word
            s = element.split()
            s[1] = s[1][:3]
            string = ' '.join(s[:2])
            if re.search(string, x['text'], re.IGNORECASE):
                match = df['code'].loc[df['element'] == element].iloc[0]
                break

    x['test'] = match
    return x

df['test'] = None
df = df.apply(f, axis=1)
print(df)

   customerId                          text          element  code  test
0           1    please use beach vibe some  beach vibe some     0     0
1           1     you should use beach vibe  beach vibe some     0     0
2           1           right use beach vib  beach vibe some     0     0
3           3              use floating pow   floating power     1     1
4           3  use floating stuff right now   floating stuff     2     2

Why use a regular expression at all?

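# Shortened key: first word plus the first three letters of the second word
element_parts = element.lower().split()
lookup_key = element_parts[0] + " " + element_parts[1][:3]
# A plain, case-insensitive substring test replaces the regex search
if lookup_key in x["text"].lower():
    # here we go ...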
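
To make the idea concrete, here is a minimal, self-contained sketch of the substring approach applied to the sample texts from the question. It is only an illustration under a few assumptions: the helper names `make_key` and `find_code` are made up for this example, the codes are assigned in sorted order to mirror what `astype('category').cat.codes` produces for these strings, every element is assumed to have at least two words, and the shortened keys are assumed to be unique.

```python
texts = [
    "please use beach vibe some",
    "you should use beach vibe",
    "right use beach vib",
    "use floating pow",
    "use floating stuff right now",
]
elements = ["beach vibe some", "floating power", "floating stuff"]

# Assumption: codes follow sorted order, like the category codes in the question.
codes = {element: code for code, element in enumerate(sorted(elements))}


def make_key(element):
    """Hypothetical helper: first word plus first three letters of the second word."""
    parts = element.lower().split()
    return parts[0] + " " + parts[1][:3]


# Map each shortened key to the code of its element.
key_to_code = {make_key(element): code for element, code in codes.items()}


def find_code(text, default=999):
    """Hypothetical helper: code of the first key found in the text, else default."""
    lowered = text.lower()
    for key, code in key_to_code.items():
        if key in lowered:
            return code
    return default


results = [find_code(text) for text in texts]
print(results)  # [0, 0, 0, 1, 2]
```

With the DataFrame from the question, the same function could be applied as `df['test'] = df['text'].apply(find_code)`. Note that a plain substring test also fires when the key appears inside a longer word, for example "beach vibrant" contains the key "beach vib". If that matters, a regex with word boundaries may still be the better tool.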

