
Regex that recognises certain words in a sentence and only the first two words

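I have a problem. I want to use a regex to recognise certain text modules in a text, for example "beach vibe some". The problem is that some text modules are three words long (or even longer). However, most people only use the first two words, possibly abbreviating the second one.

Is there a way to say that the regex should also hit when it recognises only the first two words? And that it should only look at the first three letters of the second word?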

   customerId                          text          element  code
0           1    please use beach vibe some  beach vibe some     0
1           1     you should use beach vibe  beach vibe some     0
2           1           right use beach vib  beach vibe some     0
3           3              use floating pow   floating power     1
4           3  use floating stuff right now   floating stuff     2

import pandas as pd
import re
d = {
    "customerId": [1, 1, 1, 3, 3],
    "text": ["please use beach vibe some",
             "you should use beach vibe",
             "right use beach vib",
             'use floating pow',
             'use floating stuff right now'],
    "element": ['beach vibe some', 'beach vibe some', 'beach vibe some', 'floating power', 'floating stuff']
}
df = pd.DataFrame(data=d)
df['code'] = df['element'].astype('category').cat.codes
print(df)

def f(x):
    # 999 is the sentinel for "no element matched"
    match = 999
    for element in df['element'].unique():
        words = element.split()
        # Try the full element, then its first two words, then the first
        # word plus the first three letters of the second word.
        candidates = [
            element,
            ' '.join(words[:2]),
            ' '.join([words[0], words[1][:3]]),
        ]
        # re.escape keeps regex metacharacters in an element literal
        if any(re.search(re.escape(c), x['text'], re.IGNORECASE) for c in candidates):
            match = df['code'].loc[df['element'] == element].iloc[0]
            break

    x['test'] = match
    return x

df['test'] = None
df = df.apply(f, axis=1)
print(df)

   customerId                          text          element  code  test
0           1    please use beach vibe some  beach vibe some     0     0
1           1     you should use beach vibe  beach vibe some     0     0
2           1           right use beach vib  beach vibe some     0     0
3           3              use floating pow   floating power     1     1
4           3  use floating stuff right now   floating stuff     2     2
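
On the regex route asked about here, one possible sketch is to build a pattern per element that requires the first word plus at least the first three letters of the second word, and makes the rest of the second word optional. The helper names `optional_tail`, `build_pattern` and `find_element` and the `min_letters` parameter are made up for this illustration, and the word-boundary behaviour is an assumption about what should count as a hit.

```python
import re

def optional_tail(tail):
    """Turn "er" into "(?:e(?:r)?)?" so each extra letter is optional."""
    pattern = ""
    for ch in reversed(tail):
        pattern = f"(?:{re.escape(ch)}{pattern})?"
    return pattern

def build_pattern(element, min_letters=3):
    """Match the first word of `element` and a prefix of its second word."""
    words = element.split()
    first = re.escape(words[0])
    if len(words) < 2:
        # Single-word element: just match the word itself
        return re.compile(rf"\b{first}\b", re.IGNORECASE)
    stem = re.escape(words[1][:min_letters])
    tail = optional_tail(words[1][min_letters:])
    # Words after the second one are ignored on purpose
    return re.compile(rf"\b{first}\s+{stem}{tail}\b", re.IGNORECASE)

def find_element(text, elements):
    """Return the first element whose pattern hits `text`, else None."""
    for element in elements:
        if build_pattern(element).search(text):
            return element
    return None

elements = ["beach vibe some", "floating power", "floating stuff"]
for text in ["right use beach vib", "use floating pow", "use floating powder"]:
    print(text, "->", find_element(text, elements))
```

Because of the trailing `\b`, "floating pow" and "floating power" hit but "floating powder" does not. Whether that strictness is wanted depends on how people actually abbreviate.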

Why use a regex at all?

element_parts = element.lower().split()
lookup_key = element_parts[0] + " " + element_parts[1][:3] 
if lookup_key in x["text"].lower():
    # here we go ...
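
The snippet above can be fleshed out into something runnable roughly as follows. This is only a sketch: `lookup_key`, `match_code`, the `codes` mapping and the `prefix_len` parameter are names invented here, and 999 is reused from the question as the "nothing matched" value.

```python
def lookup_key(element, prefix_len=3):
    """First word plus the first `prefix_len` letters of the second word."""
    parts = element.lower().split()
    if len(parts) < 2:
        # Nothing to abbreviate for empty or single-word elements
        return parts[0] if parts else ""
    return parts[0] + " " + parts[1][:prefix_len]

def match_code(text, codes, default=999):
    """Return the code of the first element whose key occurs in `text`."""
    haystack = text.lower()
    for element, code in codes.items():
        key = lookup_key(element)
        # Skip empty keys, since "" is a substring of every string
        if key and key in haystack:
            return code
    return default

# Element-to-code mapping taken from the question's table
codes = {"beach vibe some": 0, "floating power": 1, "floating stuff": 2}

for text in ["please use beach vibe some", "right use beach vib",
             "use floating pow", "use floating stuff right now"]:
    print(text, "->", match_code(text, codes))
```

With the DataFrame from the question, the mapping could be built with `dict(zip(df['element'], df['code']))` and applied with `df['text'].map(lambda t: match_code(t, codes))`. Note that a plain substring test has no word boundaries, so "beach vibrant" would also count as a hit.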

