Regex is not matching whole words (bigram) at the end of a string, only at the beginning and middle

Question

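I want to take a bigram and see whether it is present in two segments (strings), called source and target (as in translating a source language into a target language). For example, "star wars" is present in both "star wars movie" and "star wars film". This means that "star wars" is untranslated. I am using a regex so that whole words are matched rather than substrings. It works for the two segments above, but it does not work when "star wars" is at the end of the second string, as in "film star wars".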

The bigrams are read from a file that contains one bigram per line, and I remove the newline at the end:

topword = input_file0.readline().lower().replace('\n', "")

The source and target segments are read from files, and I also remove the newline at the end:

srcsegm = input_file1.readline().lower().replace('\n', "")
tgtsegm = input_file2.readline().lower().replace('\n', "")

The regex for the match is:

regex_match = re.compile(rf'\b{re.escape(topword)}\b')

I test whether there is a match in the source:

has_match_src = re.search(regex_match,srcsegm)

If there is a match, I test for a match in the target:

has_match_tgt = re.search(regex_match,tgtsegm)

If both match, I flag the term as "untranslated", because it is identical in the source and target languages.

I am printing the results to see what is happening, e.g.:

print(topword,"untr:",srcsegm,"=====",tgtsegm)
print(topword,"translated:",srcsegm,"=====",tgtsegm)

The results below are correct when "blu ray" is at the beginning or in the middle of the string:

blu ray  untr: blu ray rar ===== blu ray rar
blu ray  untr: soul blu ray disney ===== soul blu ray disney

The result is wrong when "blu ray" is at the end of the string:

blu ray  translated: sony blu ray player ===== sony odtwarzacz blu ray

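It should say "untr", since we can see "blu ray" in both the source and the target segment. The question is: why does it not produce a match at the end of the string?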

Here is the code:

topword = input_file0.readline().lower().replace('\n', "") # for ngrams, do not use strip and replace the newline
count_untr = 0
count_tr = 0

while len(topword)>0:   # looping the topword
    count_untr = 0
    count_tr = 0
    srcsegm = input_file1.readline().lower().replace('\n', "")
    tgtsegm = input_file2.readline().lower().replace('\n', "")
    regex_match = re.compile(rf'\b{re.escape(topword)}\b')
 
    while len(srcsegm)>0:     # looping the src and tgt segments for a topword
        has_match_src = re.search(regex_match,srcsegm) 
        if has_match_src != None:
            has_match_tgt = re.search(regex_match,tgtsegm)
            if has_match_tgt != None:
                count_untr += 1
                print(topword,"untr:",srcsegm,"=====",tgtsegm)
            else:
                count_tr += 1
                print(topword,"translated:",srcsegm,"=====",tgtsegm)

Thanks in advance.

Answer 1

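If, for example, you are looking for "star wars" in a string, you want to allow arbitrary whitespace between the two words (e.g. a newline), and the words must occur on word boundaries, then the actual regex you want to use is:

\bstar\s+wars\b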
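
As a quick sanity check, here is a minimal sketch showing that this pattern matches "star wars" at the start, in the middle, and at the end of a string, and also across a newline, while it does not match inside longer words. The sample strings are made up for illustration and are not from the question's data.

```python
import re

pattern = re.compile(r'\bstar\s+wars\b')

samples = [
    'star wars movie',      # bigram at the start
    'the star wars film',   # bigram in the middle
    'film star wars',       # bigram at the end
    'film star\nwars',      # newline between the two words
    'superstar warship',    # substrings only, no whole-word match
]

# True where the bigram occurs as whole words, False otherwise
results = [bool(pattern.search(s)) for s in samples]
print(results)
```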

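With that in mind, you should split topword into its two component words and build the search regex by escaping each word individually and joining them with whitespace in between: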

import re

#topword =  input_file0.readline().lower().strip()
topword = 'star wars'

#srcsegm = input_file1.readline().lower().strip()
srcsegm = 'i just saw star wars and loved it!'

#tgtsegm = input_file2.readline().lower().strip()
# newline between star and wars, and the bigram is at the end of the string:
tgtsegm = 'tomorrow I am planning on seeing star\nwars'

# allow for arbitrary white space between words:
split_words = topword.split()
regex_search = re.compile(rf'\b{re.escape(split_words[0])}\s+{re.escape(split_words[1])}\b')

if regex_search.search(srcsegm) and regex_search.search(tgtsegm):
    print('match in both')

Prints:

match in both
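
One possible explanation for the original failure, offered as an assumption rather than something the question confirms: the printed output shows two spaces after "blu ray", which suggests that topword still carries a trailing space once the newline has been removed. In that case the final \b has to sit between that space and a following word character, which cannot happen at the end of a string. The sketch below reproduces this behavior and uses a hypothetical helper, build_ngram_regex (not part of the original code), that extends the split-and-join approach to n-grams of any length.

```python
import re

def build_ngram_regex(ngram):
    """Hypothetical helper: whole-word pattern for an n-gram of any length."""
    # split() with no arguments also drops leading and trailing whitespace
    words = ngram.split()
    return re.compile(r'\b' + r'\s+'.join(re.escape(w) for w in words) + r'\b')

topword = 'blu ray '                   # assumed: trailing space left in the input file
tgtsegm = 'sony odtwarzacz blu ray'    # bigram at the end of the segment
midsegm = 'soul blu ray disney'        # bigram in the middle of the segment

naive = re.compile(rf'\b{re.escape(topword)}\b')   # pattern from the question
robust = build_ngram_regex(topword)

naive_mid = bool(naive.search(midsegm))    # space is followed by a word, so \b holds
naive_end = bool(naive.search(tgtsegm))    # no trailing space in the segment
robust_mid = bool(robust.search(midsegm))
robust_end = bool(robust.search(tgtsegm))
print(naive_mid, naive_end, robust_mid, robust_end)
```

If the input file really does contain trailing spaces, calling strip() on topword would also work for the fixed-space case. Splitting and rejoining with \s+ additionally tolerates tabs or repeated spaces between the words.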

Solution 1
0 2021-12-29 16:22:32
