Regex does not match a whole-word bigram at the end of a string, only at the beginning and in the middle

Question

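I want to take a bigram and check whether it occurs in two segments (strings), called source and target (as when translating from a source language into a target language). For example, "star wars" occurs in both "star wars movie" and "star wars film", which means "star wars" is untranslated. I am using a regex so that whole words are matched rather than substrings. It works for the two segments above, but when "star wars" is at the end of the second string, as in "film star wars", it does not work.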

The bigrams are read from a file that contains one bigram per line, and I remove the trailing newline:

topword = input_file0.readline().lower().replace('\n', "")

The source and target segments are read from files, and I remove the trailing newline there as well:

srcsegm = input_file1.readline().lower().replace('\n', "")
tgtsegm = input_file2.readline().lower().replace('\n', "")

The regex for the match is:

regex_match = re.compile(rf'\b{re.escape(topword)}\b')

I test whether there is a match in the source:

has_match_src = re.search(regex_match,srcsegm)

If there is a match, I test for a match in the target:

has_match_tgt = re.search(regex_match,tgtsegm)

If both match, I mark the bigram as an "untranslated" term, because it is identical in the source and target languages.
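
The check described so far can be condensed into a small sketch. The helper name `is_untranslated` and the sample strings are assumptions made for illustration; they are not part of the original script.

```python
import re

def is_untranslated(topword: str, srcsegm: str, tgtsegm: str) -> bool:
    """Return True when topword occurs as whole words in both segments."""
    pattern = re.compile(rf'\b{re.escape(topword)}\b')
    # A term only counts as untranslated if it is found in the source
    # segment and in the target segment.
    return bool(pattern.search(srcsegm)) and bool(pattern.search(tgtsegm))

print(is_untranslated('star wars', 'star wars movie', 'star wars film'))
print(is_untranslated('star wars', 'star wars movie', 'star warship'))
```

With a clean `topword` (no stray whitespace), the same pattern also matches at the end of a segment, which suggests the problem lies in the input rather than in `\b` itself.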

I am printing the results to see what is going on, e.g.:

print(topword,"untr:",srcsegm,"=====",tgtsegm)
print(topword,"translated:",srcsegm,"=====",tgtsegm)

The results below are correct when "blu ray" is at the beginning or in the middle of the string:

blu ray  untr: blu ray rar ===== blu ray rar
blu ray  untr: soul blu ray disney ===== soul blu ray disney

The result is wrong when "blu ray" is at the end of the string:

blu ray  translated: sony blu ray player ===== sony odtwarzacz blu ray

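It should say "untr", since "blu ray" is visible in both the source and the target segment. The question is: why does the regex not produce a match at the end of the string?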

Here is the code:

topword = input_file0.readline().lower().replace('\n', "") # for ngrams, do not use strip and replace the newline
count_untr = 0
count_tr = 0

while len(topword)>0:   # looping the topword
    count_untr = 0
    count_tr = 0
    srcsegm = input_file1.readline().lower().replace('\n', "")
    tgtsegm = input_file2.readline().lower().replace('\n', "")
    regex_match = re.compile(rf'\b{re.escape(topword)}\b')
 
    while len(srcsegm)>0:     # looping the src and tgt segments for a topword
        has_match_src = re.search(regex_match,srcsegm) 
        if has_match_src != None:
            has_match_tgt = re.search(regex_match,tgtsegm)
            if has_match_tgt != None:
                count_untr += 1
                print(topword,"untr:",srcsegm,"=====",tgtsegm)
            else:
                count_tr += 1
                print(topword,"translated:",srcsegm,"=====",tgtsegm)

Thanks in advance.

Answer 1

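If, for example, you are looking for "star wars" in a string, you want to allow arbitrary whitespace between the two words (a newline, say), and the words must sit on word boundaries, then the regex you actually want is: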

\bstar\s+wars\b
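
A quick check of this kind of pattern, written with a literal `wars` (a `\w` in that position would accept any word character, so `cars` would match too). The sample strings are made up for illustration.

```python
import re

pattern = re.compile(r'\bstar\s+wars\b')

samples = [
    'star wars movie',   # bigram at the start
    'film star\nwars',   # newline between the words, bigram at the end
    'star warship',      # "wars" is only a prefix of a longer word
    'superstar wars',    # "star" is only a suffix of a longer word
]
results = {s: bool(pattern.search(s)) for s in samples}
for sample, matched in results.items():
    print(repr(sample), matched)

# For comparison: \w followed by "ars" also accepts "cars".
loose = re.compile(r'\bstar\s+\wars\b')
print(bool(loose.search('star cars')))
```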

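With that in mind, you should split topword into its two component words and build the search regex by escaping each word separately and joining them with a whitespace pattern: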

import re

#topword =  input_file0.readline().lower().strip()
topword = 'star wars'

#srcsegm = input_file1.readline().lower().strip()
srcsegm = 'i just saw star wars and loved it!'

#tgtsegm = input_file2.readline().lower().strip()
# newline between star and wars, and the bigram is at the end of the string:
tgtsegm = 'tomorrow I am planning on seeing star\nwars'

# allow for arbitrary white space between words:
split_words = topword.split()
regex_search = re.compile(rf'\b{re.escape(split_words[0])}\s+{re.escape(split_words[1])}\b')

if regex_search.search(srcsegm) and regex_search.search(tgtsegm):
    print('match in both')

Prints:

match in both
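
One possible explanation for the end-of-string failure, offered as an assumption rather than a confirmed diagnosis: the question's output shows two spaces after `blu ray`, which would be consistent with `topword` carrying a trailing space that `replace('\n', "")` does not remove. In that case the escaped pattern ends in a space followed by `\b`, which needs a word character right after the space and so cannot match at the end of a segment. The sketch below reproduces that behaviour and applies the same split-and-join idea to n-grams of any length; `build_ngram_regex` is a hypothetical helper name.

```python
import re

def build_ngram_regex(ngram: str) -> re.Pattern:
    """Whole-word regex for an n-gram, tolerant of stray whitespace."""
    # split() with no arguments drops leading/trailing whitespace and
    # collapses runs of whitespace, so 'blu ray ' yields ['blu', 'ray'].
    words = ngram.split()
    return re.compile(r'\b' + r'\s+'.join(re.escape(w) for w in words) + r'\b')

topword = 'blu ray '  # assumed: trailing space left over from the input file
tgtsegm = 'sony odtwarzacz blu ray'

# The question's pattern: the trailing space becomes part of the regex.
naive = re.compile(rf'\b{re.escape(topword)}\b')
print(bool(naive.search('soul blu ray disney')))   # bigram in the middle
print(bool(naive.search(tgtsegm)))                 # bigram at the end

# The split-and-join pattern ignores the stray space.
print(bool(build_ngram_regex(topword).search(tgtsegm)))
```

If the input file does contain such trailing spaces, calling `.strip()` on each line (as in the commented-out lines above) would remove them before the pattern is built.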
