Regex not matching a whole word (bigram) at the end of a string, only at the beginning and middle

I want to take a bigram and check whether it is present in two segments (strings), called source and target (as in a source language being translated into a target language). For example, "star wars" is present in "star wars movie" and in "star wars filme", which means that "star wars" is untranslated. I am using a regular expression so that the match is on whole words and not substrings. It works for the two segments above, but not when "star wars" is at the end of the second string, as in "filme star wars".

The bigram is read from a file that contains one bigram per line, and I am removing the newline at the end:

topword = input_file0.readline().lower().replace('\n', "")

The source and target segments are read from files, and I am removing the newline at the end:

srcsegm = input_file1.readline().lower().replace('\n', "")
tgtsegm = input_file2.readline().lower().replace('\n', "")

The regex for the match is:

regex_match = re.compile(rf'\b{re.escape(topword)}\b')

I test if there is a match in the source:

has_match_src = re.search(regex_match,srcsegm)

If there is a match, I test for a match in the target:

has_match_tgt = re.search(regex_match,tgtsegm)

If both are true, I mark this as an "untranslated" term, because it is the same in source and target languages.

I am printing the results to see what is happening:

print(topword,"untr:",srcsegm,"=====",tgtsegm)
print(topword,"translated:",srcsegm,"=====",tgtsegm)

The results below are correct when "blu ray" is at the beginning or in the middle of the string:

blu ray  untr: blu ray rar ===== blu ray rar
blu ray  untr: soul blu ray disney ===== soul blu ray disney

And wrong when "blu ray" is at the end of the string:

blu ray  translated: sony blu ray player ===== sony odtwarzacz blu ray

It should say "untr" since we can see "blu ray" in the source segment and also in the target. The question is: why is it not producing a match at the end of the string?

This is the code:

topword = input_file0.readline().lower().replace('\n', "") # for ngrams, do not use strip and replace the newline
count_untr = 0
count_tr = 0

while len(topword)>0:   # looping the topword
    count_untr = 0
    count_tr = 0
    srcsegm = input_file1.readline().lower().replace('\n', "")
    tgtsegm = input_file2.readline().lower().replace('\n', "")
    regex_match = re.compile(rf'\b{re.escape(topword)}\b')
 
    while len(srcsegm)>0:     # looping the src and tgt segments for a topword
        has_match_src = re.search(regex_match,srcsegm) 
        if has_match_src != None:
            has_match_tgt = re.search(regex_match,tgtsegm)
            if has_match_tgt != None:
                count_untr += 1
                print(topword,"untr:",srcsegm,"=====",tgtsegm)
            else:
                count_tr += 1
                print(topword,"translated:",srcsegm,"=====",tgtsegm)

Thanks in advance.

If, for example, you are looking for 'star wars' within a string, you want to allow arbitrary whitespace (such as a newline) between the two words, and the words must appear on word boundaries, then the regex you actually want is:

\bstar\s+wars\b
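
As a quick check, the sketch below runs a pattern of that shape (whole words, any whitespace between them) against a few made-up strings. The sample strings are illustrative assumptions, not data from the question; they cover the beginning, middle and end of a string, plus two cases where the words appear only as substrings.

```python
import re

# Whole-word pattern with arbitrary whitespace between the two words.
pattern = re.compile(r'\bstar\s+wars\b')

# Illustrative strings (assumed for this sketch, not taken from the question).
samples = [
    'star wars movie',       # bigram at the beginning
    'the star wars movie',   # bigram in the middle
    'filme star wars',       # bigram at the end of the string
    'filme star\nwars',      # newline between the two words
    'superstar wars',        # "star" is only part of a longer word
    'star warship',          # "wars" is only part of a longer word
]

# Map each sample to whether the pattern finds a match in it.
results = {text: bool(pattern.search(text)) for text in samples}

for text, matched in results.items():
    print(repr(text), matched)
```

The first four strings should match and the last two should not, since `\b` only matches where a word character meets a non-word character or the start or end of the string.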

With that in mind, you should split topword into its two component words and build the search regex by escaping each word individually and combining them with whitespace in between:

import re

#topword =  input_file0.readline().lower().strip()
topword = 'star wars'

#srcsegm = input_file1.readline().lower().strip()
srcsegm = 'i just saw star wars and loved it!'

#tgtsegm = input_file2.readline().lower().strip()
# newline between star and wars, and the bigram is at the end of the string:
tgtsegm = 'tomorrow I am planning on seeing star\nwars'

# allow for arbitrary white space between words:
split_words = topword.split()
regex_search = re.compile(rf'\b{re.escape(split_words[0])}\s+{re.escape(split_words[1])}\b')

if regex_search.search(srcsegm) and regex_search.search(tgtsegm):
    print('match in both')

Prints:

match in both
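
One possible explanation for the original symptom, offered as an assumption rather than a confirmed diagnosis: the question's output shows two spaces after "blu ray", which suggests the bigram read from the file carried a trailing space (only the newline was removed, not other whitespace). With re.escape, that space becomes part of the pattern, and the final \b then needs a word character right after the space, which cannot happen at the end of a string. Splitting on whitespace discards such a space. The sketch below reproduces this under that assumption; the helper name build_ngram_regex is hypothetical and also handles n-grams with more than two words.

```python
import re

def build_ngram_regex(ngram):
    # Hypothetical helper: compile a whole-word regex for an n-gram.
    # str.split() with no arguments drops leading/trailing whitespace;
    # the escaped words are rejoined so any whitespace run may separate them.
    words = ngram.split()
    if not words:
        raise ValueError('empty n-gram')
    body = r'\s+'.join(re.escape(word) for word in words)
    return re.compile(r'\b' + body + r'\b')

# Assumption: the bigram line had a trailing space before the newline.
topword = 'blu ray \n'.lower().replace('\n', '')  # leaves 'blu ray '

srcsegm = 'sony blu ray player'
tgtsegm = 'sony odtwarzacz blu ray'

# Pattern built as in the question: the trailing space must be matched too.
original = re.compile(rf'\b{re.escape(topword)}\b')
# Pattern built from the individual words.
fixed = build_ngram_regex(topword)

print(bool(original.search(srcsegm)), bool(original.search(tgtsegm)))  # True False
print(bool(fixed.search(srcsegm)), bool(fixed.search(tgtsegm)))        # True True
```

If the input file really does contain trailing spaces, using strip() when reading the bigram would have the same effect on the question's original pattern.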
