
Find whether a sentence has the starting words of another sentence or the ending words of the same sentence

For example, I have a set of sentences like this:

New York is in New York State
D.C. is the capital of United States
The weather is cool in the south of that country.
Lets take a bus to get to point b from point a.

And a sentence like this:

is cool in the south of that country

The output should be: The weather is cool in the south of that country.

If I have an input like "of United States The weather is cool", the output should be:

D.C. is the capital of United States The weather is cool in the south of that country.

So far I have tried difflib and obtained the overlap, but that does not quite solve the problem in all cases.
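The difflib attempt itself is not shown above. As a rough sketch of what a word-level overlap with difflib might look like (the helper name longest_word_overlap is hypothetical, and this is only one way to use the library):

```python
from difflib import SequenceMatcher

def longest_word_overlap(sentence, query):
    """Return the longest run of consecutive words shared by both strings."""
    a, b = sentence.split(), query.split()
    # Compare word lists rather than characters; autojunk is disabled so
    # frequent words are never silently ignored.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]

overlap = longest_word_overlap("D.C. is the capital of United States",
                               "of United States The weather is cool")
print(overlap)
```

This finds the longest shared run of words, but that run can sit anywhere in either sentence rather than at its start or end, which may be why it does not cover every case here.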

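You can build, from the sentences, one dictionary of starting expressions and one of ending expressions. Then look up the prefix and the suffix of the sentence you want to extend in those dictionaries. In both cases you need to build/check a key for every run of words taken from the beginning and from the end: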

sentences = """New York is in New York State
D.C. is the capital of United States
The weather is cool in the south of that country
Lets take a bus to get to point b from point a""".split("\n")

# ends: every word-suffix of a sentence -> the words that precede it
ends   =  { tuple(sWords[i:]):sWords[:i] for s in sentences
               for sWords in [s.split()] for i in range(len(sWords)) }
# starts: every word-prefix of a sentence -> the words that follow it
starts  = { tuple(sWords[:i]):sWords[i:] for s in sentences
               for sWords in [s.split()] for i in range(1,len(sWords)+1) }

def extendSentence(sentence):
    sWords   = sentence.split()
    # first (shortest) leading run of words that is the ending of a known sentence
    prefix   = next( (ends[p] for i in range(1,len(sWords)+1)
                      for p in [tuple(sWords[:i])] if p in ends),
                    [])
    # first (longest) trailing run of words that is the start of a known sentence
    suffix   = next( (starts[p] for i in range(len(sWords))
                      for p in [tuple(sWords[i:])] if p in starts),
                    [])
    return " ".join(prefix + [sentence] + suffix)

Output:

print(extendSentence("of United States The weather is cool"))

# D.C. is the capital of United States The weather is cool in the south of that country

print(extendSentence("is cool in the south of that country"))

# The weather is cool in the south of that country
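To make the shape of the two dictionaries concrete, here is a small illustration on a single three-word sentence. The helper names suffix_keys and prefix_keys are made up for this sketch; they apply the same comprehensions as ends and starts to one sentence only.

```python
def suffix_keys(sentence):
    # Same rule as `ends`: each word-suffix maps to the words before it.
    words = sentence.split()
    return {tuple(words[i:]): words[:i] for i in range(len(words))}

def prefix_keys(sentence):
    # Same rule as `starts`: each word-prefix maps to the words after it.
    words = sentence.split()
    return {tuple(words[:i]): words[i:] for i in range(1, len(words) + 1)}

demo_ends = suffix_keys("New York State")
demo_starts = prefix_keys("New York State")

for key, rest in demo_ends.items():
    print("ends  ", key, "->", rest)
for key, rest in demo_starts.items():
    print("starts", key, "->", rest)
```

Each sentence of n words contributes n keys to each dictionary, so a lookup only has to test the n leading and n trailing word runs of the input.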

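Note that I had to remove the periods at the end of the sentences, because they prevent matching. You will need to clean these up in the dictionary-building step.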
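One possible way to do that cleanup, as a sketch only: normalize the words used as dictionary keys while keeping the original words as values, so the added text keeps its punctuation. The names normalize, build_indexes and extend_sentence are hypothetical, and stripping trailing ".,;:!?" is an assumption about what counts as punctuation here (it also turns "D.C." into the key "D.C", which is harmless as long as the input is normalized the same way).

```python
RAW_SENTENCES = [
    "New York is in New York State",
    "D.C. is the capital of United States",
    "The weather is cool in the south of that country.",
    "Lets take a bus to get to point b from point a.",
]

def normalize(word):
    # Assumption: only trailing punctuation gets in the way of matching.
    return word.rstrip(".,;:!?")

def build_indexes(sentences):
    ends, starts = {}, {}
    for s in sentences:
        words = s.split()
        keys = [normalize(w) for w in words]
        for i in range(len(words)):
            ends[tuple(keys[i:])] = words[:i]      # cleaned key, original words
        for i in range(1, len(words) + 1):
            starts[tuple(keys[:i])] = words[i:]
    return ends, starts

def extend_sentence(sentence, ends, starts):
    words = sentence.split()
    keys = [normalize(w) for w in words]
    # Same search order as extendSentence above, but on cleaned keys.
    prefix = next((ends[p] for i in range(1, len(keys) + 1)
                   for p in [tuple(keys[:i])] if p in ends), [])
    suffix = next((starts[p] for i in range(len(keys))
                   for p in [tuple(keys[i:])] if p in starts), [])
    return " ".join(prefix + words + suffix)

ends_idx, starts_idx = build_indexes(RAW_SENTENCES)
result = extend_sentence("of United States The weather is cool",
                         ends_idx, starts_idx)
print(result)
```

The input itself is echoed as typed, so a trailing period appears in the result only when it comes from the added suffix or was already in the input.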
