
Obtaining substring from string based on substring matching and string index

I have several strings that are guaranteed to contain myWord (sometimes more than once; only the first occurrence should be handled), but the strings vary in length. Some contain hundreds of words, others only a few.

I would like to extract a snippet from the text. The rule is that the snippet should contain myWord plus the X words before it and the X words after it.

Something like this:

rawText= "This is an example lorem ipsum sentence for a Stackoverflow question."

myWord = "sentence"

Let's say I would like to get the word 'sentence' plus/minus 3 words, like this:

"example lorem ipsum sentence for a Stackoverflow"

I have a working solution, but it cuts the snippet by number of characters rather than by the number of words before/after myWord. So my question is: is there a more suitable solution, maybe a built-in Python function, to achieve my goal?

The current solution I use:

myWord = "mollis"
rawText = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Suspendisse sit amet arcu vulputate, sodales arcu non, finibus odio. Aliquam sed tincidunt nisi, eu scelerisque lectus. Curabitur in nibh enim. Duis arcu ante, mollis sed iaculis non, hendrerit ut odio. Curabitur gravida condimentum posuere. Sed et arcu finibus felis auctor mollis et id risus. Nam urna tellus, ultricies a aliquam at, euismod et erat. Cras pretium venenatis ornare. Donec pulvinar dui eu dui facilisis commodo. Vivamus eget ultrices turpis, vel egestas lacus."

# Character index of the first occurrence (case-insensitive); -1 if absent
wordIndexNumber = rawText.lower().find(myWord.lower())

# The total length of the text (in chars)
textLength = len(rawText)

# Take up to 80 characters on each side, clamped to the bounds of the text
textIndex1 = max(0, wordIndexNumber - 80)
textIndex2 = min(textLength, wordIndexNumber + 80)

snippet = rawText[textIndex1:textIndex2]

print(snippet)

This is one approach: split the text into words and slice the resulting list.

Demo:

rawText = "This is an example lorem ipsum sentence for a Stackoverflow question."
myWord = "sentence"
rawTextList = rawText.split()
# Position of the first occurrence; raises ValueError if myWord is not a token
idx = rawTextList.index(myWord)
# Clamp the start at 0 so a negative index does not wrap around to the end
frontVal = " ".join(rawTextList[max(idx - 3, 0):idx])
backVal = " ".join(rawTextList[idx:idx + 4])

print("{} {}".format(frontVal, backVal).strip())

Output:

example lorem ipsum sentence for a Stackoverflow
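One thing to watch when slicing around an index: if the word sits within the first few positions, subtracting the window size gives a negative start, and Python then counts from the end of the list. A small sketch of the effect, using the same example sentence (the variable names here are only for illustration):

```python
words = "This is an example lorem ipsum sentence for a Stackoverflow question.".split()

idx = words.index("is")  # 1, so idx - 3 == -2

# A negative start counts from the end of the list, so this slice is empty.
unclamped = words[idx - 3:idx + 4]

# Clamping the start at 0 keeps the words that actually precede the hit.
clamped = words[max(idx - 3, 0):idx + 4]

print(unclamped)  # []
print(clamped)    # ['This', 'is', 'an', 'example', 'lorem']
```

The end of the slice needs no such guard, because a stop index past the end of a list is simply truncated.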

Here is a solution using list slicing:

def get_context_around(text, word, accuracy):
    words = text.split()
    # Index of the first exact match; raises ValueError if the word is absent
    first_hit = words.index(word)
    # Clamp the start at 0 so a negative index does not wrap around
    start = max(first_hit - accuracy, 0)

    return ' '.join(words[start:first_hit + accuracy + 1])


raw_text = "This is an example lorem ipsum sentence for a Stackoverflow question."
my_word = "sentence"
print(get_context_around(raw_text, my_word, accuracy=3))  # example lorem ipsum sentence for a Stackoverflow
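Both answers match whole whitespace-separated tokens exactly, so a search for "odio" would not find "odio." and the case has to agree. A possible variant, offered only as a sketch: it compares a normalized copy of each token (punctuation stripped, lowercased) and returns None when nothing matches. The function name snippet_around and its parameters are made up for this example, and the sample text is a shortened piece of the question's lorem ipsum.

```python
import string


def snippet_around(text, word, n):
    """Return `word` plus up to n words on each side, or None if absent."""
    words = text.split()
    target = word.lower()
    for i, token in enumerate(words):
        # Compare on a normalized copy so "Mollis," still matches "mollis".
        if token.strip(string.punctuation).lower() == target:
            # Clamp the start at 0; the original tokens are returned untouched.
            return " ".join(words[max(i - n, 0):i + n + 1])
    return None


raw_text = "Duis arcu ante, mollis sed iaculis non, hendrerit ut odio. Curabitur gravida."
print(snippet_around(raw_text, "odio", 3))  # non, hendrerit ut odio. Curabitur gravida.
```

Because the loop stops at the first match, only the first occurrence is handled, as the question asks.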
