
Obtaining substring from string based on substring matching and string index

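I have strings that are guaranteed to contain myWord (sometimes more than once, in which case only the first occurrence should be handled), but the strings vary in length. Some contain hundreds of words, others only a few.

I would like to find a way to get an excerpt from the text. The rule is: the snippet should contain myWord plus X words before and after it.

Like this: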

rawText= "This is an example lorem ipsum sentence for a Stackoverflow question."

myWord = "sentence"

Say I want the text around the word "sentence", plus/minus 3 words, like this:

"example lorem ipsum sentence for a Stackoverflow"

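I was able to create a working solution, but it cuts the snippet by number of characters rather than by number of words before/after myWord. So my question is: is there a more suitable solution, perhaps a built-in Python function, that achieves this?

The solution I currently use: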

myWord = "mollis"
rawText = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Suspendisse sit amet arcu vulputate, sodales arcu non, finibus odio. Aliquam sed tincidunt nisi, eu scelerisque lectus. Curabitur in nibh enim. Duis arcu ante, mollis sed iaculis non, hendrerit ut odio. Curabitur gravida condimentum posuere. Sed et arcu finibus felis auctor mollis et id risus. Nam urna tellus, ultricies a aliquam at, euismod et erat. Cras pretium venenatis ornare. Donec pulvinar dui eu dui facilisis commodo. Vivamus eget ultrices turpis, vel egestas lacus."

# Character index of the first occurrence of the word (the text is lowercased first)
wordIndexNumber = rawText.lower().find(myWord)

# The total length of the text (in chars)
textLength = len(rawText)

# Take up to 80 characters on each side of the match, clamped to the text bounds
textIndex1 = max(0, wordIndexNumber - 80)
textIndex2 = min(textLength, wordIndexNumber + 80)

snippet = rawText[textIndex1:textIndex2]

print(snippet)

Here is one approach that splits the string into words and slices the resulting list.

Demo:

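rawText = "This is an example lorem ipsum sentence for a Stackoverflow question."
myWord = "sentence"
rawTextList = rawText.split()
# Look up the first occurrence once; index() raises ValueError if myWord is absent
wordIndex = rawTextList.index(myWord)
# max() prevents a negative start index when fewer than 3 words precede myWord
frontVal = " ".join(rawTextList[max(0, wordIndex - 3):wordIndex])
backVal = " ".join(rawTextList[wordIndex:wordIndex + 4])

# strip() drops the leading space when there are no words before myWord
print("{} {}".format(frontVal, backVal).strip())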

Output:

example lorem ipsum sentence for a Stackoverflow
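One caveat with the split-based demo above: `split()` leaves punctuation attached to words and `index()` is case-sensitive, so a token such as `Sentence,` would not match `sentence`. The following is only a sketch of one way around that, assuming it is acceptable to compare on lowercased tokens with leading and trailing punctuation stripped. The function name `snippet_around` and its parameter `n` are made up for illustration.

```python
import string


def snippet_around(text, word, n=3):
    """Return the first occurrence of `word` with up to `n` words on each side."""
    tokens = text.split()
    # Compare on a normalised copy so "Sentence," still matches "sentence".
    normalised = [t.strip(string.punctuation).lower() for t in tokens]
    try:
        idx = normalised.index(word.lower())
    except ValueError:
        return ""  # assumption: an empty string signals "word not found"
    start = max(0, idx - n)  # avoid a negative start index near the beginning
    # The slice is taken from the original tokens, so punctuation is preserved.
    return " ".join(tokens[start:idx + n + 1])


text = "This is an example lorem ipsum Sentence, for a Stackoverflow question."
print(snippet_around(text, "sentence"))
# example lorem ipsum Sentence, for a Stackoverflow
```

The snippet is built from the original tokens, so the returned text keeps its punctuation and capitalisation; only the comparison uses the normalised form.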

Here is a solution using list slicing:

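def get_context_around(text, word, accuracy):
    words = text.split()
    # index() returns the first occurrence and raises ValueError if word is absent
    first_hit = words.index(word)
    # Clamp the start so a hit near the beginning does not yield a negative index
    start = max(0, first_hit - accuracy)

    return ' '.join(words[start:first_hit + accuracy + 1])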


raw_text= "This is an example lorem ipsum sentence for a Stackoverflow question."
my_word = "sentence"
print(get_context_around(raw_text, my_word, accuracy=3)) # example lorem ipsum sentence for a Stackoverflow
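Since the question also mentions string indexes, here is a possible variant that counts words but slices the original string, so the whitespace inside the excerpt is kept exactly as written instead of being re-joined with single spaces. It is a sketch only: the name `excerpt` and the separate `before`/`after` parameters are illustrative assumptions, and matching is exact (case-sensitive, punctuation included).

```python
import re


def excerpt(text, word, before=3, after=3):
    """Slice `text` around the first token equal to `word`, counting in words."""
    # Character span (start, end) of every whitespace-delimited token.
    spans = [m.span() for m in re.finditer(r"\S+", text)]
    # Position of the first token that equals `word`, or None if there is none.
    hit = next((i for i, (s, e) in enumerate(spans) if text[s:e] == word), None)
    if hit is None:
        return None
    first = max(0, hit - before)            # clamp at the first token
    last = min(len(spans) - 1, hit + after)  # clamp at the last token
    # Cut from the start of the first kept token to the end of the last one.
    return text[spans[first][0]:spans[last][1]]


raw_text = "This is an example lorem ipsum sentence for a Stackoverflow question."
print(excerpt(raw_text, "sentence"))
# example lorem ipsum sentence for a Stackoverflow
```

Returning `None` for a missing word is a design choice made here for illustration; raising `ValueError`, as `list.index` does, would be equally reasonable.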

