Find the start and end position of a word in a string, given the word's index in a tag list

Question

I have a sentence

str = 'cold weather gives me cold'

and a list

 tag = ['O','O','O','O','disease']

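This indicates that the 5th word in the sentence is of type disease. Now I need to get the start and end position of that 5th word.

If I simply do a string search for 'cold', it gives me the start position of the first occurrence of 'cold', not the fifth word.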
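A minimal sketch of the problem described above, using only the built-in `str.find` and `str.rfind`. The variable name `sentence` is chosen here for illustration and is not from the question.

```python
sentence = 'cold weather gives me cold'

# str.find returns the index of the first match only,
# so it points at word 1 rather than word 5.
first = sentence.find('cold')

# str.rfind returns the last match. It happens to be right for this
# sentence, but it is not a general answer when the tagged word is
# neither the first nor the last occurrence.
last = sentence.rfind('cold')

print(first)  # 0
print(last)   # 22
```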

Answer 1

This should do it.

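def get(text, target_index):
  words = text.split(" ")
  # length of everything before the target word, plus one space (none if it is the first word)
  start = len(" ".join(words[:target_index])) + (1 if target_index > 0 else 0)
  # strip periods so a trailing '.' is not counted; end is exclusive
  end = start + len(words[target_index].replace('.', ''))
  return (start, end)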

str = 'cold weather gives me cold.' 
tag = ['O','O','O','O','disease']
start,end = get(str,tag.index('disease'))
print(start,end,str[start:end]) # outputs 22 26 cold

str = 'cold weather gives me cold'
tag = ['O','O','O','O','disease']
start,end = get(str,tag.index('disease'))
print(start,end,str[start:end]) # outputs 22 26 cold

str = 'cold weather gives me cold and cough' 
tag = ['O','O','O','O','disease']
start,end = get(str,tag.index('disease'))
print(start,end,str[start:end]) # outputs 22 26 cold

See it in action here.

Hope this helps!

Answer 2

First find the index of the disease in the tags, then look up the disease name in the data, then compute the start and end indices:

strData = 'cold weather gives me cold' 
tag = ['O','O','O','O','disease']
diseaseIndex = tag.index('disease')
diseaseName = strData.split()[diseaseIndex]
print(diseaseName)
diseaseNameStartIndex = sum(len(word) for (index, word) in enumerate(strData.split()) if index< diseaseIndex ) + diseaseIndex
diseaseNameEndIndex = diseaseNameStartIndex + len(diseaseName) -1
print("diseaseNameStartIndex = ",diseaseNameStartIndex)
print("diseaseNameEndIndex = ",diseaseNameEndIndex)

Output:

cold
diseaseNameStartIndex =  22
diseaseNameEndIndex =  25
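`tag.index('disease')` only finds the first tagged word. If a sentence can carry several tags, one possible extension is sketched below. The helper name `tagged_spans` and the example sentence are hypothetical, and the sketch assumes there is exactly one tag per whitespace-separated word. Note that it returns an exclusive end, so `text[start:end]` is the word itself.

```python
import re

def tagged_spans(text, tags, label):
    """Return (word, start, end) for every word whose tag equals label.

    end is exclusive, so text[start:end] == word.
    """
    matches = list(re.finditer(r'\S+', text))
    if len(matches) != len(tags):
        raise ValueError('number of words and number of tags differ')
    return [(m.group(), m.start(), m.end())
            for m, t in zip(matches, tags) if t == label]

text = 'cold and cough follow cold weather'
tags = ['disease', 'O', 'disease', 'O', 'O', 'O']
print(tagged_spans(text, tags, 'disease'))
# [('cold', 0, 4), ('cough', 9, 14)]
```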

Answer 3

The following prints the start and end position of the given word, assuming the words are separated by single spaces:

str = 'cold weather gives me cold'
word_idx = 4 # index of the word we are looking for

split_str = str.split(' ')
print(split_str[word_idx]) # outputs 'cold'

start_pos = 0
for i in range(word_idx):
    start_pos += len(split_str[i]) + 1 # add one because of the spaces between words
end_pos = start_pos + len(split_str[word_idx]) - 1

print(start_pos) # prints 22
print(end_pos) # prints 25
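The loop above relies on exactly one space between words. If the text may contain tabs or repeated spaces, a variation is to locate each word with `str.find`, starting the search where the previous word ended. This is only a sketch; the function name `word_span` is made up for this example, and the end position is inclusive to match the answer above.

```python
def word_span(text, word_idx):
    """Return the (start, end) of the word_idx-th word, end inclusive."""
    pos = 0
    for i, word in enumerate(text.split()):
        start = text.find(word, pos)  # search from the end of the previous word
        end = start + len(word) - 1   # inclusive end position
        if i == word_idx:
            return start, end
        pos = end + 1
    raise IndexError('word index out of range')

print(word_span('cold weather gives me cold', 4))     # (22, 25)
print(word_span('cold   weather gives\tme cold', 4))  # (24, 27)
```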

Answer 4

You can simply split the string and then join it again, though this is a bit awkward.

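string = 'cold weather gives me cold'
string_list = string.split(" ")
word_start = len(" ".join(string_list[:4])) + 1  # +1 skips the space before the word
word_end = word_start + len(string_list[4])      # exclusive end: string[word_start:word_end] == 'cold'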

Answer 5

Using itertools and re:

import re
from itertools import accumulate

def find_index(string, n):
    words = string.split()
    len_word = len(words[n])
    end_index = list(accumulate(map(len, re.split(r'(\s)', string))))[::2][n]  # raw string avoids an invalid-escape warning
    return end_index - len_word, end_index - 1

Usage:

find_index('cold weather gives me cold', 4) #5th word means 4 in indexing

Output:

(22, 25)
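To see why the `[::2]` slice works, it can help to print the intermediate values. The sketch below assumes single whitespace characters between words; the names `pieces`, `ends` and `word_ends` are introduced here for illustration only.

```python
import re
from itertools import accumulate

string = 'cold weather gives me cold'

# The capturing group keeps the separators, so words and spaces alternate.
pieces = re.split(r'(\s)', string)

# Running total of piece lengths = exclusive end offset of each piece.
ends = list(accumulate(map(len, pieces)))

# Every second entry belongs to a word rather than a separator.
word_ends = ends[::2]

print(pieces)     # ['cold', ' ', 'weather', ' ', 'gives', ' ', 'me', ' ', 'cold']
print(ends)       # [4, 5, 12, 13, 18, 19, 21, 22, 26]
print(word_ends)  # [4, 12, 18, 21, 26]
```

Subtracting the word length from `word_ends[n]` gives the start, which is how `find_index` arrives at 22 for the fifth word.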

Answer 6

Try this function:

def find_index(s, n):
    words = s.split()
    start = 0
    for word in words[:n]:
        start += len(word) + 1  # +1 for the single space after each earlier word
    return (start, start + len(words[n]) - 1)  # end is inclusive
print(find_index('cold weather gives me cold', 4))

Output:

(22, 25)

Answer 7

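If you need to do this on a long string, it is better to use an iterator: generate each word's start and end position lazily with re.finditer, then pick the nth element of the iterator with islice.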

>>> import re
>>> from itertools import islice
>>>
>>> string = 'cold weather gives me cold'
>>> word_pos = ((m.group(), (m.start(), m.end() - 1)) for m in re.finditer(r'\S+', string))
>>>
>>> n = 4
>>> next(islice(word_pos, n, n + 1))
('cold', (22, 25))

Answer 8

You can combine re with a list comprehension:

import re
s = 'cold weather gives me cold'
new_s = re.findall(r'\w+|\s+', s)  # words and whitespace runs, in order
# pair each word with its start offset (the total length of everything before it)
l = [(a, sum(map(len, new_s[:i]))) for i, a in enumerate(new_s) if not a.isspace()]

tag = ['O','O','O','O','disease']
result = [[c, c+len(d)] for a, (d, c) in zip(tag, l) if a == 'disease']
print(result)

Output:

[[22, 26]]
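The answers in this thread report the end position in two ways: some give 25 (the index of the last character, inclusive) and some give 26 (one past the last character, exclusive, as Python slices expect). A short sketch of the relationship, with variable names chosen for illustration:

```python
s = 'cold weather gives me cold'

start, end_exclusive = 22, 26
end_inclusive = end_exclusive - 1  # 25

# An exclusive end can be used in a slice directly.
print(s[start:end_exclusive])      # cold

# An inclusive end needs +1 when slicing.
print(s[start:end_inclusive + 1])  # cold
```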

8 answers

Answer 1
1 Accepted 2019-07-24 06:55:41

Answer 2
1 2019-07-24 07:07:48

Answer 3
0 2019-07-24 06:41:49

Answer 4
0 2019-07-24 06:42:04

Answer 5
0 2019-07-24 06:46:00

Answer 6
0 2019-07-24 07:10:51

Answer 7
0 2019-07-24 07:34:55

Answer 8
0 2019-07-24 14:31:49
