Python - Index in a string to the matching word

Question

I'm looking for an efficient way to convert an index into a string to the word that the index falls in.

For example, if this is my string:

This is a very stupid string

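and the index I'm given is, say, 10, then the output should be very. Likewise, if the index is 11, 12, or 13, the output should be very.

You can assume that words are always separated by exactly one space. Doing this with a for loop or something similar isn't hard; the question is whether there is a more efficient way (my text is large and I have many indices to convert to words).

For example, given the indices 10, 13, and 16, the output should be: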

10 very
13 very
16 stupid

Any help would be appreciated!

Answer 1

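The following should perform well. First use split to get the words in the string, then use enumerate in a list comprehension to find the indices where they start:

import numpy as np

s = 'This is a very stupid string'
words = s.split()
# ['This', 'is', 'a', 'very', 'stupid', 'string']
# Indices where every word after the first begins (one past each space)
ix_start_word = [i + 1 for i, ch in enumerate(s) if ch == ' ']
# [5, 8, 10, 15, 22]

Now you can use NumPy's np.searchsorted to get the word for a given index ix. Pass side='right' so that an index that is exactly the start of a word (such as 10) maps to that word rather than to the one before it:

words[np.searchsorted(ix_start_word, ix, side='right')]

Checking with the example above:

words[np.searchsorted(ix_start_word, 10, side='right')]
# 'very'

words[np.searchsorted(ix_start_word, 11, side='right')]
# 'very'

words[np.searchsorted(ix_start_word, 13, side='right')]
# 'very'

words[np.searchsorted(ix_start_word, 16, side='right')]
# 'stupid'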
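
Since there are many indices to convert, it may help that np.searchsorted also accepts a whole sequence of indices in one call. The following is a minimal sketch under the question's assumption of exactly one space between words; the helper name words_at is made up for illustration, and side='right' is used so that an index on the first character of a word resolves to that word.

```python
import numpy as np

def words_at(text, indices):
    """Map each character index in `indices` to the word that contains it.

    Hypothetical helper; assumes words are separated by exactly one space.
    """
    words = text.split(' ')
    # Start offset of every word after the first: one past each space.
    starts = [i + 1 for i, ch in enumerate(text) if ch == ' ']
    # One binary search per index, all done inside a single NumPy call.
    positions = np.searchsorted(starts, indices, side='right')
    return [words[p] for p in positions]

s = 'This is a very stupid string'
for ix, word in zip([10, 13, 16], words_at(s, [10, 13, 16])):
    print(ix, word)
```

With this convention an index that lands on a space is assigned to the word before it.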

Answer 2

I'm not particularly proud of how clean this is, but I think it does the job:

from numpy import cumsum, array

sample = 'This is a very stupid string'

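words = sample.split(' ')
# Length of each word plus the space that follows it
# (this also reserves one extra slot after the last word)
lens = [len(w) + 1 for w in words]

# Cumulative sums give the exclusive end offset of each word's span
ends = cumsum(lens)
# Each span starts where the previous one ended
starts = array([0] + list(ends[:-1]))

# Map every index in each span to its word
output = {}
for a, b, c in zip(starts, ends, words):
    for i in range(a, b):
        output[i] = c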
for a, b in output.items():
    print(a, b)

0 This
1 This
2 This
3 This
4 This
5 is
6 is
7 is
8 a
9 a
10 very
11 very
12 very
13 very
14 very
15 stupid
16 stupid
17 stupid
18 stupid
19 stupid
20 stupid
21 stupid
22 string
23 string
24 string
25 string
26 string
27 string
28 string

Answer 3

This isn't very efficient because it uses regular expressions, but it is one way to solve the problem without writing any loops.

import re

def stuff(pos):
    x = "This is a very stupid string"
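    # Matches from pos to the end of the word at (or next after) pos
    pattern = re.compile(r'\w+\b')
    # Greedy .* leaves the capture group on the last word before endpos
    pattern2 = re.compile(r'.*(\b\w+)')
    end = pattern.search(x, pos=pos).span()[1]
    print(pattern2.search(x, endpos=end).groups()[0])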

stuff(2)
stuff(10)
stuff(11)
stuff(16)

Result:

This
very
very
stupid
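
A loop-free lookup is also possible without regular expressions by searching for the nearest space on each side of the index with str.rfind and str.find. This is a sketch rather than a drop-in replacement for the function above: word_at is a made-up name, it assumes single spaces between words, and an index that falls on a space resolves to the word before it (the regex version resolves it to the word after).

```python
def word_at(text, pos):
    """Return the space-delimited word containing character `pos` of text.

    Hypothetical helper; assumes words are separated by exactly one space.
    """
    # Last space strictly before pos; rfind returns -1 if there is none,
    # so adding 1 yields 0, the start of the first word.
    start = text.rfind(' ', 0, pos) + 1
    # First space at or after pos; -1 means the word runs to the end.
    end = text.find(' ', pos)
    if end == -1:
        end = len(text)
    return text[start:end]

x = "This is a very stupid string"
for pos in (2, 10, 11, 16):
    print(word_at(x, pos))
```

Each call scans only as far as the neighbouring spaces, so it needs no preprocessing, which may suit a handful of lookups better than building an index first.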

3 answers

Answer 1: 1 (accepted), 2020-02-24 16:21:32
Answer 2: 0, 2020-02-24 16:19:51
Answer 3: 0, 2020-02-24 16:26:25
