Python - Index in a string to the matching word

Question

I'm looking for an efficient way to convert an index into a string to the word that the index falls in.

For example, if this is my string:

This is a very stupid string

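Say the index I get is 10; the output should then be very. Likewise, if the index is 11, 12 or 13, the output should be very.

You can assume the words are always separated by exactly one space. Doing this with a for loop or something similar isn't hard; the question is whether there is a more efficient way (my text is large and I have many indices to convert to words).

For example, given the indices 10, 13 and 16, the output should be: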

10 very
13 very
16 stupid

Any help would be appreciated!

Answer 1

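The following should perform well. First use split to get the words in the string, then use enumerate in a list comprehension to find the indices where they start:

import numpy as np

s = 'This is a very stupid string'
words = s.split()
# ['This', 'is', 'a', 'very', 'stupid', 'string']
# Obtain the indices where all words after the first begin
ix_start_word = [i+1 for i, ch in enumerate(s) if ch == ' ']
# [5, 8, 10, 15, 22]

Now you can use NumPy's np.searchsorted to get the word for a given index. Pass side='right' so that an index equal to a word's start (such as 10) maps to that word rather than to the previous one:

# ix is the character index to look up
words[np.searchsorted(ix_start_word, ix, side='right')]

Checking with the example above:

words[np.searchsorted(ix_start_word, 10, side='right')]
# 'very'

words[np.searchsorted(ix_start_word, 13, side='right')]
# 'very'

words[np.searchsorted(ix_start_word, 16, side='right')]
# 'stupid'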
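
If NumPy isn't available, the standard library's bisect module can do the same binary search. This is only a sketch: the helper names build_starts and word_at are made up for illustration, and it assumes words are separated by exactly one space, as the question states. bisect_right counts how many word starts are less than or equal to the index, which is exactly the position of the word in the list.

```python
from bisect import bisect_right

def build_starts(s):
    """Indices where every word after the first begins (one past each space)."""
    return [i + 1 for i, ch in enumerate(s) if ch == ' ']

def word_at(words, starts, ix):
    """Word containing character index ix; a space counts toward the word before it."""
    return words[bisect_right(starts, ix)]

s = 'This is a very stupid string'
words = s.split(' ')
starts = build_starts(s)   # built once, reused for every lookup
for ix in (10, 13, 16):
    print(ix, word_at(words, starts, ix))
```

Building the list of starts is a single pass over the text; each lookup afterwards is a binary search over the word starts.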

Answer 2

I'm not particularly proud of how clean this is, but I think it does the job:

from numpy import cumsum, array

sample = 'This is a very stupid string'

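words = sample.split(' ')
# Length of each word plus the single space that follows it
lens = [len(w) + 1 for w in words]

# ends[k] is one past the last index covered by word k; starts[k] is its first index
ends = cumsum(lens)
starts = array([0] + list(ends[:-1]))

# Map every character index to the word that covers it
output = {}
for a, b, c in zip(starts, ends, words):
    for i in range(a, b):
        output[i] = c
for a, b in output.items():
    print(a, b)

Output:
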
0 This
1 This
2 This
3 This
4 This
5 is
6 is
7 is
8 a
9 a
10 very
11 very
12 very
13 very
14 very
15 stupid
16 stupid
17 stupid
18 stupid
19 stupid
20 stupid
21 stupid
22 string
23 string
24 string
25 string
26 string
27 string
28 string
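
The same precomputed-table idea can also be sketched without NumPy, using a flat list indexed by character position instead of a dict. The name build_lookup is hypothetical, and single-space separation is assumed as in the question. Every entry is a reference to one of the word strings, so the table costs roughly one list slot per character of the text.

```python
def build_lookup(s):
    """One list entry per character of s, naming the word that character belongs to."""
    lookup = []
    for word in s.split(' '):
        # Each word covers its own characters plus the single space that follows it.
        lookup.extend([word] * (len(word) + 1))
    # The last word has no trailing space, so trim the table to the string's length.
    return lookup[:len(s)]

sample = 'This is a very stupid string'
lookup = build_lookup(sample)
for ix in (10, 13, 16):
    print(ix, lookup[ix])
```

After the one-time pass over the text, each lookup is a plain list index.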

Answer 3

This isn't very efficient because it uses regular expressions, but it is a way to solve the problem without using any loops.

import re

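def stuff(pos):
    x = "This is a very stupid string"
    # Matches the rest of the word at pos (or the next word, if pos is a space)
    pattern = re.compile(r'\w+\b')
    # Greedy .* pushes the captured group to the last whole word before endpos
    pattern2 = re.compile(r'.*(\b\w+)')
    end = pattern.search(x, pos=pos).span()[1]
    print(pattern2.search(x, endpos=end).groups()[0])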

stuff(2)
stuff(10)
stuff(11)
stuff(16)

Result:

This
very
very
stupid
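
Another loop-free option that avoids regular expressions is to look for the nearest space on each side of the index with str.rfind and str.find. This is a sketch rather than a tuned solution: the function name word_around is made up, and it relies on the single-space assumption from the question. An index that lands on a space is attributed to the word before it.

```python
def word_around(s, ix):
    """Return the word of s that contains character index ix."""
    # Last space strictly before ix; rfind returns -1 if there is none, so start is 0.
    start = s.rfind(' ', 0, ix) + 1
    # First space at or after ix; if there is none, the word runs to the end.
    end = s.find(' ', ix)
    if end == -1:
        end = len(s)
    return s[start:end]

x = "This is a very stupid string"
for pos in (2, 10, 11, 16):
    print(word_around(x, pos))
```

No preprocessing is needed here; each call only scans as far as the neighbouring spaces, so the cost depends on the length of the word rather than on the length of the text.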

3 solutions

Solution 1
1 Accepted 2020-02-24 16:21:32

Solution 2
0 2020-02-24 16:19:51

Solution 3
0 2020-02-24 16:26:25
