Python - index in string to matching word

I'm looking for an efficient way to convert an index in a string to the word that contains that index.

For example, if this is my string:

This is a very stupid string

and the index I get is, let's say, 10, then the output should be very. Likewise, if the index is 11, 12 or 13, the output should be very.

One can assume that the words are always separated by exactly one space. Doing it with a for loop is not hard; the question is whether there is a more efficient way (my text is HUGE and I have MANY indices to convert to words).

For the example, let the indices be 10, 13 and 16; the output should then be:

10 very
13 very
16 stupid

Any help would be appreciated!

The following should perform quite well. Begin by obtaining the words in the string using split, then find the indices where they begin using enumerate and a list comprehension:

import numpy as np

s = 'This is a very stupid string'
words = s.split()
# ['This', 'is', 'a', 'very', 'stupid', 'string']
# Obtain the indices where all words (except the first) begin
ix_start_word = [i+1 for i, ch in enumerate(s) if ch == ' ']
# [5, 8, 10, 15, 22]

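And now you could use NumPy's np.searchsorted to obtain a word given an index. Pass side='right' so that an index falling exactly on the first letter of a word maps to that word rather than to the previous one (with the default side='left', index 10 would return 'a'):

words[np.searchsorted(ix_start_word, ix, side='right')]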

Checking with the examples above:

words[np.searchsorted(ix_start_word, 10, side='right')]
# 'very'

words[np.searchsorted(ix_start_word, 13, side='right')]
# 'very'

words[np.searchsorted(ix_start_word, 16, side='right')]
# 'stupid'
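
Since the question mentions MANY indices, it may help that np.searchsorted also accepts a whole sequence of indices in one call. The following is a minimal sketch, assuming words are separated by exactly one space as stated in the question; words_at is a hypothetical helper name used only for illustration, not part of NumPy:

```python
import numpy as np

s = 'This is a very stupid string'
words = s.split(' ')
# Start offset of every word except the first: one past each space.
starts = np.array([i + 1 for i, ch in enumerate(s) if ch == ' '])

def words_at(indices):
    """Return the word covering each character index (hypothetical helper)."""
    # side='right' counts the word starts that are <= index, which is
    # the position of the covering word in `words`.
    positions = np.searchsorted(starts, indices, side='right')
    return [words[p] for p in positions]

print(words_at([10, 13, 16]))
# ['very', 'very', 'stupid']
```

The word starts are computed once, so each extra index only costs one binary search. An index that points at a space is reported as the word before that space.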

I'm not particularly proud of how clean this is, but I think it does the trick:

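from numpy import cumsum, array

sample = 'This is a very stupid string'

words = sample.split(' ')
# Length of each word plus the space that follows it
lens = [len(w) + 1 for w in words]

# Each word covers the half-open index range [start, end)
ends = cumsum(lens)
starts = array([0] + list(ends[:-1]))

# Map every index to its word. The last range runs one past the end
# of the string, because the final word has no trailing space.
output = {}
for a, b, c in zip(starts, ends, words):
    for i in range(a, b):
        output[i] = c
for a, b in output.items():
    print(a, b)

Output:

0 This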
1 This
2 This
3 This
4 This
5 is
6 is
7 is
8 a
9 a
10 very
11 very
12 very
13 very
14 very
15 stupid
16 stupid
17 stupid
18 stupid
19 stupid
20 stupid
21 stupid
22 string
23 string
24 string
25 string
26 string
27 string
28 string
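
The same idea of precomputing one entry per character can also be written without the nested loop. This is only a sketch under the question's single-space assumption; it swaps the dictionary for an integer array built with np.repeat, and word_id and word_at are names made up for this example:

```python
import numpy as np

sample = 'This is a very stupid string'
words = sample.split(' ')

# Each word owns its own characters plus the space that follows it;
# the last word has no trailing space, so it gets one entry fewer.
lens = [len(w) + 1 for w in words]
lens[-1] -= 1

# word_id[i] is the position in `words` of the word covering character i.
word_id = np.repeat(np.arange(len(words)), lens)

def word_at(i):
    """Look up the word covering character index i (hypothetical helper)."""
    return words[word_id[i]]

for i in (10, 13, 16):
    print(i, word_at(i))
# 10 very
# 13 very
# 16 stupid
```

The table holds one small integer per character instead of one dictionary entry per character, which may matter for a very large text. Whether it is actually faster for a given workload is something to measure.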

This isn't very efficient, because it uses regular expressions, but it is one way to solve the problem without writing any explicit loops.

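import re

x = "This is a very stupid string"
# Matches the remainder of the word found at or after the start position
pattern = re.compile(r'\w+\b')
# Greedy prefix, then the last whole word before endpos
pattern2 = re.compile(r'.*(\b\w+)')

def stuff(pos):
    # Find where the word containing pos ends...
    end = pattern.search(x, pos=pos).span()[1]
    # ...then capture the last word that finishes at that point
    print(pattern2.search(x, endpos=end).groups()[0])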

stuff(2)
stuff(10)
stuff(11)
stuff(16)

Results:

This
very
very
stupid
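
For comparison, a single lookup can also be done without regular expressions by searching for the nearest space on each side of the index. This is a sketch that assumes single-space separation as in the question; word_at is a hypothetical name, and it relies only on the built-in str.rfind and str.find:

```python
def word_at(text, pos):
    """Return the word of `text` that covers character index `pos`."""
    # rfind returns -1 when there is no space before pos, so +1 gives 0.
    start = text.rfind(' ', 0, pos) + 1
    # find returns -1 when there is no space at or after pos.
    end = text.find(' ', pos)
    if end == -1:
        end = len(text)
    return text[start:end]

x = "This is a very stupid string"
for pos in (2, 10, 11, 16):
    print(word_at(x, pos))
# This
# very
# very
# stupid
```

If pos points at a space, this version returns the word before the space, whereas the regex version above returns the word after it.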

 