Get the indices of numbers in a string and extract the words before and after each number (different languages)

Question

I tried a regex and it finds the numbers, but instead of an index for the entire number I only get the index of the first character of each number.

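import re

text = "४०० pounds of wheat at $ 3 per pound"
numero = re.finditer(r"(\d+)", text)  # iterator of match objects
op = re.findall(r"(\d+)", text)       # list of the matched number strings

indices = [m.start() for m in numero]  # character offset where each match starts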
OUTPUT

[0, 25]

Expected OUTPUT
[0, 6]

Once the exact indices are found and stored in a list, I believe it would be easier to extract the words. What do you think?

Also, I expect the words to be at different positions, so a static approach will not work.

Answer 1

You tagged the question with the nlp tag and it is a Python question, so why not use spaCy?

See a Python demo with spaCy 3.0.1:

import spacy
nlp = spacy.load("en_core_web_trf")
text = "४०० pounds of wheat at $ 3 per pound"
doc = nlp(text)
print([(token.text, token.i) for token in doc if token.is_alpha])
## => [('pounds', 1), ('of', 2), ('wheat', 3), ('at', 4), ('per', 7), ('pound', 8)]
print([(token.text, token.i) for token in doc if token.like_num])
## => [('४००', 0), ('3', 6)]

Here,

nlp is the spaCy pipeline object, initialized with the English "big" (transformer-based) model en_core_web_trf
doc is the spaCy document created from your text variable
[(token.text, token.i) for token in doc if token.is_alpha] gets you a list of alphabetic words with their text (token.text) and their positions in the document (token.i)
[(token.text, token.i) for token in doc if token.like_num] fetches the list of numbers with their positions inside the document. Note that token.i is the token index, which is what you expect ([0, 6]); token.idx would give the character offset instead.
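
To make the difference between the two kinds of index concrete, here is a minimal standard-library sketch. It is not spaCy's tokenizer: it assumes that every run of non-whitespace characters counts as one token, and the variable names (tokens, numbers) are only illustrative.

```python
import re

text = "४०० pounds of wheat at $ 3 per pound"

# Assumption: a token is any run of non-whitespace characters.
# spaCy's tokenizer is more sophisticated than this.
tokens = list(re.finditer(r"\S+", text))

# For every all-decimal token keep: its text, its token index,
# and the character offset where it starts in the original string.
numbers = [
    (m.group(), i, m.start())
    for i, m in enumerate(tokens)
    if m.group().isdecimal()
]

print(numbers)
# => [('४००', 0, 0), ('3', 6, 25)]
```

The second field is the token index from the expected output, and the third is the character offset that m.start() returned in the question.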

Answer 2

You can tokenize it and build your logic that way. Try this:


number_index = []
text = "४०० pounds of wheat at $ 3 per pound"
text_list = text.split(" ")

# Find which words are integers.
for index, word in enumerate(text_list):
    try:
        int(word)  # int() also accepts non-ASCII decimal digits such as "४००"
        number_index.append(index)
    except ValueError:
        # Not an integer, skip it
        pass

# Now perform operations on those integers
for i in number_index:
    word = text_list[i]
    # do operations and put it back in the list

# Re-build string afterwards
new_text = " ".join(text_list)
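
Building on the same split-based idea, the words before and after each number could be collected as sketched below. The function name words_around_numbers is hypothetical, and using None for a missing neighbour at either end of the string is an assumption, not part of the original answer.

```python
def words_around_numbers(text):
    """Return (number, previous word, next word) for every integer token."""
    words = text.split()
    results = []
    for i, word in enumerate(words):
        try:
            int(word)  # same integer test as above
        except ValueError:
            continue
        # Use None when the number is the first or last word.
        before = words[i - 1] if i > 0 else None
        after = words[i + 1] if i + 1 < len(words) else None
        results.append((word, before, after))
    return results


print(words_around_numbers("४०० pounds of wheat at $ 3 per pound"))
# => [('४००', None, 'pounds'), ('3', '$', 'per')]
```

Because the neighbours are looked up relative to each number's position, this does not depend on the numbers appearing at fixed places in the sentence.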

Answer 1 (accepted, score 1), 2021-03-14 22:33:56

Answer 2 (score 0), 2021-03-14 03:28:58
