
How can I get words after and before a specific token?

I am currently working on a project that builds basic corpus databases and tokenizes texts, but I am stuck on one problem. Assume we have the following:

import os, re

texts = []

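for name in os.listdir(somedir): # somedir contains text files which contain very large plain texts.
    # os.listdir returns bare file names, so join each one with the directory.
    with open(os.path.join(somedir, name), 'r') as f:
        texts.append(f.read())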

Now I want to find the words before and after a token.

myToken = 'blue'
found = []
for i in texts:
    # Raw string so that \. reaches the regex engine unchanged.
    fnd = re.findall(r'[a-zA-Z0-9]+ %s [a-zA-Z0-9]+|\. %s [a-zA-Z0-9]+|[a-zA-Z0-9]+ %s\.' %(myToken, myToken, myToken), i, re.IGNORECASE|re.UNICODE)
    found.extend(fnd)

print(myToken)
for i in found:
    print('\t\t%s' %(i))

I assumed there were three possibilities: the token might start a sentence, end a sentence, or appear somewhere in the middle, so I used the regex above. When I run it, I get results like these:

blue
    My blue car # What I exactly want.
    he blue jac # That's not what I want. That must be "the blue jacket."
    eir blue phone # Wrong! > their
    a blue ali # Wrong! > alien
    . Blue is # Okay.
    is blue. # Okay.
    ...

I also tried \b\w\b and \b\W\b variants, but unfortunately those returned no results at all rather than wrong ones. I tried:

'\b\w\b%s\b[a-zA-Z0-9]+|\.\b%s\b\w\b|\b\w\b%s\.'
'\b\W\b%s\b[a-zA-Z0-9]+|\.\b%s\b\W\b|\b\W\b%s\.'

I hope the question is not too vague.

Let's say the token is test.

        (?=^test\s+.*|.*?\s+test\s+.*?|.*?\s+test$).*

You can use a lookahead. It does not consume any characters, yet it still validates the match.

http://regex101.com/r/wK1nZ1/2
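As a rough sketch of how the lookahead pattern above might be used from Python: the sample text and the variable names are made up for illustration, and the `re.MULTILINE` flag is an assumption so that `^` and `$` apply per line. Note that this pattern returns the whole line containing the token as a whitespace-delimited word, not just the neighbouring words.

```python
import re

token = "test"  # illustrative token, as in the answer above

# The lookahead checks the three cases (start, middle, end of line)
# without consuming anything; the trailing .* then captures the line.
pattern = re.compile(
    r"(?=^{0}\s+.*|.*?\s+{0}\s+.*?|.*?\s+{0}$).*".format(re.escape(token)),
    re.MULTILINE,  # assumption: treat each line separately
)

# Hypothetical sample input.
text = (
    "test at the start\n"
    "in the test middle\n"
    "at the end test\n"
    "no match here\n"
    "contest entry"
)

# Drop empty matches, which can occur just before a line that starts with the token.
matches = [m for m in pattern.findall(text) if m]
print(matches)
# ['test at the start', 'in the test middle', 'at the end test']
```

The last line, "contest entry", is skipped because the pattern requires whitespace (or the line start) immediately before the token.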

I think what you want is:

  1. (Optionally) a word and a space;
  2. (Always) 'blue';
  3. (Optionally) a space and a word.

Therefore, one appropriate regex would be:

r'(?i)((?:\w+\s)?blue(?:\s\w+)?)'

For example:

>>> import re
>>> text = """My blue car
the blue jacket
their blue phone
a blue alien
End sentence. Blue is
is blue."""
>>> re.findall(r'(?i)((?:\w+\s)?{0}(?:\s\w+)?)'.format('blue'), text)
['My blue car', 'the blue jacket', 'their blue phone', 'a blue alien', 'Blue is', 'is blue']

See the demo and token-by-token explanation here.
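A possible variation on the regex above, wrapped in a helper: the function name `neighbours` is hypothetical, and two things are added that the answer does not include, namely `re.escape` (so tokens containing regex metacharacters are matched literally) and `\b` word boundaries (so that 'blue' inside 'blueberry' is not counted). The neighbours are captured in separate groups rather than as one string.

```python
import re

def neighbours(token, text):
    """Return (before, token_as_found, after) tuples; a missing neighbour is ''."""
    pattern = re.compile(
        # Optional "word + whitespace", the token as a whole word,
        # then optional "whitespace + word".
        r"(?:(\w+)\s)?\b({0})\b(?:\s(\w+))?".format(re.escape(token)),
        re.IGNORECASE,
    )
    return pattern.findall(text)

# Hypothetical sample input.
text = "My blue car. Blue is nice, blueberry is not blue."
result = neighbours("blue", text)
print(result)
# [('My', 'blue', 'car'), ('', 'Blue', 'is'), ('not', 'blue', '')]
```

As in the original pattern, punctuation between a word and the token (such as the period before "Blue") means that word is not reported as a neighbour.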

Regex can sometimes be slow (if not implemented correctly), and moreover the accepted answer did not work for me in several cases.

So I went for a brute-force solution (not saying it is the best one), where the keyword can consist of several words:

@staticmethod
def find_neighbours(word, sentence):
    prepost_map = []

    if word not in sentence:
        return prepost_map

    split_sentence = sentence.split(word)
    for i in range(0, len(split_sentence) - 1):
        prefix = ""
        postfix = ""

        prefix_list = split_sentence[i].split()
        postfix_list = split_sentence[i + 1].split()

        if len(prefix_list) > 0:
            prefix = prefix_list[-1]

        if len(postfix_list) > 0:
            postfix = postfix_list[0]

        prepost_map.append([prefix, word, postfix])

    return prepost_map

An empty string before or after the keyword indicates that the keyword was the first or the last word in the sentence, respectively.
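One caveat with splitting on the keyword as a substring is that it also splits inside longer words (for example 'blue' inside 'blueberry'). A minimal sketch of a token-based variation follows; it is not the code above, and the name `find_neighbours_by_tokens` is hypothetical. It compares whole whitespace-separated words, is case-sensitive, and does not strip punctuation attached to words.

```python
def find_neighbours_by_tokens(keyword, sentence):
    """Return [before, keyword, after] lists, comparing whole words only."""
    words = sentence.split()
    key = keyword.split()  # the keyword may consist of several words
    n = len(key)
    if n == 0:
        return []

    results = []
    for i in range(len(words) - n + 1):
        if words[i:i + n] == key:
            # Empty string when the keyword is first or last in the sentence.
            before = words[i - 1] if i > 0 else ""
            after = words[i + n] if i + n < len(words) else ""
            results.append([before, keyword, after])
    return results

# Hypothetical sample input.
sentence = "the dark blue car passed a blueberry stand near the dark blue sea"
result = find_neighbours_by_tokens("dark blue", sentence)
print(result)
# [['the', 'dark blue', 'car'], ['the', 'dark blue', 'sea']]
```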


