How do I write a correct NLTK regular expression tokenizer in Python?

Question

I want to implement a regular expression tokenizer with NLTK in Python, but I have the following problems. I used this page to write my regular expression.

import string
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords

def preprocess(sentence):
    sentence = sentence.lower()
    pattern = r'''(?x)          # set flag to allow verbose regexps
        (?:[A-Z]\.)+        # abbreviations, e.g. U.S.A.
      | \w+(?:-\w+)*        # words with optional internal hyphens
      | \$?\d+(?:\.\d+)?%?  # numbers with optional $, decimals and %
      | \$?\d+%?
      | /m+(?:[-'/]\w+)*    # identifiers such as /m/0987hf
    '''
    tokenizer = RegexpTokenizer(pattern)
    tokens = tokenizer.tokenize(sentence)
    print(tokens)

text = 'i have one 98% 0.78 gener-alized 22 rule /m/0987hf /m/08876 i nees packages'
preprocess(text)

I got this:

['i', 'have', 'one', '98', '0', '78', 'gener-alized', '22', 'rule', '/m/0987hf', '/m/08876', 'i', 'nees', 'packages']

I want this result:

['i', 'have', 'one', '98%', '0.78', 'gener-alized', '22', 'rule', '/m/0987hf', '/m/08876', 'i', 'nees', 'packages']

Also, if I want to remove digits, what should I write in the regular expression?

Answer 1

Be aware that \w was designed to parse identifiers in programming languages (I guess) and therefore includes digits.
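A quick check with the standard `re` module illustrates the point. The sample string is made up for this sketch.

```python
import re

# \w matches letters, digits and the underscore, but not punctuation.
sample = "abc 123 a_b x-y 98%"
word_runs = re.findall(r"\w+", sample)
print(word_runs)  # ['abc', '123', 'a_b', 'x', 'y', '98']
```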

You should also be aware that order matters in a list of alternatives. The most specific ones should go first, followed by the more general ones.

In your example, the second alternative in the pattern, \w+(?:-\w+)*, already matches "98" in "98%" and "0" in "0.78". After these fragments have matched, no remaining alternative can match "%" or the dot in ".78", so the tokeniser skips them as token separators.
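The effect can be reproduced with a reduced two-alternative pattern. This is a minimal sketch using `re.findall` from the standard library rather than NLTK; the variable names are chosen for illustration only.

```python
import re

text = "98% 0.78"

# General alternative first: \w+ consumes the digits,
# leaving "%" and "." unmatched.
general_first = re.findall(r"\w+|\d+(?:\.\d+)?%?", text)

# Specific alternative first: the number pattern gets the chance
# to consume the "%" and the ".78" as well.
specific_first = re.findall(r"\d+(?:\.\d+)?%?|\w+", text)

print(general_first)   # ['98', '0', '78']
print(specific_first)  # ['98%', '0.78']
```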

So, in this case, you should put the number-related subpatterns before the one with \w, otherwise it will "steal away" digit matches.
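One possible reordering of the pattern from the question is sketched below. To keep it runnable without NLTK it calls `re.findall` directly, on the assumption that `RegexpTokenizer(pattern).tokenize(text)` with its default settings returns the same matches; that equivalence is worth verifying against your NLTK version. The redundant `\$?\d+%?` alternative is dropped because the fuller number pattern already covers it.

```python
import re

# Specific alternatives first, the general \w alternative last.
pattern = r"""(?x)              # verbose mode: whitespace and comments are ignored
      /m+(?:[-'/]\w+)*          # identifiers such as /m/0987hf
    | \$?\d+(?:\.\d+)?%?        # numbers with optional $, decimals and %
    | (?:[A-Z]\.)+              # abbreviations, e.g. U.S.A.
    | \w+(?:-\w+)*              # words with optional internal hyphens
"""

text = ("i have one 98% 0.78 gener-alized 22 rule "
        "/m/0987hf /m/08876 i nees packages")
tokens = re.findall(pattern, text)
print(tokens)
```

With this ordering, "98%" and "0.78" come out as single tokens. Note that lowercasing the sentence first, as the question's code does, means the `[A-Z]\.` alternative can never match.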

Unfortunately, there is no character-class shortcut for alphabetic characters only (like \d for digits only). I have been using [^\W\d_], which means "all characters, except for the ones that are not in \w or that are in \d or the underscore", which is the same as "all characters from \w, but without \d and without underscores". It's not an easily interpretable expression, however.

(Of course you can use [A-Za-z] if you think it's okay to tokenise "Naïve" into ["Na", "ve"].)
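The difference between the two letter classes, and how a letters-only class drops digits, can be sketched as follows. This uses Python 3, where `str` patterns are Unicode-aware by default; the sample text and variable names are assumptions made for the illustration.

```python
import re

text = "Na\u00efve rule 22 gener-alized 0.78"  # "\u00ef" is the letter ï

letters = r"[^\W\d_]"  # word characters minus digits and underscore

# Letters-only words with optional internal hyphens; digits never match.
unicode_words = re.findall(rf"{letters}+(?:-{letters}+)*", text)

# ASCII-only alternative: splits the word at the non-ASCII letter.
ascii_words = re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*", text)

print(unicode_words)  # ['Naïve', 'rule', 'gener-alized']
print(ascii_words)    # ['Na', 've', 'rule', 'gener-alized']
```

Replacing \w with such a class in the word alternative, and leaving out the number alternatives, is one way to keep purely numeric tokens out of the result.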

Answer 1
1 2017-02-07 21:11:23
