How can I write a correct nltk regular expression tokenizer in Python?

I want to implement a regular expression tokenizer with nltk in Python, but I have the following problems. I used this page to write my regular expression.

import string
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords

def preprocess(sentence):
    sentence = sentence.lower()
    pattern = r'''(?x)          # set flag to allow verbose regexps
        (?:[A-Z]\.)+        # abbreviations, e.g. U.S.A.
      | \w+(?:-\w+)*        # words with optional internal hyphens
      | \$?\d+(?:\.\d+)?%?  # numbers, optionally with $ or %
      | \$?\d+%?
      | /m+(?:[-'/]\w+)*    # slash-separated identifiers, e.g. /m/0987hf
    '''
    tokenizer = RegexpTokenizer(pattern)
    tokens = tokenizer.tokenize(sentence)
    print(tokens)

text = 'i have one 98% 0.78 gener-alized 22 rule /m/0987hf /m/08876 i nees packages'
preprocess(text)

I got this:

['i', 'have', 'one', '98', '0', '78', 'gener-alized', '22', 'rule', '/m/0987hf', '/m/08876', 'i', 'nees', 'packages']

I want this result:

['i', 'have', 'one', '98%', '0.78', 'gener-alized', '22', 'rule', '/m/0987hf', '/m/08876', 'i', 'nees', 'packages']

Also, if I want to remove digits, what should I write in the regular expression?

Be aware that \w was designed to parse identifiers in programming languages (I guess) and therefore includes digits.

You should also be aware that order matters in a list of alternatives. The most specific ones should go first, followed by the more general ones.

In your example, the second alternative in the pattern, \w+(?:-\w+)*, already matches "98" in "98%" and "0" in "0.78". After these fragments have matched, no alternative can match "%" or the dot in ".78", so the tokeniser skips them as if they were token separators.
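To see the effect of ordering in isolation, here is a minimal sketch that uses only the standard re module instead of nltk. The two-alternative patterns and the names general_first and specific_first are made up for this illustration; they are cut-down versions of the pattern in the question.

```python
import re

text = "98% 0.78"

# General alternative first: \w+ consumes the digits, so "%" and "." are
# left over and nothing in the pattern can match them.
general_first = re.compile(r"\w+(?:-\w+)*|\$?\d+(?:\.\d+)?%?")

# Numeric alternative first: the number is matched together with its
# fractional part or trailing "%".
specific_first = re.compile(r"\$?\d+(?:\.\d+)?%?|\w+(?:-\w+)*")

print(general_first.findall(text))   # ['98', '0', '78']
print(specific_first.findall(text))  # ['98%', '0.78']
```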

So, in this case, you should put the number-related subpatterns before the one with \w, otherwise it will "steal away" the digit matches.
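A possible reordering of the pattern from the question is sketched below. It uses re.findall so that it runs without nltk; as far as I know, RegexpTokenizer(PATTERN).tokenize(text) with default settings returns the same matches, but treat that as an assumption and check it on your installation. The names PATTERN and tokenize are mine, and two further changes are my own choices rather than part of the advice above: the redundant \$?\d+%? alternative is dropped, and lowercasing happens after tokenising so that the [A-Z] abbreviation alternative can still match.

```python
import re

PATTERN = r"""(?x)               # verbose mode: whitespace and comments are ignored
      (?:[A-Z]\.)+              # abbreviations, e.g. U.S.A.
    | \$?\d+(?:\.\d+)?%?        # numbers, optionally with $ or %, before \w+
    | /m+(?:[-'/]\w+)*          # slash-separated identifiers, e.g. /m/0987hf
    | \w+(?:-\w+)*              # words with optional internal hyphens, last
"""

def tokenize(sentence):
    # Tokenise first, then lowercase each token.
    return [token.lower() for token in re.findall(PATTERN, sentence)]

text = 'i have one 98% 0.78 gener-alized 22 rule /m/0987hf /m/08876 i nees packages'
print(tokenize(text))
```

On the sample sentence this keeps '98%' and '0.78' as single tokens. Note that it cannot turn 'nees' into 'need'; spelling correction is outside what a tokenizer does.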

Unfortunately, there is no character-class shortcut for alphabetic characters only (like \d for digits only). I have been using [^\W\d_], which means "all characters, except for the ones that are not in \w, or that are in \d, or the underscore", which is the same as "all characters from \w, but without \d and without underscores". It's not an easily interpretable expression, however.

(Of course you can use [A-Za-z] if you think it's okay to tokenise "Naïve" into ["Na", "ve"].)
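The difference between the two letter classes can be checked with a short sketch like the following, again with the plain re module under Python 3, where \w is Unicode-aware for str patterns. The sample string and the names LETTERS and ASCII_LETTERS are invented for the illustration.

```python
import re

# Letters only: everything in \w except digits and the underscore.
LETTERS = re.compile(r"[^\W\d_]+")
# ASCII letters only: accented characters split the word.
ASCII_LETTERS = re.compile(r"[A-Za-z]+")

text = "Na\u00efve rule_22 abc123"  # "Naïve rule_22 abc123"

print(LETTERS.findall(text))        # ['Naïve', 'rule', 'abc']
print(ASCII_LETTERS.findall(text))  # ['Na', 've', 'rule', 'abc']
```

To drop digits from word tokens, one option is to replace \w in the word alternative with [^\W\d_], for example [^\W\d_]+(?:-[^\W\d_]+)*.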
