How can I write a correct nltk regular expression tokenizer in python?

I want to implement a regular expression tokenizer with nltk in Python, but I have the following problem. I used this page to write my regular expression.

import string
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords

def preprocess(sentence):
    sentence = sentence.lower()
    pattern = r'''(?x)          # set flag to allow verbose regexps
        (?:[A-Z]\.)+        # abbreviations, e.g. U.S.A.
      | \w+(?:-\w+)*        # words with optional internal hyphens
      | \$?\d+(?:\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
      | \$?\d+%?            # whole numbers with optional $ and %
      | /m+(?:[-'/]\w+)*    # identifiers such as /m/0987hf
    '''
    tokenizer = RegexpTokenizer(pattern)
    tokens = tokenizer.tokenize(sentence)
    print(tokens)

text = 'i have one 98% 0.78 gener-alized 22 rule /m/0987hf /m/08876 i nees packages'
preprocess(text)

I got this

['i', 'have', 'one', '98', '0', '78', 'gener-alized', '22', 'rule', '/m/0987hf', '/m/08876', 'i', 'nees', 'packages']

I want this result

['i', 'have', 'one', '98%', '0.78', 'gener-alized', '22', 'rule', '/m/0987hf', '/m/08876', 'i', 'nees', 'packages']

Also, if I want to remove digits what should I write in the regular expression?

Be aware that \w was presumably designed to match identifiers in programming languages and therefore includes digits (and the underscore).

You should also be aware that order matters in a list of alternatives. The most specific ones should go first, followed by the more general ones.

In your example, the second alternative in the pattern, \w+(?:-\w+)*, already matches "98" in "98%" and "0" in "0.78". After these fragments have matched, no alternative matches "%" or the dot before "78", so the tokeniser skips them as if they were token separators.
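The effect can be reproduced in isolation. The following is a minimal sketch that uses re.findall from the standard library instead of NLTK, with two shortened patterns that are not the ones from the question:

```python
import re

text = "98% 0.78"

# General alternative first: \w+ grabs the digits, then nothing matches "%" or ".".
general_first = re.findall(r"\w+|\d+(?:\.\d+)?%?", text)

# Specific alternative first: the number pattern can consume "%" and ".78".
specific_first = re.findall(r"\d+(?:\.\d+)?%?|\w+", text)

print(general_first)   # ['98', '0', '78']
print(specific_first)  # ['98%', '0.78']
```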

So, in this case, you should put the number-related subpatterns before the one with \w; otherwise it will "steal" the digit matches.
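A reordered pattern might look like the sketch below. It makes a few assumptions that are not in the question: it uses re.findall as a stand-in for RegexpTokenizer (the same pattern string should be usable with NLTK, but that is not verified here), it drops the redundant second number alternative, it changes the abbreviation class to [a-z] because the input is lowercased first, and the name tokenize is only illustrative.

```python
import re

# Number and /m/... alternatives come before the general \w alternative.
PATTERN = r"""(?x)               # verbose mode: whitespace and comments are ignored
      (?:[a-z]\.)+               # abbreviations, e.g. u.s.a. (input is lowercased)
    | \$?\d+(?:\.\d+)?%?         # numbers with optional $ prefix, decimals, % suffix
    | /m+(?:[-'/]\w+)*           # identifiers such as /m/0987hf
    | \w+(?:-\w+)*               # words with optional internal hyphens
"""

def tokenize(sentence):
    """Lowercase the sentence and return all non-overlapping matches of PATTERN."""
    return re.findall(PATTERN, sentence.lower())

tokens = tokenize("i have one 98% 0.78 gener-alized 22 rule /m/0987hf /m/08876 i nees packages")
print(tokens)
# ['i', 'have', 'one', '98%', '0.78', 'gener-alized', '22', 'rule',
#  '/m/0987hf', '/m/08876', 'i', 'nees', 'packages']
```

One side effect of this ordering is that a word starting with digits, such as "3d", would be split into "3" and "d", because the number alternative now wins.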

Unfortunately, there is no character-class shortcut for alphabetic characters only (like \d for digits only). I have been using [^\W\d_], which means "all characters, except for the ones that are not in \w, or that are in \d, or the underscore". That is the same as "all characters from \w, but without \d and without the underscore". It's not an easily interpretable expression, however.

(Of course you can use [A-Za-z] if you think it's okay to tokenise "Naïve" into ["Na", "ve"].)
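As a rough illustration of the difference, and of one way to leave digits out of the tokens, here is a small sketch with the standard-library re module on Python 3, where \w is Unicode-aware for str patterns. The sample text is made up for this example.

```python
import re

# "Na\u00efve" is "Naïve" written with a precomposed ï.
text = "Na\u00efve rule_1 has 22 tokens"

# Letters only: everything in \w except digits and the underscore.
letters_only = re.findall(r"[^\W\d_]+", text)

# ASCII letters only: splits words at any non-ASCII letter.
ascii_only = re.findall(r"[A-Za-z]+", text)

print(letters_only)  # ['Naïve', 'rule', 'has', 'tokens']
print(ascii_only)    # ['Na', 've', 'rule', 'has', 'tokens']
```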
