How can I write a correct nltk regular expression tokenizer in python?

Question

I want to implement a regular expression tokenizer with NLTK in Python, but I have run into the following problems. I used this page to write my regular expression.

import string
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords

def preprocess(sentence):
    sentence = sentence.lower()
    pattern = r'''(?x)          # set flag to allow verbose regexps
        (?:[A-Z]\.)+        # abbreviations, e.g. U.S.A.
      | \w+(?:-\w+)*        # words with optional internal hyphens
      | \$?\d+(?:\.\d+)?%?  # numbers with optional $, decimals or %
      | \$?\d+%?            # integers with optional $ or %
      | /m+(?:[-'/]\w+)*    # identifiers such as /m/0987hf
    '''
    tokenizer = RegexpTokenizer(pattern)
    tokens = tokenizer.tokenize(sentence)
    print(tokens)

text = 'i have one 98% 0.78 gener-alized 22 rule /m/0987hf /m/08876 i nees packages'
preprocess(text)

I got this:

['i', 'have', 'one', '98', '0', '78', 'gener-alized', '22', 'rule', '/m/0987hf', '/m/08876', 'i', 'nees', 'packages']

I want this result:

['i', 'have', 'one', '98%', '0.78', 'gener-alized', '22', 'rule', '/m/0987hf', '/m/08876', 'i', 'nees', 'packages']

Also, if I want to remove digits, what should I write in the regular expression?

Answer 1

Be aware that \w matches letters, digits and the underscore (I guess it was designed to parse identifiers in programming languages), so it also matches digits.

You should also be aware that order matters in a list of alternatives. The most specific ones should go first, followed by the more general ones.

In your example, the second alternative in the pattern, \w+(?:-\w+)*, already matches "98" in "98%" and "0" in "0.78". After these fragments have matched, no alternative can match "%" or the dot in ".78", so the tokeniser skips them as if they were token separators.
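The effect can be reproduced without NLTK. The following is a minimal sketch using only the standard `re` module, on the assumption that `RegexpTokenizer` in its default (non-gap) mode collects matches much like `findall` does; the shortened sample text and the name `word_first` are mine, not the asker's.

```python
import re

# The word alternative comes first, as in the question's pattern.
word_first = re.compile(r"\w+(?:-\w+)*|\$?\d+(?:\.\d+)?%?")

# \w+ consumes "98" and "0" before the number alternative is ever tried,
# and nothing in the pattern can start a match at "%" or ".".
tokens = word_first.findall("one 98% 0.78 rule")
print(tokens)  # ['one', '98', '0', '78', 'rule']
```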

So, in this case, you should put the number-related subpatterns before the one with \w; otherwise it will "steal" the digit matches.
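As a rough sketch of that reordering, here is the question's pattern with the alternatives sorted from most specific to most general. It uses plain `re` so it runs without NLTK; the same pattern string should also work when passed to `RegexpTokenizer`, though I have not checked every NLTK version. The `/m/...` alternative is my guess at the identifier format based on the sample sentence, and the abbreviation class is lowercased because the input is lowercased first.

```python
import re

# Alternatives ordered from most specific to most general.
PATTERN = re.compile(r"""(?x)
      /m(?:/\w+)+           # identifiers such as /m/0987hf (format assumed from the sample)
    | \$?\d+(?:\.\d+)?%?    # numbers with optional $, decimals or %: 98%, 0.78, 22
    | (?:[a-z]\.)+          # abbreviations such as u.s.a. (input is lowercased)
    | \w+(?:-\w+)*          # words with optional internal hyphens
""")

def preprocess(sentence):
    """Lowercase the sentence and return the list of matched tokens."""
    return PATTERN.findall(sentence.lower())

tokens = preprocess(
    "i have one 98% 0.78 gener-alized 22 rule /m/0987hf /m/08876 i nees packages"
)
print(tokens)
# ['i', 'have', 'one', '98%', '0.78', 'gener-alized', '22', 'rule',
#  '/m/0987hf', '/m/08876', 'i', 'nees', 'packages']
```

Note that with the number alternative ahead of the word alternative, a bare token such as "0987hf" (outside a /m/ identifier) would be split into "0987" and "hf"; whether that matters depends on your data.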

Unfortunately, there is no character-class shortcut for alphabetic characters only (like \d for digits only). I have been using [^\W\d_], which means "any character except those that are not in \w, those that are in \d, and the underscore", which is the same as "all characters from \w, but without \d and without underscores". It's not an easily interpretable expression, however.

(Of course you can use [A-Za-z] if you think it's okay to tokenise "Naïve" into ["Na", "ve"].)
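For the "remove digits" part of the question, one possible approach is to build the word alternative from the letters-only class, so that digits never end up in a token. This is only a sketch on a made-up sample string; the names `letters` and `ascii_only` are illustrative.

```python
import re

text = "Naïve rule_22 gener-alized 0.78"

# [^\W\d_] = word characters minus digits and the underscore, i.e. letters only.
letters = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")
print(letters.findall(text))     # ['Naïve', 'rule', 'gener-alized']

# The ASCII-only class splits words at any non-ASCII letter (and at hyphens here).
ascii_only = re.compile(r"[A-Za-z]+")
print(ascii_only.findall(text))  # ['Na', 've', 'rule', 'gener', 'alized']
```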


solution1
1 2017-02-07 21:11:23
