
How does the scikit-learn vectorizer handle punctuation?

I understand that:

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

has tools to deal with punctuation, namely:

token_pattern=r"(?u)\b\w\w+\b"

But how does it actually work? Can anybody provide a SIMPLE example, e.g. with grep or sed, that makes use of that regular expression? Thanks.

According to the docs:

Regular expression denoting what constitutes a “token”, only used if analyzer == 'word'. The default regexp select tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).
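Since the pattern is an ordinary regular expression (CountVectorizer compiles token_pattern with Python's re module internally), its behaviour can be reproduced directly with re. A minimal sketch of what the default tokenizer keeps and drops:

```python
import re

# The default token_pattern of CountVectorizer.
pattern = re.compile(r"(?u)\b\w\w+\b")

text = "Hello, world! It's a test: 42 apples & 7 oranges."

# Punctuation never appears in a match; single-character tokens
# ("s" from "It's", "a", "7", "&") are dropped because the pattern
# requires at least two word characters.
print(pattern.findall(text))
# → ['Hello', 'world', 'It', 'test', '42', 'apples', 'oranges']
```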

Explanation of the given regex

(?u) - the Unicode flag. It makes \w, \W, \b, \B, \d, \D, \s and \S perform matching with Unicode semantics. (In Python 3, patterns applied to str already use Unicode semantics by default, so this flag is redundant there.)
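A quick illustration of what the Unicode flag changes; the re.ASCII flag is used here to show the non-Unicode behaviour for contrast (this is a sketch with re, not sklearn's own code):

```python
import re

text = "naïve café 猫"

# With Unicode semantics, accented letters and CJK characters count
# as word characters ("猫" is a word character but only one, so the
# \w\w+ length requirement still drops it).
print(re.findall(r"(?u)\b\w\w+\b", text))
# → ['naïve', 'café']

# With ASCII semantics, "ï" and "é" act as token separators and the
# words get chopped apart.
print(re.findall(r"\b\w\w+\b", text, re.ASCII))
# → ['na', 've', 'caf']
```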

\b - Represents a word boundary. It matches the empty string at the start or end of a word, anchoring the match at word edges without consuming any character.

\w - Matches a single word character, i.e. [0-9a-zA-Z_] in ASCII mode; with (?u) it also matches Unicode letters and digits.

\w\w+ - Matches two or more word characters between word boundaries. Notice that the documentation clearly says the default pattern selects tokens of 2 or more alphanumeric characters. This is why the regex contains \w\w+ (at least two characters) rather than \w+ (at least one).
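The difference between \w+ and \w\w+ is easy to see side by side (a small sketch with re, not sklearn's own code):

```python
import re

text = "I am 7 years old"

# \w+ keeps every run of word characters, including single ones.
print(re.findall(r"(?u)\b\w+\b", text))
# → ['I', 'am', '7', 'years', 'old']

# \w\w+ requires at least two characters, so "I" and "7" are dropped.
print(re.findall(r"(?u)\b\w\w+\b", text))
# → ['am', 'years', 'old']
```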

Since the given regex matches only runs of two or more word characters (alphanumerics plus _), it discards all single-character tokens (such as I, 1, 2, etc.) as well as every punctuation symbol, which can only act as a separator between tokens.

You can find an implementation of the given regex using the grep command here.

This link might help with implementing (?u) in grep.
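In case those links are unavailable: with GNU grep the rough equivalent is `grep -oE '\b\w\w+\b'` (note that \b and \w in ERE are GNU extensions and may not work in other grep implementations). A Python sketch of that pipeline, emitting one match per line the way grep -o would:

```python
import re

# Rough stand-in for: grep -oE '\b\w\w+\b' file
pattern = re.compile(r"(?u)\b\w\w+\b")

def grep_like(text):
    """Return every match across all lines, in order, like grep -o prints them."""
    return [m for line in text.splitlines() for m in pattern.findall(line)]

print("\n".join(grep_like("Hello, world!\nI am #1 -- or am I?")))
# → Hello
#   world
#   am
#   or
#   am
```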
