How does the scikit-learn vectorizer handle punctuation?
I understand that:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
has tools to deal with punctuation, namely:
token_pattern = (?u)\\b\\w\\w+\\b
but how does it actually work? Can anybody provide a SIMPLE example, e.g. with grep or sed, that makes use of that regular expression? Thanks.
From the documentation: "Regular expression denoting what constitutes a 'token', only used if analyzer == 'word'. The default regexp selects tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator)."
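To see this concretely, the tokenization performed by the default pattern can be reproduced with Python's re module alone (a minimal sketch; note that CountVectorizer also lowercases the text before tokenizing):

```python
import re

# The default pattern from the docs; a raw string avoids double escaping
token_pattern = r"(?u)\b\w\w+\b"

text = "Hello, world!! Don't panic; it's fine."
tokens = re.findall(token_pattern, text.lower())
print(tokens)  # ['hello', 'world', 'don', 'panic', 'it', 'fine']
```

Every punctuation mark simply breaks the text into pieces, and any piece shorter than two word characters (like the "t" in "don't") never becomes a token.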
Explanation of the given regex
(?u) - represents Unicode. This will make \w, \W, \b, \B, \d, \D, \s and \S perform matching with Unicode semantics.
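A quick illustration: in Python 3, str patterns already use Unicode semantics, so (?u) is redundant there, but with it \w matches accented and other non-ASCII letters:

```python
import re

# With Unicode semantics, \w covers letters like é, à and ï,
# so accented words are kept as whole tokens.
matches = re.findall(r"(?u)\b\w\w+\b", "café déjà-vu naïve")
print(matches)  # ['café', 'déjà', 'vu', 'naïve']
```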
\b - represents a word boundary; it asserts a position in the string (between a word character and a non-word character, or the start/end of the string) without consuming any characters.
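A small sketch of how \b anchors a match to whole words rather than substrings:

```python
import re

# \b...\b matches "cat" only when it stands alone,
# not inside "catalog" or "concat".
hits = re.findall(r"\bcat\b", "cat catalog concat cat.")
print(hits)  # ['cat', 'cat']
```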
\w - matches a single word character, i.e. [0-9a-zA-Z_] (plus non-ASCII word characters when Unicode semantics are in effect).
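For instance, \w picks out letters, digits and the underscore while skipping spaces and punctuation:

```python
import re

# \w matches exactly one word character at a time
chars = re.findall(r"\w", "a_1 !?")
print(chars)  # ['a', '_', '1']
```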
\w+ - matches one or more word characters within the word boundaries. Notice that the documentation clearly states that the default pattern selects tokens of 2 or more alphanumeric characters. This is why the regex contains \w\w+ rather than just \w+.
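The difference between the two patterns is easy to see side by side: \w+ keeps single-character tokens, while \w\w+ requires at least two word characters.

```python
import re

s = "I am a 1-man army"
one_plus = re.findall(r"\w+", s)    # keeps 1-character tokens
two_plus = re.findall(r"\w\w+", s)  # requires at least 2 characters
print(one_plus)  # ['I', 'am', 'a', '1', 'man', 'army']
print(two_plus)  # ['am', 'man', 'army']
```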
Since the given regex matches only runs of two or more word characters ([0-9a-zA-Z_]), it discards all single-character tokens (such as I, 1, 2, etc.) as well as any punctuation symbol present.
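Putting it all together with CountVectorizer itself (a short sketch; the exact vocabulary depends only on the default token_pattern and lowercasing):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Punctuation acts purely as a separator; single-character
# tokens ("I", "s", "2", "1") never enter the vocabulary.
docs = ["I love NLP!", "It's 2-in-1: fun & useful."]
v = CountVectorizer()  # default token_pattern=r"(?u)\b\w\w+\b"
v.fit(docs)
print(sorted(v.vocabulary_))  # ['fun', 'in', 'it', 'love', 'nlp', 'useful']
```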
You can find an implementation of the given regex using the grep command here.
This link might help for implementing (?u) in grep.