How does the scikit-learn vectorizer handle punctuation?
I understand that:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
has tools to deal with punctuation, namely:
token_pattern = (?u)\\b\\w\\w+\\b
but how does it actually work? Can anybody provide a SIMPLE example, e.g. with grep or sed, that makes use of that regular expression? Thanks.
From the documentation: "Regular expression denoting what constitutes a 'token', only used if analyzer == 'word'. The default regexp selects tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator)."
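To see this concretely, the tokenization performed by the default pattern can be reproduced with Python's re module alone (a minimal sketch; note that CountVectorizer also lowercases the text before tokenizing):

```python
import re

# The default pattern from the docs; a raw string avoids double escaping
token_pattern = r"(?u)\b\w\w+\b"

text = "Hello, world!! Don't panic; it's fine."
tokens = re.findall(token_pattern, text.lower())
print(tokens)  # ['hello', 'world', 'don', 'panic', 'it', 'fine']
```

Every punctuation mark simply breaks the text into pieces, and any piece shorter than two word characters (like the "t" in "don't") never becomes a token.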
Explanation of the given regex
(?u) - represents Unicode. This will make \w, \W, \b, \B, \d, \D, \s and \S perform matching with Unicode semantics.
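A quick illustration: in Python 3, str patterns already use Unicode semantics, so (?u) is redundant there, but with it \w matches accented and other non-ASCII letters:

```python
import re

# With Unicode semantics, \w covers letters like é, à and ï,
# so accented words are kept as whole tokens.
matches = re.findall(r"(?u)\b\w\w+\b", "café déjà-vu naïve")
print(matches)  # ['café', 'déjà', 'vu', 'naïve']
```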
\b - represents a word boundary; it asserts a position in the string (between a word character and a non-word character, or the start/end of the string) without consuming any characters.
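A small sketch of how \b anchors a match to whole words rather than substrings:

```python
import re

# \b...\b matches "cat" only when it stands alone,
# not inside "catalog" or "concat".
hits = re.findall(r"\bcat\b", "cat catalog concat cat.")
print(hits)  # ['cat', 'cat']
```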
\w - matches a single word character, i.e. [0-9a-zA-Z_] (plus non-ASCII word characters when Unicode semantics are in effect).
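For instance, \w picks out letters, digits and the underscore while skipping spaces and punctuation:

```python
import re

# \w matches exactly one word character at a time
chars = re.findall(r"\w", "a_1 !?")
print(chars)  # ['a', '_', '1']
```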
\w+ - matches one or more word characters within the word boundaries. Notice that the documentation clearly states that the default pattern selects tokens of 2 or more alphanumeric characters. This is why the regex contains \w\w+ rather than just \w+.
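The difference between the two patterns is easy to see side by side: \w+ keeps single-character tokens, while \w\w+ requires at least two word characters.

```python
import re

s = "I am a 1-man army"
one_plus = re.findall(r"\w+", s)    # keeps 1-character tokens
two_plus = re.findall(r"\w\w+", s)  # requires at least 2 characters
print(one_plus)  # ['I', 'am', 'a', '1', 'man', 'army']
print(two_plus)  # ['am', 'man', 'army']
```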
Since the given regex matches only runs of two or more word characters ([0-9a-zA-Z_]), it discards all single-character tokens (such as I, 1, 2, etc.) as well as any punctuation symbol present.
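Putting it all together with CountVectorizer itself (a short sketch; the exact vocabulary depends only on the default token_pattern and lowercasing):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Punctuation acts purely as a separator; single-character
# tokens ("I", "s", "2", "1") never enter the vocabulary.
docs = ["I love NLP!", "It's 2-in-1: fun & useful."]
v = CountVectorizer()  # default token_pattern=r"(?u)\b\w\w+\b"
v.fit(docs)
print(sorted(v.vocabulary_))  # ['fun', 'in', 'it', 'love', 'nlp', 'useful']
```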
You can find an implementation of the given regex using the grep command here.
This link might help for implementing (?u) in grep.