
How does the scikit-learn vectorizer handle punctuation?

I understand that:

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

has tools to deal with punctuation, namely:

token_pattern = r"(?u)\b\w\w+\b"

but how does it actually work? Can anybody provide a SIMPLE example, e.g. with grep or sed, that makes use of that regular expression? Thanks.

According to the docs:

Regular expression denoting what constitutes a “token”, only used if analyzer == 'word'. The default regexp select tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).
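
As a rough illustration of the behaviour quoted above, the default pattern can be tried directly with Python's re module. This is only a sketch: the sample sentence is made up, and it assumes the vectorizer applies the pattern with the equivalent of re.findall after lowercasing the text, which is how I understand the default word analyzer to behave.

```python
import re

# Default token_pattern of CountVectorizer, written as a raw string.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

# Made-up sample text with punctuation and one-character words.
text = "Hello, world! I can't believe it's 5 o'clock."

# Lowercase first (assumed to mirror the vectorizer's default),
# then collect every non-overlapping match of the pattern.
tokens = TOKEN_PATTERN.findall(text.lower())
print(tokens)
# ['hello', 'world', 'can', 'believe', 'it', 'clock']
```

The commas, apostrophes, exclamation mark and full stop never appear in a token; they only split the text. The one-character pieces left over ("i", "t", "s", "5", "o") are too short to match and are dropped.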

Explanation of the given regex

(?u) - Turns on the Unicode flag, so that \w, \W, \b, \B, \d, \D, \s and \S match with Unicode semantics. In Python 3 this is already the default for str patterns, so the flag is effectively redundant there.

\b - Represents a word boundary: it asserts a position where a word character sits next to a non-word character or the start or end of the string.

\w - Matches a single word character, i.e. [0-9a-zA-Z_] in ASCII terms; with Unicode semantics it also matches letters and digits from other scripts.

\w+ - Matches one or more word characters. Note that the documentation explicitly says the default selects tokens of 2 or more alphanumeric characters. This is why the regex contains \w\w+ rather than just \w+.

Since the given regex matches only alphanumeric characters along with _, it discards all single-character tokens (such as I, 1, 2, etc.) as well as any punctuation symbols present.
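
To make the difference between \w+ and \w\w+ concrete, here is a small sketch with a made-up sentence. It compares the two patterns and also shows what changes if \w is restricted to ASCII; the variable names are purely illustrative and not part of scikit-learn.

```python
import re

# Made-up sample; \u00ef is "ï" and \u00e9 is "é" (precomposed characters).
text = "I saw 2 na\u00efve_users at caf\u00e9 #42!"

# One or more word characters: keeps single-character tokens.
one_or_more = re.findall(r"\b\w+\b", text)

# Two or more word characters: the shape of the default token_pattern.
two_or_more = re.findall(r"\b\w\w+\b", text)

# Tokens that the two-character minimum throws away.
dropped = [t for t in one_or_more if t not in two_or_more]

# Same pattern with ASCII-only \w, for contrast with Unicode semantics.
ascii_only = re.findall(r"\b\w\w+\b", text, flags=re.ASCII)

print(one_or_more)  # ['I', 'saw', '2', 'naïve_users', 'at', 'café', '42']
print(two_or_more)  # ['saw', 'naïve_users', 'at', 'café', '42']
print(dropped)      # ['I', '2']
print(ascii_only)   # ['saw', 'na', 've_users', 'at', 'caf', '42']
```

The underscore counts as a word character, so "naïve_users" stays in one piece, while "#" and "!" never make it into any token. Under ASCII-only matching the accented letters act as separators, which is why the Unicode semantics matter for non-English text.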

You can find an implementation of the given regex using the grep command here.

This link might help for implementing (?u) in grep.
