Vectorizing strings including punctuation and special characters

Question

I need to vectorize different sets of tokenized strings, including punctuation and special characters such as ?, !, #, /, ➧, ❤, ➽ or ✓. I am using pandas and scikit-learn for this task, but CountVectorizer only vectorizes words and ignores the additional characters. I found this, but I have no list of the additional characters and need all of them. Here is my code for that task:

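import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

def vectorize(dataframe, column_supplement):
    # Count word tokens in the tokenized strings, keeping at most 5000 features
    v = CountVectorizer(analyzer="word", encoding="utf-8", max_features=5000)
    x = v.fit_transform(dataframe["string_tokenized"])
    # On scikit-learn >= 1.0 the equivalent call is v.get_feature_names_out()
    df_result = pd.DataFrame(x.todense(), columns=v.get_feature_names())
    instances = df_result.values.tolist()
    header = list(df_result)
    # Prefix every feature column name with column_supplement
    for i in range(len(header)):
        header[i] = column_supplement + header[i]
    df = pd.DataFrame.from_records(instances, columns=header)
    return df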

Thanks for help and ideas!

PS: token_pattern (default u'(?u)\\b\\w\\w+\\b') is the regular expression identifying tokens. By default, words that consist of a single character (e.g. 'a', '2') are ignored; setting token_pattern to '(?u)\\b\\w+\\b' will include these tokens.
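
The difference between the two patterns can be checked with the standard `re` module alone. This is only a sketch: it assumes the vectorizer extracts tokens by applying the pattern with `findall`, and the sample sentence is made up for illustration.

```python
import re

# Hypothetical sample text containing words, single characters and symbols
text = "a great job ✓ !! # 2"

# Default pattern: words of two or more word characters
default_pattern = r"(?u)\b\w\w+\b"
# Relaxed pattern: words of one or more word characters
single_char_pattern = r"(?u)\b\w+\b"

default_tokens = re.findall(default_pattern, text)
single_char_tokens = re.findall(single_char_pattern, text)

print(default_tokens)      # ['great', 'job']
print(single_char_tokens)  # ['a', 'great', 'job', '2']
```

Neither pattern picks up ✓, !! or #, because `\w` matches only word characters.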

Answer 1

You might find the accepted answer provided by @Venkatachalam in this Stack Overflow question helpful: Sk Learn CountVectorizer: keeping emojis as words

By using token_pattern=r'[^\s]+' we set the token pattern to match any run of one or more non-whitespace characters.

As a result, the following items will be treated as tokens:

punctuation sequences like !#$ or even single punctuation marks like * or .
special characters like emojis 😅
single-character letters, e.g. a, C
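
A minimal sketch of this approach, assuming the intended pattern is r'[^\s]+' (one or more non-whitespace characters) and that each document is a single string of whitespace-separated tokens. The sample documents are invented for illustration, and `lowercase=False` is an extra choice made here so that a token like C is not folded into c.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents: tokens are separated by whitespace
docs = ["great job ✓ !!", "I ❤ this # tag ?"]

# Treat every run of non-whitespace characters as one token;
# lowercase=False keeps the original casing of each token
vectorizer = CountVectorizer(token_pattern=r"[^\s]+", lowercase=False)

# build_analyzer() returns the preprocessing + tokenizing callable,
# which is handy for inspecting how a single string is split
analyzer = vectorizer.build_analyzer()
tokens = analyzer(docs[1])
print(tokens)  # ['I', '❤', 'this', '#', 'tag', '?']

# Fit on the documents and build the document-term count matrix
counts = vectorizer.fit_transform(docs)
vocab = vectorizer.vocabulary_  # maps token -> column index
print(sorted(vocab))
print(counts.toarray())
```

In the function from the question, the same `token_pattern` argument could be passed to the existing CountVectorizer call; `max_features` then applies to symbols and words alike.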
