
Vectorize string including punctuation and special characters

I need to vectorize different sets of tokenized strings including punctuation and special characters like ?, !, #, /, ➧, ❤, ➽ or ✓. I am using pandas and scikit-learn for this task, but CountVectorizer only vectorizes words and ignores the additional characters. I found this, but I have no list of the additional characters and need all of them. Here is my code for the task:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

def vectorize(dataframe, column_supplement):
    v = CountVectorizer(analyzer="word", encoding='utf-8', max_features=5000)
    x = v.fit_transform(dataframe['string_tokenized'])
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    df_result = pd.DataFrame(x.todense(), columns=v.get_feature_names_out())
    instances = df_result.values.tolist()
    header = list(df_result)
    for i in range(len(header)):
        header[i] = column_supplement + header[i]  # prefix each column name
    df = pd.DataFrame.from_records(instances, columns=header)
    return df

Thanks for help and ideas!

PS: token_pattern (default u'(?u)\\b\\w\\w+\\b') is the regular expression identifying tokens. By default, tokens that consist of a single character (e.g. 'a', '2') are ignored; setting token_pattern to '(?u)\\b\\w+\\b' will include these tokens.
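A minimal sketch of the difference the PS describes, using a made-up one-sentence corpus: the default pattern requires at least two word characters per token, so 'a' and '2' vanish from the vocabulary, while the widened pattern keeps them.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["a cat saw 2 dogs"]

# Default token_pattern u'(?u)\\b\\w\\w+\\b' drops single-character tokens.
default_v = CountVectorizer()
default_v.fit(docs)
print(sorted(default_v.vocabulary_))  # ['cat', 'dogs', 'saw']

# One-or-more word characters keeps single-character tokens as well.
wide_v = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
wide_v.fit(docs)
print(sorted(wide_v.vocabulary_))  # ['2', 'a', 'cat', 'dogs', 'saw']
```

Note that neither pattern matches punctuation or emoji, since \w only covers word characters; that is the limitation the answer below addresses.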

You might find the accepted answer provided by @Venkatachalam to this Stack Overflow question helpful: Sk Learn CountVectorizer: keeping emojis as words

By using token_pattern=r'[^\s]+' we set the token pattern to match any sequence of one or more non-whitespace characters. (The '+' belongs outside the character class; r'[^\s+]' would match only single characters and would also exclude literal '+'.)

As a result, the following items will be treated as tokens:

  • punctuation sequences like !#$ and even single punctuation marks like * or .

  • special characters like the emoji 😅

  • single-character letters, e.g. a, C
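The three cases above can be checked with a small, made-up document (the text and the lowercase=False choice are illustrative assumptions, not from the question's data): every whitespace-separated run, whether punctuation, emoji, or a lone letter, lands in the vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["great ✓ !#$ a . 😅"]

# Any run of non-whitespace characters counts as a token, so punctuation
# sequences, emojis, and single letters all survive. lowercase=False keeps
# letter case intact.
v = CountVectorizer(token_pattern=r"[^\s]+", lowercase=False)
v.fit(docs)
print(sorted(v.vocabulary_))  # ['!#$', '.', 'a', 'great', '✓', '😅']
```

The same token_pattern can be passed straight into the question's vectorize function to stop CountVectorizer from dropping the special characters.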


