I need to vectorize different sets of tokenized strings, including punctuation and special characters like ?, !, #, /, ➧, ❤, ➽ or ✓. I am using pandas and scikit-learn for this task, but CountVectorizer only vectorizes words and ignores the additional characters. I found this, but I have no list of the additional characters and need all of them. Here is my code for that task:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

def vectorize(dataframe, column_supplement):
    # Build a bag-of-words count matrix over the pre-tokenized strings.
    v = CountVectorizer(analyzer="word", encoding='utf-8', max_features=5000)
    x = v.fit_transform(dataframe['string_tokenized'])
    # get_feature_names() was renamed get_feature_names_out() in newer scikit-learn releases.
    df_result = pd.DataFrame(x.todense(), columns=v.get_feature_names())
    # Prefix every column name with the supplied supplement.
    header = [column_supplement + name for name in df_result.columns]
    df = pd.DataFrame.from_records(df_result.values.tolist(), columns=header)
    return df
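For context, here is a minimal sketch reproducing the problem (the toy DataFrame and the 'feat_' prefix are invented for illustration): with the default settings, the punctuation and special-character tokens never appear among the resulting columns.

    df_demo = pd.DataFrame({'string_tokenized': ['hello world !', 'hello ✓ #']})
    out = vectorize(df_demo, 'feat_')
    print(list(out.columns))
    # ['feat_hello', 'feat_world'] -- '!', '✓' and '#' are silently dropped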
Thanks for any help and ideas!
PS: token_pattern (default u'(?u)\\b\\w\\w+\\b') is the regular expression identifying tokens. By default, words that consist of a single character (e.g. 'a', '2') are ignored; setting token_pattern to '(?u)\\b\\w+\\b' will include these tokens as well.
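For example, a quick sketch of that difference (toy documents invented for illustration; get_feature_names() is the spelling used in the code above, newer scikit-learn releases call it get_feature_names_out()):

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ['a cat sat', 'the cat']
    print(CountVectorizer().fit(docs).get_feature_names())
    # ['cat', 'sat', 'the'] -- single-character 'a' is dropped
    print(CountVectorizer(token_pattern=r'(?u)\b\w+\b').fit(docs).get_feature_names())
    # ['a', 'cat', 'sat', 'the']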
You might find the accepted answer provided by @Venkatachalam to this Stack Overflow question helpful: Sk Learn CountVectorizer: keeping emojis as words.
By using token_pattern=r'[^\s]+' we set the token_pattern to match any sequence of one or more non-whitespace characters.
As a result, the following items will all be treated as tokens (see the sketch after this list):
- punctuation sequences like !#$
- single punctuation marks like * or .
- special characters like emojis, e.g. 😅
- single-character letters, e.g. a or C
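A minimal sketch of that pattern in action (toy documents invented for illustration):

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ['hello world !#$ ✓', 'a . * 😅']
    v = CountVectorizer(token_pattern=r'[^\s]+')
    v.fit(docs)
    print(v.get_feature_names())
    # ['!#$', '*', '.', 'a', 'hello', 'world', '✓', '😅']

Applied to the function in the question, this only requires passing token_pattern=r'[^\s]+' to the CountVectorizer constructor; everything else can stay as it is.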