I need to vectorize different sets of tokenized strings, including punctuation and special characters like ?, !, #, /, ➧, ❤, ➽ or ✓. I am using pandas and scikit-learn for this task, but CountVectorizer only vectorizes words and ignores the additional characters. I found this, but I have no list of the additional characters and need all of them. Here is my code for that task:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

def vectorize(dataframe, column_supplement):
    # Build a bag-of-words count matrix over the pre-tokenized strings.
    v = CountVectorizer(analyzer="word", encoding='utf-8', max_features=5000)
    x = v.fit_transform(dataframe['string_tokenized'])
    # get_feature_names() was renamed get_feature_names_out() in newer scikit-learn releases.
    df_result = pd.DataFrame(x.todense(), columns=v.get_feature_names())
    # Prefix every column name with the supplied supplement.
    header = [column_supplement + name for name in df_result.columns]
    df = pd.DataFrame.from_records(df_result.values.tolist(), columns=header)
    return df
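For context, here is a minimal sketch reproducing the problem (the toy DataFrame and the 'feat_' prefix are invented for illustration): with the default settings, the punctuation and special-character tokens never appear among the resulting columns.

    df_demo = pd.DataFrame({'string_tokenized': ['hello world !', 'hello ✓ #']})
    out = vectorize(df_demo, 'feat_')
    print(list(out.columns))
    # ['feat_hello', 'feat_world'] -- '!', '✓' and '#' are silently dropped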
Thanks for any help and ideas!
PS: token_pattern (default u'(?u)\\b\\w\\w+\\b') is the regular expression identifying tokens. By default, words that consist of a single character (e.g. 'a', '2') are ignored; setting token_pattern to '(?u)\\b\\w+\\b' will include these tokens as well.
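For example, a quick sketch of that difference (toy documents invented for illustration; get_feature_names() is the spelling used in the code above, newer scikit-learn releases call it get_feature_names_out()):

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ['a cat sat', 'the cat']
    print(CountVectorizer().fit(docs).get_feature_names())
    # ['cat', 'sat', 'the'] -- single-character 'a' is dropped
    print(CountVectorizer(token_pattern=r'(?u)\b\w+\b').fit(docs).get_feature_names())
    # ['a', 'cat', 'sat', 'the']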
You might find the accepted answer provided by @Venkatachalam to this Stack Overflow question helpful: Sk Learn CountVectorizer: keeping emojis as words.
By using token_pattern=r'[^\s]+' we set the token_pattern to match any sequence of one or more non-whitespace characters.
As a result, the following items will all be treated as tokens (see the sketch after this list):
- punctuation sequences like !#$
- single punctuation marks like * or .
- special characters like emojis, e.g. 😅
- single-character letters, e.g. a or C
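A minimal sketch of that pattern in action (toy documents invented for illustration):

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ['hello world !#$ ✓', 'a . * 😅']
    v = CountVectorizer(token_pattern=r'[^\s]+')
    v.fit(docs)
    print(v.get_feature_names())
    # ['!#$', '*', '.', 'a', 'hello', 'world', '✓', '😅']

Applied to the function in the question, this only requires passing token_pattern=r'[^\s]+' to the CountVectorizer constructor; everything else can stay as it is.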