
Vectorize strings including punctuation and special characters

I need to vectorize different sets of tokenized strings, including punctuation and special characters like ?, !, #, /, ➧, ❤, ➽ or ✓. I am using pandas and scikit-learn for that task, but CountVectorizer only vectorizes words and ignores the additional characters. I found this, but I have no list of the additional characters and need all of them. Here is my code for that task:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

def vectorize(dataframe, column_supplement):
    v = CountVectorizer(analyzer="word", encoding='utf-8', max_features=5000)
    x = v.fit_transform(dataframe['string_tokenized'])
    # one column per learned token; todense() turns the sparse matrix into a dense one
    df_result = pd.DataFrame(x.todense(), columns=v.get_feature_names())
    instances = df_result.values.tolist()
    header = list(df_result)
    # prefix every column name with the supplied supplement
    for i in range(len(header)):
        header[i] = column_supplement + header[i]
    df = pd.DataFrame.from_records(instances, columns=header)
    return df
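
For context, a minimal call might look like this (the sample data and the 'feat_' prefix are made up for illustration):

df_in = pd.DataFrame({'string_tokenized': ['hello world !', 'ok ✓ #yes']})
df_counts = vectorize(df_in, 'feat_')
print(df_counts.columns.tolist())
# ['feat_hello', 'feat_ok', 'feat_world', 'feat_yes']: '!', '✓' and '#' are dropped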

Thanks for any help and ideas!

PS: token_pattern (default u'(?u)\\b\\w\\w+\\b') is the regular expression identifying tokens. By default, words that consist of a single character (e.g. 'a', '2') are ignored; setting token_pattern to '(?u)\\b\\w+\\b' will include these tokens.
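
A quick sketch of that difference (the sample sentence is made up; on scikit-learn >= 1.0, get_feature_names() is replaced by get_feature_names_out()):

from sklearn.feature_extraction.text import CountVectorizer

text = ['a cat saw 2 dogs']
print(CountVectorizer().fit(text).get_feature_names())
# ['cat', 'dogs', 'saw']: single-character tokens 'a' and '2' are ignored
print(CountVectorizer(token_pattern=r'(?u)\b\w+\b').fit(text).get_feature_names())
# ['2', 'a', 'cat', 'dogs', 'saw']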

You might find the accepted answer provided by @Venkatachalam to this Stack Overflow question helpful: Sk Learn CountVectorizer: keeping emojis as words

By using token_pattern=r'[^\s]+' we set the token pattern to match any sequence of one or more non-whitespace characters, so tokens are whatever is separated by whitespace.

As a result, the following items will be treated as tokens (a runnable sketch follows the list):

  • punctuation sequences like !#$, and even single punctuation marks like * or .

  • special characters like emojis, e.g. 😅

  • single-character letters, e.g. a or C
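
A minimal sketch of that behaviour (the sample strings are made up for illustration):

from sklearn.feature_extraction.text import CountVectorizer

docs = ['I ❤ this ✓', 'wow !#$ *', 'a . 😅']
v = CountVectorizer(token_pattern=r'[^\s]+')
v.fit(docs)
print(v.get_feature_names())
# ['!#$', '*', '.', 'a', 'i', 'this', 'wow', '✓', '❤', '😅']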
