How to force sklearn CountVectorizer not to remove special characters (such as #, @, $ or %)

Question

Here is my code:

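from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(lowercase=False)

# words is the input string shown in the example below
vocabulary = count.fit_transform([words])
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(count.get_feature_names_out().tolist())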

For example, if:

 words = "Hello @friend, this is a good day. #good."

I would like it to be split into the following tokens:

['Hello', '@friend', 'this', 'is', 'a', 'good', 'day', '#good']

Currently, it is split into:

['Hello', 'friend', 'this', 'is', 'a', 'good', 'day']

Answer 1

You can use the token_pattern parameter of CountVectorizer, as mentioned in the documentation:

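Pass a regular expression that tells CountVectorizer what should count as a word. In this case, we tell CountVectorizer that a word carrying # or @ should still be treated as a single word. Note that ',' and '.' are left out of the character class; otherwise sentence punctuation would stick to the tokens (giving '@friend,' and '#good.'). Then do:

count = CountVectorizer(lowercase=False, token_pattern=r'[a-zA-Z0-9$&+:;=?@#|<>^*()%!-]+')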

Output:

['#good', '@friend', 'Hello', 'a', 'day', 'good', 'is', 'this']
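
To see what a given token_pattern will do before fitting anything, it can help to run the regex directly. As far as I know, the default tokenizer simply applies the compiled token_pattern with findall, and the default pattern is r"(?u)\b\w\w+\b". The sketch below uses only the standard re module; preview_tokens and the three pattern names are made up for illustration, and PREFIX_PATTERN is just one possible narrower alternative, not something taken from the scikit-learn docs.

```python
import re

text = "Hello @friend, this is a good day. #good."

# CountVectorizer's default token_pattern: runs of 2+ word characters,
# so '#', '@' and one-letter words such as 'a' are dropped.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"

# A symbol-friendly character class that leaves out ',' and '.'
# so that sentence punctuation does not stick to tokens.
SYMBOL_PATTERN = r"[a-zA-Z0-9$&+:;=?@#|<>^*()%!-]+"

# Hypothetical narrower alternative: an optional leading #, @, $ or %
# followed by word characters.
PREFIX_PATTERN = r"[#@$%]?\w+"


def preview_tokens(pattern, doc):
    """Return the tokens that re.findall extracts from doc for this pattern."""
    return re.compile(pattern).findall(doc)


print(preview_tokens(DEFAULT_PATTERN, text))
print(preview_tokens(SYMBOL_PATTERN, text))
print(preview_tokens(PREFIX_PATTERN, text))
```

Whichever pattern gives the tokens you want can then be passed as token_pattern=... to CountVectorizer. The vocabulary it reports is sorted and de-duplicated, so the order differs from the order of appearance shown by this preview.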