Cleaning text data for sentiment analysis and bag-of-words

Question

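I am currently working on a project where I prepare training and test data for sentiment analysis. I have run into a problem related to re.sub() that I cannot figure out how to solve. My code is as follows: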

import re

def preprocessor(text):
    text = re.sub(r"<[^>]*>", "",  text) # removes all the html markup
    emoticons = re.findall(r'(?::|;|= )(?:-)?(?:\)|\(|D|P)', text)
    # removes all the non-word characters and converts the text to lower case
    text = (re.sub(r'[\W]+', '', text.lower()) + ''.join(emoticons).replace('-', ''))
    return text

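As you can see, the function runs without raising any exception. However, when I print the text to check whether it produces the result I want, I get the following output: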

preprocessor(df.loc[0, 'review'][-50:])



'isseventitlebrazilnotavailable'

whereas my desired output is:

'is seven title brazil not available'

My guess is that re.sub() is removing all the whitespace, but I do not know how to fix this.

An answer would be appreciated.

Note: I want to clean the string as follows, for example from 'is seven.<br /><br />Title (Brazil): Not  Available' to

'is seven title brazil not available'

Thanks

Answer 1

You can try the following:

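import re

text = 'is seven.<br /><br />Title (Brazil): Not  Available'
## replace tags with a space so neighbouring words stay separated
text = re.sub(r"<.*?>", " ", text)
## drop everything except letters, digits and whitespace
text = re.sub(r'[^a-zA-Z0-9\s]+', '', text)
## collapse the runs of whitespace left behind
text = re.sub(r'\s+', ' ', text).strip()
print(text)

Output:

is seven Title Brazil Not Available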
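Since the end goal in the title is a bag-of-words, here is a minimal sketch of how the cleaning steps above might feed into one. The helper name `clean` and the use of `collections.Counter` are assumptions made for illustration, not part of the original answer; lowercasing is added so the result matches the desired output in the question.

```python
import re
from collections import Counter

def clean(text):
    # Hypothetical helper wrapping the steps from this answer.
    text = re.sub(r"<.*?>", " ", text)            # tags become a space
    text = re.sub(r"[^a-zA-Z0-9\s]+", "", text)   # keep letters, digits, whitespace
    return re.sub(r"\s+", " ", text).strip().lower()

cleaned = clean("is seven.<br /><br />Title (Brazil): Not  Available")
bag = Counter(cleaned.split())  # word -> number of occurrences

print(cleaned)
print(bag["title"])
```

Splitting on whitespace only works because the spaces were preserved during cleaning, which is exactly what the question's version loses.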

Answer 2

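When you use \W in a regular expression, it also matches whitespace characters. In your case these are replaced with the empty string as well. To demonstrate, here is a piece of code: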

import re

text = "This is my Text"
text1 = re.sub(r'[\W]+', '', text.lower())
text2 = re.sub(r'[^a-zA-Z0-9_\s]+', '', text.lower())

print(text1)
print(text2)

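If you check the documentation, \W is effectively equivalent to [^a-zA-Z0-9_] (exactly so for ASCII-only matching; with str patterns in Python 3, \W is Unicode-aware unless re.ASCII is set). If you do not want whitespace to be replaced with the empty string, add the whitespace class ( \s ) to that set, as in the text2 example above.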
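Applied to the function from the question, one possible fix is to replace each run of non-word characters with a single space instead of the empty string. The sketch below is an assumption about what the asker wants rather than a definitive version: it also replaces HTML tags with a space, joins emoticons with spaces, and drops the space after `=` in the emoticon pattern, which looks unintended.

```python
import re

def preprocessor(text):
    # Replace HTML tags with a space so adjacent words are not glued together.
    text = re.sub(r"<[^>]*>", " ", text)
    # Collect emoticons such as :) ;-( =D before punctuation is stripped.
    emoticons = re.findall(r"(?::|;|=)(?:-)?(?:\)|\(|D|P)", text)
    # Replace every run of non-word characters with ONE space (not '').
    text = re.sub(r"\W+", " ", text.lower()).strip()
    # Append the emoticons, with the "nose" character removed.
    tail = " ".join(emoticons).replace("-", "")
    return f"{text} {tail}".strip()

print(preprocessor("is seven.<br /><br />Title (Brazil): Not Available"))
print(preprocessor("Great movie :-) <b>Loved</b> it!"))
```

The first call prints the string the question asks for, and the second shows that emoticons still end up at the end of the cleaned text.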
