Preprocessing data: removing Italian stop words for text analysis

Question


# Notebook shell command; run "pip install stop-words" in a terminal otherwise
!pip install stop-words
import re
from stop_words import get_stop_words

stop = get_stop_words('italian')

# helper function to clean tweets
def processTweet(tweet):
    # Remove HTML special entities (e.g. &amp;)
    tweet = re.sub(r'\&\w*;', '', tweet)
    # Remove @username mentions
    tweet = re.sub(r'@[^\s]+', '', tweet)
    # Remove tickers
    tweet = re.sub(r'\$\w*', '', tweet)
    # To lowercase
    tweet = tweet.lower()
    # Remove hyperlinks
    tweet = re.sub(r'https?:\/\/.*\/\w*', '', tweet)
    # Remove hashtags
    tweet = re.sub(r'#\w*', '', tweet)
    # Remove leftover mentions, '#', URLs, tokens containing digits and
    # basic punctuation, then collapse whitespace
    tweet = ' '.join(re.sub(r"(@[A-Za-z0-9]+)|(#)|(\w+:\/\/\S+)|(\S*\d\S*)|([,;.?!:])",
                            " ", tweet).split())
    #tweet = re.sub(r'[' + punctuation.replace('@', '') + ']+', ' ', tweet)
    # Remove words with 3 or fewer characters
    tweet = re.sub(r'\b\w{1,3}\b', '', tweet)
    # Collapse runs of whitespace (including new line characters) into one space
    tweet = re.sub(r'\s\s+', ' ', tweet)
    # Remove single space remaining at the front of the tweet.
    tweet = tweet.lstrip(' ')
    # Remove characters beyond Basic Multilingual Plane (BMP) of Unicode:
    tweet = ''.join(c for c in tweet if c <= '\uFFFF')
    return tweet

df['text'] = df['text'].apply(processTweet)

Answer 1

Just use re.sub(), as you have been doing:

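# \b anchors restrict the match to whole words; without them, short stop
# words such as "e" or "a" would also be stripped from inside longer words
exclusions = r'\b(?:' + '|'.join(re.escape(s) for s in stop) + r')\b'
tweet = re.sub(exclusions, '', tweet)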
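A minimal sketch of how this substitution could be wrapped in a reusable helper, assuming the pattern is compiled once and reused for every tweet. The short `stop` list, the `STOP_RE` name and the `remove_stop_words` helper are illustrative assumptions; in the question, `stop` comes from `get_stop_words('italian')`.

```python
import re

# Hypothetical sample list; in the question this comes from get_stop_words('italian').
stop = ["il", "la", "e", "di", "che", "non", "per"]

# Compile once and reuse. The \b anchors limit matches to whole words.
STOP_RE = re.compile(r"\b(?:" + "|".join(map(re.escape, stop)) + r")\b")

def remove_stop_words(tweet):
    """Replace stop words with a space, then collapse the leftover whitespace."""
    return " ".join(STOP_RE.sub(" ", tweet).split())

tweets = ["il gatto e la volpe", "non che importi per niente"]
cleaned = [remove_stop_words(t) for t in tweets]
print(cleaned)
```

A call like this could sit inside processTweet after the lowercasing step, since stop word lists are typically lowercase.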

Answer 2

Consider the following example:

import re
stops = ["and","or","not"] # list of words to remove
text = "Band and nothing else!" # "and" inside "Band" and "not" inside "nothing" should stay
pattern = r'\b(?:' + '|'.join(re.escape(s) for s in stops) + r')\b'
clean = re.sub(pattern, '', text)
print(clean)

Output

Band  nothing else!

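Explanation: re.escape takes care of characters that have a special meaning in a regular expression pattern (e.g. .) and converts them to their literal versions, so re.escape(".") matches a literal . rather than any character. | is alternation, and joining all the words with | builds one alternation over the whole list. (?: ... ) is a non-capturing group, which lets us use a single \b at the start and a single \b at the end instead of one pair per word. \b is a word boundary, used here to make sure that only whole words are removed, so that, for example, Band does not turn into B.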
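To make the role of \b concrete, here is a small sketch that compares an unanchored alternation with the anchored pattern described above. The three-word `stops` list and the sample sentence are made up for illustration, and the `squeeze` helper is a hypothetical convenience that only tidies whitespace.

```python
import re

stops = ["e", "la", "non"]            # hypothetical mini stop list
text = "la bella nonna e il cane"

def squeeze(s):
    """Collapse whitespace so the two results are easy to compare."""
    return " ".join(s.split())

# Unanchored: every occurrence is removed, even inside longer words.
naive = squeeze(re.sub("|".join(map(re.escape, stops)), "", text))

# Anchored: \b on both sides restricts the match to whole words.
pattern = r"\b(?:" + "|".join(map(re.escape, stops)) + r")\b"
bounded = squeeze(re.sub(pattern, "", text))

print(naive)    # bl na il can
print(bounded)  # bella nonna il cane
```

The unanchored version damages "bella", "nonna" and "cane", which is why the word boundaries matter for short Italian stop words.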

2 solutions

Solution 1
1 Accepted 2022-05-12 10:53:29

Solution 2
1 2022-05-12 10:55:11
