
Preprocessing data: removing Italian stopwords for text analysis


I would like to remove Italian stopwords inside this function, but I don't know how to do it.

! pip install stop-words
from stop_words import get_stop_words

stop = get_stop_words('italian')

import re
# helper function to clean tweets
def processTweet(tweet):
    # Remove HTML special entities (e.g. &amp;)
    tweet = re.sub(r'\&\w*;', '', tweet)
    # Remove @username mentions
    tweet = re.sub(r'@[^\s]+', '', tweet)
    # Remove tickers
    tweet = re.sub(r'\$\w*', '', tweet)
    # To lowercase
    tweet = tweet.lower()
    # Remove hyperlinks
    tweet = re.sub(r'https?://\S+', '', tweet)
    # Remove hashtags
    tweet = re.sub(r'#\w*', '', tweet)
    # Remove leftover mentions, '#' signs, URLs, tokens containing digits and
    # common punctuation, then collapse whitespace
    tweet = ' '.join(re.sub(r"(@[A-Za-z0-9]+)|(#)|(\w+:\/\/\S+)|(\S*\d\S*)|([,;.?!:])",
                            " ", tweet).split())
    #tweet = re.sub(r'[' + punctuation.replace('@', '') + ']+', ' ', tweet)
    # Remove words with 3 or fewer letters
    tweet = re.sub(r'\b\w{1,3}\b', '', tweet)
    # Remove whitespace (including new line characters)
    tweet = re.sub(r'\s\s+', ' ', tweet)
    # Remove single space remaining at the front of the tweet.
    tweet = tweet.lstrip(' ') 
    # Remove characters beyond Basic Multilingual Plane (BMP) of Unicode:
    tweet = ''.join(c for c in tweet if c <= '\uFFFF') 
    return tweet
df['text'] = df['text'].apply(processTweet)

Just use re.sub() as you've been using:

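# Match whole words only; a bare '|'.join(stop) would also strip letters inside other words
exclusions = r'\b(?:' + '|'.join(re.escape(w) for w in stop) + r')\b'
tweet = re.sub(exclusions, '', tweet)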
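
If the tweet has already been lowercased and stripped of punctuation, as processTweet does, filtering tokens against a set is a possible alternative to the regex. This is only a minimal sketch: the short `stop` list is a hypothetical stand-in for get_stop_words('italian'), and `drop_stopwords` is an illustrative name that does not appear in the question.

```python
# Hypothetical short list; in the question this would be get_stop_words('italian').
stop = ["il", "la", "e", "di", "che", "non"]
stop_set = set(stop)  # a set gives constant-time membership checks

def drop_stopwords(tweet):
    """Keep only the whitespace-separated tokens that are not stopwords."""
    return " ".join(tok for tok in tweet.split() if tok not in stop_set)

print(drop_stopwords("il gatto e la volpe"))  # gatto volpe
```

Because this compares whole tokens, it cannot cut letters out of the middle of a word, but it relies on the text being lowercased and free of attached punctuation first.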

Consider the following example:

import re
stops = ["and","or","not"] # list of words to remove
text = "Band and nothing else!" # "and" inside "Band" and "not" inside "nothing" must stay
pattern = r'\b(?:' + '|'.join(re.escape(s) for s in stops) + r')\b'
clean = re.sub(pattern, '', text)
print(clean)

Output:

Band  nothing else!

Explanation: re.escape handles characters that have a special meaning in a regular expression (e.g. .) by turning them into their literal form, so re.escape(".") matches a literal dot rather than any character. | denotes alternation, and joining the escaped words with | builds a single alternation over all of them. (?: ... ) is a non-capturing group, which lets us use one \b at the start and one \b at the end instead of a pair around every word. \b is a word boundary; it ensures that only whole words are removed, so that Band is not turned into B. Note that removing a word leaves its surrounding spaces behind, which is why the output contains a double space.
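
The same idea can be wrapped into reusable helpers that also ignore case and clean up the leftover spaces. The sketch below makes some assumptions: `ITALIAN_STOPS` is a short hypothetical stand-in for get_stop_words('italian'), and the function names are illustrative only.

```python
import re

# Hypothetical stand-in for get_stop_words('italian'); the real list is much longer.
ITALIAN_STOPS = ["il", "la", "e", "di", "che", "non", "per", "un"]

def build_stopword_pattern(stopwords):
    """Compile one whole-word, case-insensitive pattern over all stopwords."""
    alternatives = "|".join(re.escape(w) for w in stopwords)
    return re.compile(r"\b(?:" + alternatives + r")\b", flags=re.IGNORECASE)

def remove_stopwords(text, pattern):
    """Delete stopwords, then collapse the gaps they leave behind."""
    without_stops = pattern.sub("", text)
    return re.sub(r"\s+", " ", without_stops).strip()

pattern = build_stopword_pattern(ITALIAN_STOPS)
print(remove_stopwords("Il gatto e la volpe non dormono", pattern))  # gatto volpe dormono
```

Compiling the pattern once and reusing it avoids rebuilding the alternation for every row when the function is applied to a whole column, for example with df['text'].apply(lambda t: remove_stopwords(t, pattern)).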
