
Python - Function to remove stopwords from pandas series

I have the data below, stored as a Series called data_counts, with the words in the index and their counts in the '0' column. The Series contains 30k words; the following is just a sample:

Index      |    0

the        |    3425
American   |    431 
a          |    213 
I          |    124
hilarious  |    53
Mexican    |    23
is         |    2 

I'd like to convert the words in the index to lowercase and remove the stopwords using NLTK. I have seen some examples on SO that do this with lambdas (see the DataFrame example below), but I'd like to do it with a function defined using def instead (I am a Python newbie and this seems the easiest to understand).

df['Index'] = df['Index'].apply(lambda stop_remove: [word.lower() for word in stop_remove.split() if word not in stopwords])

Many thanks in advance.

If you really want to define your own function, you can do so and then apply it to the column row by row with .apply:

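import pandas as pd
# The stopword corpus must be available; run nltk.download('stopwords') once if it is not
from nltk.corpus import stopwords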

df = pd.DataFrame(index=['the', 'American', 'a', 'I', 'hilarious', 'Mexican', 'is'],
                  data={ 0:[3425, 431, 213, 124, 53, 23, 2]})

# Clean up dataframe and convert words to lowercase
df['words'] = df.index.str.lower()
df.reset_index(drop=True, inplace=True)

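# Build the stopword set once, so the list is not reloaded for every word
stop_words = set(stopwords.words('english'))

# Define our function to remove stopwords
def remove_stopwords(word):
    # Keep the word, or return an empty string if it is a stopword
    return word if word not in stop_words else ''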

# Apply the function to our words column to clean up.
df['words_clean'] = df.words.apply(remove_stopwords)
print(df)
      0      words words_clean
0  3425        the            
1   431   american    american
2   213          a            
3   124          i            
4    53  hilarious   hilarious
5    23    mexican     mexican
6     2         is             
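
Since the question starts from a Series rather than a DataFrame, the same idea can also be sketched directly on data_counts. This is only one possible approach, under a few assumptions: the STOPWORDS set below is a small hypothetical stand-in so the snippet runs without the NLTK corpus, and the function name and its arguments are made up for illustration. It also sums the counts of words that become identical after lowercasing (for example 'The' and 'the'), which the question does not ask for but which is likely to come up with 30k words.

```python
import pandas as pd

# Hypothetical stand-in for the NLTK list, so the sketch runs without the corpus.
# With NLTK installed one would likely use instead:
#   from nltk.corpus import stopwords
#   STOPWORDS = set(stopwords.words('english'))
STOPWORDS = {'the', 'a', 'i', 'is'}

def remove_stopwords(counts, stop_words):
    """Lowercase the index of a word-count Series and drop the stopwords."""
    lowered = counts.copy()
    lowered.index = lowered.index.str.lower()
    # Lowercasing can produce duplicate labels, so add their counts together
    lowered = lowered.groupby(level=0, sort=False).sum()
    # Keep only the rows whose label is not a stopword
    return lowered[~lowered.index.isin(stop_words)]

data_counts = pd.Series(
    [3425, 431, 213, 124, 53, 23, 2],
    index=['the', 'American', 'a', 'I', 'hilarious', 'Mexican', 'is'],
)
cleaned = remove_stopwords(data_counts, STOPWORDS)
print(cleaned)
```

With the sample data, only american, hilarious and mexican remain, and the stopword rows are dropped rather than replaced by empty strings.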
