Removing stop words from a pandas column

import nltk
nltk.download('punkt')
nltk.download('stopwords')
import datetime
import numpy as np
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
# Load the Pandas libraries with alias 'pd' 
import pandas as pd 
# Read data from file 'filename.csv' 
# (in the same directory that your python process is based)
# Control delimiters, rows, column names with read_csv (see later) 
data = pd.read_csv("march20_21.csv") 
# Preview the first 5 lines of the loaded data 
#drop NA rows
data.dropna()
#drop all columns not needed
droppeddata = data.drop(columns=['created_at'])
#drop NA rows
alldata = droppeddata.dropna()

ukdata = alldata[alldata.place.str.contains('England')]
ukdata.drop(columns=['place'])

ukdata['text'].apply(word_tokenize)
eng_stopwords = stopwords.words('english') 

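I know there are a lot of redundant variables, but I'm still trying to get this working before I go back and refine it.

I'm not sure how to remove the stop words stored in the variable from the tokenized column. Any help is appreciated; I'm new to Python. Thanks.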

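  1. After applying a function to a column, you need to assign the result back to the column; apply is not an in-place operation.

  2. After tokenization, ukdata['text'] contains a list of words in each row, so you can use a list comprehension inside apply to remove the stop words.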


ukdata['text'] = ukdata['text'].apply(word_tokenize)
eng_stopwords = stopwords.words('english') 
ukdata['text'] = ukdata['text'].apply(lambda words: [word for word in words if word not in eng_stopwords])
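To see why the assignment in the snippet above matters, here is a minimal sketch. It uses a small made-up DataFrame and `str.split` as a stand-in for `word_tokenize` (both are assumptions, chosen so the example runs without downloading NLTK data). The `.copy()` call is an optional addition, not part of the original answer: `ukdata` in the question is a filtered slice of `alldata`, and depending on the pandas version, assigning a column on such a slice may raise a `SettingWithCopyWarning`.

```python
import pandas as pd

# Hypothetical stand-in data; the column names mirror the question.
alldata = pd.DataFrame({
    "place": ["London, England", "Paris, France"],
    "text": ["This is a sentence.", "Another short one."],
})

# .copy() gives an independent frame, so the later column assignment
# clearly targets ukdata rather than a view of alldata.
ukdata = alldata[alldata["place"].str.contains("England")].copy()

# apply returns a new Series; without an assignment the column is unchanged.
ukdata["text"].apply(str.split)
unchanged = ukdata["text"].iloc[0]  # still the original string

# Assigning the result back replaces the column with lists of tokens.
ukdata["text"] = ukdata["text"].apply(str.split)
tokens = ukdata["text"].iloc[0]
```

The same reasoning applies to `data.dropna()` and `ukdata.drop(columns=['place'])` in the question: both return new objects, and the result is lost unless it is assigned to a name.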

Minimal example:
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

eng_stopwords = stopwords.words('english')

ukdata = pd.DataFrame({'text': ["This is a sentence."]})
ukdata['text'] = ukdata['text'].apply(word_tokenize)
ukdata['text'] = ukdata['text'].apply(lambda words: [word for word in words if word not in eng_stopwords])
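One thing the minimal example above does not handle is case: NLTK's English stop word list is all lowercase, so a capitalized token such as "This" survives a case-sensitive check. A minimal sketch of a case-insensitive variant follows. Here `stop_set` is a small hand-written stand-in for `stopwords.words('english')` (an assumption so the snippet runs without downloading NLTK data), and `remove_stop_words` is a hypothetical helper name.

```python
import pandas as pd

# Hypothetical stand-in for set(stopwords.words('english')); a set makes
# each membership check constant time on average, unlike a list.
stop_set = {"this", "is", "a", "the", "of"}

def remove_stop_words(words):
    """Drop tokens whose lowercase form is in stop_set."""
    return [w for w in words if w.lower() not in stop_set]

# One row whose cell already holds a list of tokens, as after tokenization.
df = pd.DataFrame({"text": [["This", "is", "a", "sentence", "."]]})
df["text"] = df["text"].apply(remove_stop_words)
result = df["text"].iloc[0]
```

With the real NLTK list, the same idea would be `stop_set = set(stopwords.words('english'))`. Punctuation tokens such as "." are not stop words, so they remain unless you filter them separately.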
