AttributeError: 'list' object has no attribute 'lower' with CountVectorizer
CountVectorizer does not process my text data. It keeps giving me AttributeError: 'list' object has no attribute 'lower'.
I have created a process_textData function that takes a pandas DataFrame column of text and then: 1. converts the text to lower case and removes all punctuation; 2. optionally applies stemming; 3. applies n-gram tokenisation; 4. returns the tokenised text as a list.
import string
from nltk.stem.snowball import SnowballStemmer
from nltk import everygrams, word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
def process_textData(data, n=1):
    stemmer = SnowballStemmer('english')
    data = data.apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)))
    data = data.apply(lambda x: [' '.join(ng).lower() for ng in everygrams(word_tokenize(x), n, n)])
    data = data.apply(lambda x: [stemmer.stem(word) for word in x])
    return data
Afterwards, I passed the function's output into sklearn's CountVectorizer, and it gave me this error:
AttributeError: 'list' object has no attribute 'lower'
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(stop_words=None, ngram_range=(3, 3))
X = cv.fit_transform(process_textData(df.news, n=3))
X.toarray()
What am I doing wrong? Can anyone help?
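The failure is easy to reproduce in isolation: CountVectorizer lowercases each document by calling .lower() on it, so a Series whose elements are token lists (rather than strings) triggers exactly this AttributeError. A minimal sketch, with made-up sample data:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# A Series of token *lists*, the same shape process_textData returns
docs = pd.Series([["hello", "world"], ["good", "morning"]])

cv = CountVectorizer()
try:
    cv.fit_transform(docs)
    err = None
except AttributeError as e:
    # CountVectorizer tried to call .lower() on a list
    err = str(e)

print(err)
```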
This returns a list of lists:
    # ...
    data = data.apply(lambda x: [' '.join(ng).lower() for ng in everygrams(word_tokenize(x), n, n)])
    data = data.apply(lambda x: [stemmer.stem(word) for word in x])
    return data
whereas fit_transform expects an iterable of strings. I suggest joining each row's tokens back into a single string, like this:
    # ...
    data = data.apply(lambda x: [' '.join(ng).lower() for ng in everygrams(word_tokenize(x), n, n)])
    # Join with a space so the row becomes one string; stemming each n-gram
    # before joining avoids iterating over the characters of a string
    data = data.apply(lambda x: ' '.join(stemmer.stem(gram) for gram in x))
    return data
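The underlying point can be checked without NLTK at all. A minimal sketch using str.split as a stand-in tokenizer (the two sample sentences are made up):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

texts = pd.Series(["the quick brown fox", "jumps over the lazy dog"])

# A Series of token lists would fail; joining them back into
# strings gives fit_transform the input it expects
token_lists = texts.apply(str.split)
joined = token_lists.apply(' '.join)

cv = CountVectorizer(ngram_range=(3, 3))
X = cv.fit_transform(joined)

# One row per document, one column per distinct trigram
print(X.shape)
```

Note that if you keep ngram_range=(3, 3) in CountVectorizer, you do not also need to build trigrams yourself with everygrams; doing both would compute n-grams of n-grams.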