CountVectorizer won't process my text data. It keeps giving me AttributeError: 'list' object has no attribute 'lower'

Question

I have created a process_textData function that takes a pandas DataFrame column of text and then does the following:
1. Converts the text to lower case and removes all punctuation
2. Optionally applies stemming
3. Applies n-gram tokenisation
4. Returns the tokenised text as a list

import string
from nltk.stem.snowball import SnowballStemmer
from nltk import everygrams, word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

def process_textData(data, n=1):
    stemmer = SnowballStemmer('english')
    data = data.apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)))
    data = data.apply(lambda x: [' '.join(ng).lower() for ng in everygrams(word_tokenize(x),n,n)])
    data = data.apply(lambda x: [stemmer.stem(word) for word in x])
    return data

After that, I passed the function's output to sklearn's CountVectorizer, and it gave me this error:

AttributeError: 'list' object has no attribute 'lower'.

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(stop_words=None, ngram_range=(3, 3))
X = cv.fit_transform(process_textData(df.news, n=3))
X.toarray()

What am I doing wrong? Can anyone help?

Answer 1

This returns a list of lists (a Series in which every element is a list of strings):

    # ...
    data = data.apply(lambda x: [' '.join(ng).lower() for ng in everygrams(word_tokenize(x),n,n)])
    data = data.apply(lambda x: [stemmer.stem(word) for word in x])
    return data

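whereas fit_transform expects a list of strings, one string per document. CountVectorizer calls .lower() on each document, which is why a list raises the AttributeError. I suggest joining the tokens back into a single string. Since CountVectorizer is already configured with ngram_range=(3, 3), it builds the trigrams itself, so the function only needs to return stemmed words separated by spaces:

    # ...
    # stem each word, then join the words back into one string per document
    data = data.apply(lambda x: [stemmer.stem(word) for word in word_tokenize(x.lower())])
    data = data.apply(lambda x: ' '.join(x))
    return data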
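
To make the difference concrete, here is a minimal end-to-end sketch of both ways to avoid the error. It is only an illustration under some assumptions: the function name process_text_data, the two sample sentences standing in for df.news, and the hard-coded token_lists are all hypothetical, and plain whitespace splitting is used in place of word_tokenize so the snippet runs without downloading NLTK tokenizer data.

```python
import string

import pandas as pd
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer


def process_text_data(data: pd.Series) -> pd.Series:
    """Return one cleaned, stemmed, space-separated string per document."""
    stemmer = SnowballStemmer("english")
    strip_punct = str.maketrans("", "", string.punctuation)

    def clean(doc: str) -> str:
        # Assumption: whitespace splitting stands in for word_tokenize here.
        words = doc.lower().translate(strip_punct).split()
        return " ".join(stemmer.stem(w) for w in words)

    return data.apply(clean)


# Hypothetical stand-in for df.news.
news = pd.Series([
    "The cats are running quickly.",
    "Dogs were barking loudly!",
])

# Option 1: pass strings and let CountVectorizer build the trigrams itself.
cv = CountVectorizer(stop_words=None, ngram_range=(3, 3))
X = cv.fit_transform(process_text_data(news))

# Option 2: keep pre-built n-gram lists and pass a callable analyzer,
# which makes CountVectorizer skip its own lowercasing and tokenising.
token_lists = [["the cat are", "cat are run"], ["dog were bark"]]
cv_lists = CountVectorizer(analyzer=lambda doc: doc)
X_lists = cv_lists.fit_transform(token_lists)
```

Option 1 is usually the simpler choice, because the n-gram logic then lives in one place (ngram_range). Option 2 may suit you better if you want to keep everygrams from NLTK; in that case ngram_range on the vectorizer is ignored, since a callable analyzer replaces the whole preprocessing pipeline.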
