
CountVectorizer does not process my text data. It keeps giving me AttributeError: 'list' object has no attribute 'lower'

I have created a process_textData function that takes a pandas DataFrame column of text and performs the following steps:
1. Converts the text to lower case and removes all punctuation.
2. Optionally applies stemming.
3. Applies n-gram tokenisation.
4. Returns the tokenised text as a list.

import string
from nltk.stem.snowball import SnowballStemmer
from nltk import everygrams, word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

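def process_textData(data, n=1):
    stemmer = SnowballStemmer('english')
    # Remove all punctuation from each document.
    data = data.apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)))
    # Build lower-cased n-grams; each document becomes a list of strings.
    data = data.apply(lambda x: [' '.join(ng).lower() for ng in everygrams(word_tokenize(x),n,n)])
    # Stem each n-gram string; each document is still a list.
    data = data.apply(lambda x: [stemmer.stem(word) for word in x])
    return data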

After that, I passed the function's output to scikit-learn's CountVectorizer, and it gave me this error:

AttributeError: 'list' object has no attribute 'lower'

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(stop_words=None, ngram_range=(3, 3))
X = cv.fit_transform(process_textData(df.news, n=3))
X.toarray()

What am I doing wrong? Can anyone help?

This returns a list of lists (each document becomes a list of strings):

    # ...
    data = data.apply(lambda x: [' '.join(ng).lower() for ng in everygrams(word_tokenize(x),n,n)])
    data = data.apply(lambda x: [stemmer.stem(word) for word in x])
    return data
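A minimal reproduction may make the cause easier to see. The sketch below uses hypothetical toy data in place of df.news (the name docs_as_lists and its contents are assumptions, not from the question) and only needs scikit-learn. With the default settings, CountVectorizer lower-cases each document by calling its lower() method, which a list does not have:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data: each "document" is a list of n-gram strings,
# the same shape the function above returns for every row.
docs_as_lists = [
    ["the cat sat", "cat sat down"],
    ["a dog ran", "dog ran off"],
]

cv = CountVectorizer()
try:
    cv.fit_transform(docs_as_lists)
    error_message = None
except AttributeError as exc:
    # The default preprocessing step calls doc.lower() on every document.
    error_message = str(exc)

print(error_message)
```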

fit_transform expects a list of strings, one string per document. I suggest editing it like this:

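    # ...
    # Stem each token, then join the tokens of a document back into ONE string.
    # Joining with ' ' (not '') keeps word boundaries; CountVectorizer's own
    # ngram_range then builds the n-grams, so everygrams is not needed here.
    data = data.apply(lambda x: [stemmer.stem(word) for word in word_tokenize(x.lower())])
    data = data.apply(lambda x: ' '.join(x))
    return data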
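As a rough check of the idea, here is a small sketch that feeds one string per document to CountVectorizer and lets ngram_range=(3, 3) build the trigrams. The toy token lists are hypothetical stand-ins for the tokenised df.news column, and stemming is left out so that the example depends on scikit-learn only:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data standing in for the tokenised df.news column.
token_lists = [
    ["the", "cat", "sat", "down"],
    ["the", "cat", "sat", "up"],
]

# One string per document, with words separated by single spaces.
docs = [" ".join(tokens) for tokens in token_lists]

cv = CountVectorizer(stop_words=None, ngram_range=(3, 3))
X = cv.fit_transform(docs)

# List the learned trigrams in column order.
features = sorted(cv.vocabulary_, key=cv.vocabulary_.get)
counts = X.toarray().tolist()
print(features)
print(counts)
```

If the n-grams really have to be built before vectorising, another option is to keep the lists and pass a callable as the analyzer argument of CountVectorizer, which then uses each list as given instead of lower-casing and tokenising it.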

