Pandas DataFrame incompatible with scikit-learn fit_transform() function

I have created a classifier that distinguishes fraudulent messages from genuine ones. A snippet of the code follows:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Import training set as DataFrame from CSV
dataset = pd.read_csv('data.csv', sep=',')
class_names = { 1: 'no-flag', 2: 'flag' }

# Separate training data to message, class pairs
X_train, y_train = dataset.iloc[:,0], dataset.iloc[:, 1]

messages = pd.read_csv('messages.csv', header=None)
X_predict = messages.iloc[:,0]

print("TRAIN:\n")
print(type(X_train))
print("PREDICT:\n")
print(type(X_predict))

# Vectorise text data
vect = TfidfVectorizer(ngram_range=(1, 2), lowercase=True, preprocessor=sanitise_message)
X_train_tfidf = vect.fit_transform(X_train)
X_predict_tfidf = vect.transform(X_predict)

I used to evaluate this by holding out 10% of the training set as a test set, using:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=1)

And that used to work fine. Now I want to use the entire training set as training data, and predict unclassified data. However, the call to X_predict_tfidf = vect.transform(X_predict) throws an error, as follows:

Traceback (most recent call last):
  File "post-test.py", line 3, in <module>
    classify()
  File "/Users/user/Documents/MyTutor/mi_datawarehouse/classifier.py", line 90, in classify
    X_predict_tfidf = vect.transform(X_predict)
  File "/Users/user/miniconda2/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 1409, in transform
    X = super(TfidfVectorizer, self).transform(raw_documents)
  File "/Users/user/miniconda2/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 923, in transform
    _, X = self._count_vocab(raw_documents, fixed_vocab=True)
  File "/Users/user/miniconda2/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 792, in _count_vocab
    for feature in analyze(doc):
  File "/Users/user/miniconda2/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 266, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "/Users/user/miniconda2/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 119, in decode
    raise ValueError("np.nan is an invalid document, expected byte or "
ValueError: np.nan is an invalid document, expected byte or unicode string.

What is interesting is that the types of both X_train and X_predict are identical:

TRAIN:
<class 'pandas.core.series.Series'>
PREDICT:
<class 'pandas.core.series.Series'>

What am I doing wrong? I've been going crazy over this as I've looked everywhere, including the scikit-learn docs.

NOTE: this is NOT a duplicate of a similar question. I have tried everything in that question and nothing worked. The data structures and problem are slightly different.

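The error says that X_predict contains NaN values, most likely because pandas reads empty cells in messages.csv as NaN, and TfidfVectorizer cannot treat NaN as a text document. A quick fix might be to remove the NaN rows. Try messages = messages.dropna() before selecting the column; dropna() returns a new DataFrame rather than modifying messages in place, so the result has to be assigned.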
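
A minimal sketch of removing the NaN rows before vectorising, assuming the missing values come from empty message fields. The two in-memory CSVs are hypothetical stand-ins for data.csv and messages.csv, and the asker's sanitise_message preprocessor is left out because it is not shown in the question:

```python
import io

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for data.csv (message, class pairs).
train_csv = io.StringIO(
    "message,class\n"
    "please confirm your lesson time,1\n"
    "send your bank details to claim the prize,2\n"
)
# Hypothetical stand-in for messages.csv. The second row has an empty
# message field, which pandas reads as NaN by default.
predict_csv = io.StringIO(
    "see you at the lesson,a\n"
    ",b\n"
    "claim your prize now,c\n"
)

dataset = pd.read_csv(train_csv)
X_train = dataset.iloc[:, 0]

messages = pd.read_csv(predict_csv, header=None)

# Both columns are pandas Series, so type() cannot reveal the problem;
# counting the missing values can.
n_missing = messages.iloc[:, 0].isnull().sum()
print("missing messages:", n_missing)

# dropna() returns a new DataFrame, so the result must be assigned.
# subset=[0] only drops rows whose message column (label 0) is missing.
messages_clean = messages.dropna(subset=[0])
X_predict = messages_clean.iloc[:, 0]

vect = TfidfVectorizer(ngram_range=(1, 2), lowercase=True)
X_train_tfidf = vect.fit_transform(X_train)
X_predict_tfidf = vect.transform(X_predict)  # no NaN left to decode
print("rows vectorised:", X_predict_tfidf.shape[0])
```

If every input row needs a prediction, replacing the missing values with an empty string via messages.iloc[:, 0].fillna('') is an alternative that keeps the row count unchanged; whether an empty message should be classified at all is a separate decision.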
