
Hold-out sample when loading data in scikit-learn with sklearn.datasets.load_files

I'm experimenting with a simple Naive Bayes classifier in scikit-learn.

Essentially, I've got two folders, named CatA and CatB respectively, each containing roughly 1,500 text files.

I'm loading these files in order to train the classifier like so:

# Declare the categories
categories = ['CatA', 'CatB']

# Load the dataset
docs_to_train = sklearn.datasets.load_files(
    "/Users/dh/Documents/Development/Python/Test_Data",
    description=None, categories=categories, load_content=True,
    shuffle=True, encoding='utf-8', decode_error='strict', random_state=0)
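For reference, load_files returns a Bunch object whose .data attribute holds the raw documents, .target the integer class labels, and .target_names the folder names, so the loaded corpus can be inspected directly (a quick check, assuming the directory layout described above):

# Inspect what load_files produced: raw texts, integer labels,
# and the mapping from label index back to folder name.
print(docs_to_train.target_names)   # ['CatA', 'CatB']
print(len(docs_to_train.data))      # total number of documents loaded
print(docs_to_train.target[:10])    # first ten integer labels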

I'm testing the classifier with short strings of text, e.g.:

docs_new = ['This is test string 1.', 'This is test string 2.', 'This is test string 3.']
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

predicted = clf.predict(X_new_tfidf)

for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, docs_to_train.target_names[category]))

Everything works as it ought to, but what I'd really like to do is test the classifier on some data that closely resembles the training data. Ideally, I'd like to carve out a hold-out sample within the data I'm using to train the classifier and then cross-validate with that.

I suppose I could just move 500-odd documents from each of the training categories into different folders, but I was wondering whether there's a better way to create the hold-out sample?

The documentation doesn't appear to offer any guidance on this.

The full code follows:

import sklearn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
import numpy as np
from sklearn import datasets
from pprint import pprint

# Declare the categories
categories = ['CatA', 'CatB']

# Load the dataset
docs_to_train = sklearn.datasets.load_files(
    "/Users/dh/Documents/Development/Python/Test_Data",
    description=None, categories=categories, load_content=True,
    shuffle=True, encoding='utf-8', decode_error='strict', random_state=0)

print(len(docs_to_train.data))

# Vectorise the dataset

count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(docs_to_train.data)

# Fit the estimator and transform the vector to tf-idf

tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
print(X_train_tf.shape)

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
print(X_train_tfidf.shape)

clf = MultinomialNB().fit(X_train_tfidf, docs_to_train.target)

docs_new = ['I am test string 1.', 'I am test string 2', 'I am test string 3']
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

predicted = clf.predict(X_new_tfidf)

for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, docs_to_train.target_names[category])) 

What you're looking for is referred to as a "train-test split."

Use sklearn.model_selection.train_test_split:

from sklearn.model_selection import train_test_split

train_X, test_X, train_y, test_y = train_test_split(
    docs_to_train.data,
    docs_to_train.target,
    test_size=500)
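Note that an integer test_size is an absolute number of documents (here 500 in total, not per class), while a float such as 0.33 is interpreted as a proportion. To hold out roughly 500 documents per category and then evaluate, something like the following should work. This is a minimal sketch reusing the docs_to_train object from the question; test_size=1000 and stratify are my additions, not part of the original answer. The key point is that the vectoriser and tf-idf transformer are fitted on the training split only, so no information from the hold-out set leaks into the features:

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

# Reserve 1000 documents in total; stratify keeps the CatA/CatB
# proportions the same in both splits.
train_X, test_X, train_y, test_y = train_test_split(
    docs_to_train.data,
    docs_to_train.target,
    test_size=1000,
    stratify=docs_to_train.target,
    random_state=0)

# Fit the vectoriser and transformer on the training split only.
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(train_X)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

clf = MultinomialNB().fit(X_train_tfidf, train_y)

# Apply the already-fitted transforms to the hold-out set and score.
X_test_tfidf = tfidf_transformer.transform(count_vect.transform(test_X))
predicted = clf.predict(X_test_tfidf)
print(metrics.accuracy_score(test_y, predicted))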

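Since the question also mentions cross-validation (and the full listing already imports Pipeline without using it), an alternative worth sketching is to chain the steps into a Pipeline and let cross_val_score handle the splitting, so each fold re-fits the vectoriser on its own training data. This is a sketch, not part of the original answer:

from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Chain vectorisation, tf-idf weighting, and the classifier.
text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])

# 5-fold cross-validation over the whole corpus; no manual hold-out needed.
scores = cross_val_score(text_clf, docs_to_train.data, docs_to_train.target, cv=5)
print(scores.mean(), scores.std())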