
scikit-learn add training data


I was looking at the training data available in sklearn here. According to the documentation, it contains documents from 20 categories based on a newsgroup collection, and it classifies documents belonging to those categories quite well. However, I need to add more articles for categories such as cricket, football, nuclear physics, etc.

I already have a set of documents prepared for each class, e.g. sports -> cricket, cooking -> French, etc. How do I add these documents and classes to sklearn so that the interface that currently returns 20 classes will return those 20 classes plus the new ones as well? If I need to do some training with SVM or Naive Bayes, where do I do that before adding them to the dataset?

Assuming your additional data has the following directory structure (if it doesn't, this should be your first step, since you can then use the sklearn API to fetch the data, which will make your life a lot easier, see here):

additional_data
      |
      |-> sports.cricket
                |
                |-> file1.txt
                |-> file2.txt
                |-> ...
      |
      |-> cooking.french
                |
                |-> file1.txt
                |-> ...
       ...
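
If your extra documents are not already arranged like this, a minimal sketch along the following lines can write them into that layout (the raw_docs list and the example category names are hypothetical placeholders for your own data):

import os

# Hypothetical (text, label) pairs standing in for your own documents
raw_docs = [
    ('Kohli scored a century in the final innings...', 'sports.cricket'),
    ('Start the sauce by whisking a roux...', 'cooking.french'),
]

additional_data_root = '/path/to/additional_data'

for i, (text, label) in enumerate(raw_docs):
    label_dir = os.path.join(additional_data_root, label)
    os.makedirs(label_dir, exist_ok=True)  # one sub directory per class
    with open(os.path.join(label_dir, 'file{}.txt'.format(i)), 'w', encoding='utf-8') as out_file:
        out_file.write(text)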

Moving on to python, load the two datasets (assuming your additional data is in the format described above and rooted at /path/to/additional_data):

import os

# joblib is now a standalone package; sklearn.cross_validation and sklearn.externals.joblib are deprecated
import joblib
import numpy as np

from sklearn.datasets import fetch_20newsgroups
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Note if you have a pre-defined training/testing split in your additional data, you would merge them with the corresponding 'train' and 'test' subsets of 20news
news_data = fetch_20newsgroups(subset='all')
additional_data = load_files(container_path='/path/to/additional_data', encoding='utf-8')

# Both data objects are of type `Bunch` and therefore can be relatively straightforwardly merged

# Merge the two data files
'''
The Bunch object contains the following attributes: `dict_keys(['target_names', 'description', 'DESCR', 'target', 'data', 'filenames'])`
The interesting ones for our purposes are 'data' and 'filenames'
'''
all_filenames = np.concatenate((news_data.filenames, additional_data.filenames)) # filenames is a numpy array
all_data = news_data.data + additional_data.data # data is a standard python list

merged_data_path = '/path/to/merged_data'

'''
The 20newsgroups data has a filename a la '/path/to/scikit_learn_data/20news_home/20news-bydate-test/rec.sport.hockey/54367'
So depending on whether you want to keep the sub directory structure of the train/test splits or not, 
you would either need the last 2 or 3 parts of the path
'''
for content, f in zip(all_data, all_filenames):
    # extract sub path
    sub_path, filename = f.split(os.sep)[-2:]

    # Create the output directory if it does not exist yet
    p = os.path.join(merged_data_path, sub_path)
    os.makedirs(p, exist_ok=True)

    # Write the document content to file (explicit encoding avoids platform-dependent failures)
    with open(os.path.join(p, filename), 'w', encoding='utf-8') as out_file:
        out_file.write(content)

# Now that everything is stored at `merged_data_path`, we can use `load_files` to fetch the dataset again, which now includes everything from 20newsgroups and your additional data
all_data = load_files(container_path=merged_data_path, encoding='utf-8')

'''
all_data is yet another `Bunch` object:
    * `data` contains the data
    * `target_names` contains the label names
    * `target` contains the labels in numeric format
    * `filenames` contains the paths of each individual document

thus, running a classifier over the data is straightforward
'''
vec = CountVectorizer()
X = vec.fit_transform(all_data.data)

# We want to create a train/test split for learning and evaluating a classifier (supposing we haven't created a pre-defined train/test split encoded in the directory structure)
X_train, X_test, y_train, y_test = train_test_split(X, all_data.target, test_size=0.2)

# Create & fit the MNB model
mnb = MultinomialNB()
mnb.fit(X_train, y_train)

# Evaluate Accuracy
y_predicted = mnb.predict(X_test)

print('Accuracy: {}'.format(accuracy_score(y_test, y_predicted)))

# Alternatively, the vectorisation and learning can be packaged into a pipeline and serialised for later use
pipeline = Pipeline([('vec', CountVectorizer()), ('mnb', MultinomialNB())])

# Run the vectorizer and train the classifier on all available data
pipeline.fit(all_data.data, all_data.target)

# Serialise the classifier to disk
joblib.dump(pipeline, '/path/to/model_zoo/mnb_pipeline.joblib')

# If you get more documents to classify later on, you can deserialise the trained pipeline and run the new documents through it
p = joblib.load('/path/to/model_zoo/mnb_pipeline.joblib')

docs_new = ['God is love', 'OpenGL on the GPU is fast']

y_predicted = p.predict(docs_new)
print('Predicted labels: {}'.format(np.array(all_data.target_names)[y_predicted]))    
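
Since the question also mentions SVMs: swapping the Naive Bayes estimator for a linear SVM is a one-line change in the pipeline. A minimal sketch, assuming LinearSVC as the SVM implementation and reusing all_data and docs_new from above:

from sklearn.svm import LinearSVC

# Same vectorise-then-classify pipeline, but with a linear SVM instead of MultinomialNB
svm_pipeline = Pipeline([('vec', CountVectorizer()), ('svm', LinearSVC())])
svm_pipeline.fit(all_data.data, all_data.target)
print('Predicted labels: {}'.format(np.array(all_data.target_names)[svm_pipeline.predict(docs_new)]))

As with the MNB pipeline above, the fitted object can be dumped with joblib for later use.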
