
scikit-learn add training data


I was looking at the training data available in sklearn here. According to the documentation, it contains documents from 20 categories based on a newsgroup collection, and it classifies documents belonging to those categories quite well. However, I need to add more articles for categories such as cricket, football, nuclear physics, etc.

I already have a set of documents prepared for each class, e.g. sports -> cricket, cooking -> French, etc. How do I add these documents and classes to sklearn so that the interface that currently returns 20 classes will return those 20 classes plus the new ones as well? If I need to do some training with SVM or Naive Bayes, where do I do that before adding them to the dataset?

Assuming your additional data has the following directory structure (if it doesn't, this should be your first step, since you can then use the sklearn API to fetch the data, which will make your life a lot easier, see here):

additional_data
      |
      |-> sports.cricket
                |
                |-> file1.txt
                |-> file2.txt
                |-> ...
      |
      |-> cooking.french
                |
                |-> file1.txt
                |-> ...
       ...
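
If your extra documents are not already arranged like this, a minimal sketch along the following lines can write them into that layout (the raw_docs list and the example category names are hypothetical placeholders for your own data):

import os

# Hypothetical (text, label) pairs standing in for your own documents
raw_docs = [
    ('Kohli scored a century in the final innings...', 'sports.cricket'),
    ('Start the sauce by whisking a roux...', 'cooking.french'),
]

additional_data_root = '/path/to/additional_data'

for i, (text, label) in enumerate(raw_docs):
    label_dir = os.path.join(additional_data_root, label)
    os.makedirs(label_dir, exist_ok=True)  # one sub directory per class
    with open(os.path.join(label_dir, 'file{}.txt'.format(i)), 'w', encoding='utf-8') as out_file:
        out_file.write(text)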

Moving on to python, load the two datasets (assuming your additional data is in the format described above and rooted at /path/to/additional_data):

import os

# joblib is now a standalone package; sklearn.cross_validation and sklearn.externals.joblib are deprecated
import joblib
import numpy as np

from sklearn.datasets import fetch_20newsgroups
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Note if you have a pre-defined training/testing split in your additional data, you would merge them with the corresponding 'train' and 'test' subsets of 20news
news_data = fetch_20newsgroups(subset='all')
additional_data = load_files(container_path='/path/to/additional_data', encoding='utf-8')

# Both data objects are of type `Bunch` and therefore can be relatively straightforwardly merged

# Merge the two data files
'''
The Bunch object contains the following attributes: `dict_keys(['target_names', 'description', 'DESCR', 'target', 'data', 'filenames'])`
The interesting ones for our purposes are 'data' and 'filenames'
'''
all_filenames = np.concatenate((news_data.filenames, additional_data.filenames)) # filenames is a numpy array
all_data = news_data.data + additional_data.data # data is a standard python list

merged_data_path = '/path/to/merged_data'

'''
The 20newsgroups data has a filename a la '/path/to/scikit_learn_data/20news_home/20news-bydate-test/rec.sport.hockey/54367'
So depending on whether you want to keep the sub directory structure of the train/test splits or not, 
you would either need the last 2 or 3 parts of the path
'''
for content, f in zip(all_data, all_filenames):
    # extract sub path
    sub_path, filename = f.split(os.sep)[-2:]

    # Create the output directory if it does not exist yet
    p = os.path.join(merged_data_path, sub_path)
    os.makedirs(p, exist_ok=True)

    # Write the document content to file (explicit encoding avoids platform-dependent failures)
    with open(os.path.join(p, filename), 'w', encoding='utf-8') as out_file:
        out_file.write(content)

# Now that everything is stored at `merged_data_path`, we can use `load_files` to fetch the dataset again, which now includes everything from 20newsgroups and your additional data
all_data = load_files(container_path=merged_data_path, encoding='utf-8')

'''
all_data is yet another `Bunch` object:
    * `data` contains the data
    * `target_names` contains the label names
    * `target` contains the labels in numeric format
    * `filenames` contains the paths of each individual document

thus, running a classifier over the data is straightforward
'''
vec = CountVectorizer()
X = vec.fit_transform(all_data.data)

# We want to create a train/test split for learning and evaluating a classifier (supposing we haven't created a pre-defined train/test split encoded in the directory structure)
X_train, X_test, y_train, y_test = train_test_split(X, all_data.target, test_size=0.2)

# Create & fit the MNB model
mnb = MultinomialNB()
mnb.fit(X_train, y_train)

# Evaluate Accuracy
y_predicted = mnb.predict(X_test)

print('Accuracy: {}'.format(accuracy_score(y_test, y_predicted)))

# Alternatively, the vectorisation and learning can be packaged into a pipeline and serialised for later use
pipeline = Pipeline([('vec', CountVectorizer()), ('mnb', MultinomialNB())])

# Run the vectorizer and train the classifier on all available data
pipeline.fit(all_data.data, all_data.target)

# Serialise the classifier to disk
joblib.dump(pipeline, '/path/to/model_zoo/mnb_pipeline.joblib')

# If you get more documents to classify later on, you can deserialise the trained pipeline and run the new documents through it
p = joblib.load('/path/to/model_zoo/mnb_pipeline.joblib')

docs_new = ['God is love', 'OpenGL on the GPU is fast']

y_predicted = p.predict(docs_new)
print('Predicted labels: {}'.format(np.array(all_data.target_names)[y_predicted]))    
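
Since the question also mentions SVMs: swapping the Naive Bayes estimator for a linear SVM is a one-line change in the pipeline. A minimal sketch, assuming LinearSVC as the SVM implementation and reusing all_data and docs_new from above:

from sklearn.svm import LinearSVC

# Same vectorise-then-classify pipeline, but with a linear SVM instead of MultinomialNB
svm_pipeline = Pipeline([('vec', CountVectorizer()), ('svm', LinearSVC())])
svm_pipeline.fit(all_data.data, all_data.target)
print('Predicted labels: {}'.format(np.array(all_data.target_names)[svm_pipeline.predict(docs_new)]))

As with the MNB pipeline above, the fitted object can be dumped with joblib for later use.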
