Scikit Learn Multilabel Classification Using Out Of Core

I am new to scikit-learn and am working on a project involving multilabel classification of about 70,000 webpages (a ~250MB file). Due to the size of the file, I have to use out-of-core classification. The labels for these pages are dmoz categories, so each page can have multiple labels.

I created the code below by adapting scikit-learn's out-of-core classification example. However, the code prints only one label for each document.

1) Is there some way I can print the top 5 labels for each document, ranked by probability? I would appreciate any pointers/modifications to the code.

2) What would be a good classifier that supports multilabel classification for this task, given that OneVsRestClassifier doesn't provide a partial_fit method?
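One common workaround, sketched below under the assumption that each mini-batch comes with a binary indicator matrix Y_batch (one column per label, e.g. from MultiLabelBinarizer), is to keep one partial_fit-capable binary classifier per label:

from sklearn.linear_model import SGDClassifier

# One binary SGDClassifier per label; each supports partial_fit, so together
# they behave like one-vs-rest while still training out of core.
n_labels = 10  # hypothetical number of distinct labels
estimators = [SGDClassifier() for _ in range(n_labels)]

def partial_fit_ovr(estimators, X_batch, Y_batch):
    # Y_batch[:, j] is the 0/1 column for label j in this mini-batch.
    for j, est in enumerate(estimators):
        est.partial_fit(X_batch, Y_batch[:, j], classes=[0, 1])

Each estimator's decision_function can later be used to rank labels per document.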

The text inside file_training_combined.csv looks like the following:

"http://home.earthlink.net/~rvbears/","RV Resources - Camping Information - RV Accessories","","","","","RV Resources - Camping Information - RV Accessories RV Resources\, Camping Resources\, Camping Information  RV\, Camping Resources and Information! For Campers\, Travel Trailers\, Motorhome and Fifth Wheels Owners  Camping Games  Camping Recipes  Camping Cooking Supplies  RV Books  RV E-Books  RV Videos/DVD  RV Links   Looking for rv and camping information\, this is it! Check in here for lots of great resources and information especially for newbies. From Camping Gear\, to RV Books\, E-Books\, and Videos our pages are filled with information about everything to do with Camping and RVing to get you headed in the right direction\, from companies you can trust. Refer to the RV Links section for lots of camping gear and rv accessories\, find just about anything that you are looking for. Coming Back Soon....Our ""PRODUCT REVIEWS BLOG"" Will we be returning to reviewing our best bets on some of the newest camping gadgets for inside and outside your rv or tent.      Emergency medical & travel assistance for less than 22 cents a day. Good Sam TravelAssist. Learn More! With over 2 million rescues and recoveries and counting\, Good Sam Roadside Assistance gives our members peace of mind when they travel.  RV Accessories\, RV Decor\, RV Books\, RV E-books\, RV Videos\, RV DVDs RV Resources\, Camping Resources\, Camping Information NOTE: RV Ladders Bears are now SOLD OUT Home | Woodworking Links | Link To Us Copyright  2002-2014 GoCampin'. All Rights Reserved. Go Campin' ~ PO BOX 25417 ~ Greenville\, SC 29616-0417","/Top/Shopping/Crafts/Woodcraft/Decorative|/Top/Shopping/Crafts/Woodcraft/HomeDecor"

This is just one line from the CSV file. I am using the text in column 6; the labels are in column 7, separated by |.

import codecs
import itertools
import time
import csv
import sys
import re

from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MultiLabelBinarizer
import numpy as np
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords

__author__ = 'prateek.jain'

csv.field_size_limit(sys.maxsize)

sep = b","
quote_char = b'"'

stop = stopwords.words('english')
porter = PorterStemmer()

text_rows = []

text_labels = []

training_file_object = codecs.open('file_training_combined.csv','r', 'utf-8')
wr1 = csv.reader(training_file_object, dialect='excel', quotechar=quote_char, quoting=csv.QUOTE_ALL, delimiter=sep)

output_file = 'output.csv'
output_file_object = open(output_file, 'w')

# First pass: collect every label so the binarizer's class set can be fixed in advance
for row in wr1:
    text_rows.append(row[6])
    labels = row[7].strip().split('|')
    empty_list = []
    for label in labels:
        # skip labels that look like URLs
        if not ('http:' in label.lower() or 'www.' in label.lower()):
            empty_list.append(label)
    text_labels.append(empty_list)


def tokenizer(text):
    # strip HTML, keep emoticons, lower-case, drop stop words, then stem
    text = re.sub('<[^>]*>', '', text)
    emoticons = re.findall('(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    text = re.sub('[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    text = [w for w in text.split() if w not in stop]
    tokenized = [porter.stem(w) for w in text]
    return tokenized


# dialect='excel'
def stream_docs(path):
    # Generator yielding one (text, labels) pair per document
    training_file_object = codecs.open(path, 'r', 'utf-8')
    wr1 = csv.reader(training_file_object, dialect='excel', quotechar=quote_char, quoting=csv.QUOTE_ALL, delimiter=sep)
    print(next(wr1))  # consumes (and echoes) the first row
    for row in wr1:
        text, label = row[6], row[7]
        labels = label.split('|')
        empty_list = []
        for label in labels:
            # skip labels that look like URLs
            if not ('http:' in label.lower() or 'www.' in label.lower()):
                empty_list.append(label)
        yield text, empty_list


def get_minibatch(doc_stream, size):
    # Pull the next `size` documents and their label lists from the stream
    docs, y = [], []
    for _ in range(size):
        text, label = next(doc_stream)
        docs.append(text)
        y.append(label)
    return docs, y


from sklearn.feature_extraction.text import HashingVectorizer

# non_negative=True keeps the hashed features non-negative, as MultinomialNB requires
vect = HashingVectorizer(decode_error='ignore',
                         n_features=2 ** 10,
                         preprocessor=None,
                         lowercase=True,
                         tokenizer=tokenizer,
                         non_negative=True)


clf = MultinomialNB()
doc_stream = stream_docs(path='file_training_combined.csv')

# Fix the full label set up front so every mini-batch is binarized consistently
merged = list(itertools.chain(*text_labels))
all_class_labels = np.array(list(set(merged)))
mlb = MultiLabelBinarizer(classes=all_class_labels)

# Hold out the first 1,000 documents as a test batch
X_test_text, y_test = get_minibatch(doc_stream, 1000)

X_test = vect.transform(X_test_text)
tick = time.time()
total_fit_time = 0
n_train_pos = 0
for _ in range(45):
    X_train, y_train = get_minibatch(doc_stream, size=1000)
    X_train_matrix = vect.transform(X_train)  # HashingVectorizer is stateless; no fitting needed
    y_train = mlb.fit_transform(y_train)
    print X_train_matrix.shape, ' ', y_train.shape
    clf.partial_fit(X_train_matrix.toarray(), y_train, classes=all_class_labels)
    total_fit_time += time.time() - tick
    n_train_pos += sum(y_train)
    tick = time.time()

all_labels = clf.predict(X_test)


for item, labels in zip(X_test_text, all_labels):  # predictions line up with the test batch
    print '%s => %s' % (item, labels)
    output_file_object.write('%s => %s' % (item, labels) + '\n')

output_file_object.close()

With only 250MB there is really no reason to go out of core. Or do you have less than 250MB of RAM? For getting the top k predictions, you can use predict_proba or decision_function to find how likely each label is.
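For illustration, a minimal sketch of that top-k idea (not from the original answer; it assumes the data fits in memory and reuses the X_train_matrix, y_train, X_test, and all_class_labels names from the question):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Fit one binary classifier per label, entirely in memory.
ovr = OneVsRestClassifier(LogisticRegression())
ovr.fit(X_train_matrix, y_train)  # y_train: binary indicator matrix from MultiLabelBinarizer

# predict_proba gives one probability per label; keep the 5 most likely per document.
probas = ovr.predict_proba(X_test)                  # shape: (n_documents, n_labels)
top5 = np.argsort(probas, axis=1)[:, -5:][:, ::-1]  # column indices, most likely first
for label_indices in top5:
    print(all_class_labels[label_indices])

With decision_function, the same argsort trick ranks labels by margin instead of probability.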
