
Scikit Learn Multilabel Classification Using Out Of Core

I am new to scikit-learn and I am working on a project that involves multilabel classification of about 70,000 web pages in a ~250MB file. Because of the file size I have to use out-of-core classification. The labels for these pages are DMOZ categories, so each page can have multiple labels.

I created the code below by adapting scikit-learn's out-of-core example. However, the code prints only one label per document.

1) Is there a way to print the top 5 labels for each document, ranked by probability? I would appreciate any pointers/modifications to the code.

2) Given that OneVsRest does not provide a partial_fit method, what would be a good classifier that supports multilabel classification for this task?

The text in file_training_combined.csv looks like this:

"http://home.earthlink.net/~rvbears/","RV Resources - Camping Information - RV Accessories","","","","","RV Resources - Camping Information - RV Accessories RV Resources\, Camping Resources\, Camping Information  RV\, Camping Resources and Information! For Campers\, Travel Trailers\, Motorhome and Fifth Wheels Owners  Camping Games  Camping Recipes  Camping Cooking Supplies  RV Books  RV E-Books  RV Videos/DVD  RV Links   Looking for rv and camping information\, this is it! Check in here for lots of great resources and information especially for newbies. From Camping Gear\, to RV Books\, E-Books\, and Videos our pages are filled with information about everything to do with Camping and RVing to get you headed in the right direction\, from companies you can trust. Refer to the RV Links section for lots of camping gear and rv accessories\, find just about anything that you are looking for. Coming Back Soon....Our ""PRODUCT REVIEWS BLOG"" Will we be returning to reviewing our best bets on some of the newest camping gadgets for inside and outside your rv or tent.      Emergency medical & travel assistance for less than 22 cents a day. Good Sam TravelAssist. Learn More! With over 2 million rescues and recoveries and counting\, Good Sam Roadside Assistance gives our members peace of mind when they travel.  RV Accessories\, RV Decor\, RV Books\, RV E-books\, RV Videos\, RV DVDs RV Resources\, Camping Resources\, Camping Information NOTE: RV Ladders Bears are now SOLD OUT Home | Woodworking Links | Link To Us Copyright  2002-2014 GoCampin'. All Rights Reserved. Go Campin' ~ PO BOX 25417 ~ Greenville\, SC 29616-0417","/Top/Shopping/Crafts/Woodcraft/Decorative|/Top/Shopping/Crafts/Woodcraft/HomeDecor"

This is just one line of the CSV file. I am using the text in column 6; the labels are in column 7, separated by |.

import codecs
import itertools
import time
import csv
import sys
import re

from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MultiLabelBinarizer
import numpy as np
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords

__author__ = 'prateek.jain'

csv.field_size_limit(sys.maxsize)

sep = b","
quote_char = b'"'

stop = stopwords.words('english')
porter = PorterStemmer()

text_rows = []

text_labels = []

training_file_object = codecs.open('file_training_combined.csv','r', 'utf-8')
wr1 = csv.reader(training_file_object, dialect='excel', quotechar=quote_char, quoting=csv.QUOTE_ALL, delimiter=sep)

output_file = 'output.csv'
output_file_object = open(output_file, 'w')

for row in wr1:
    text_rows.append(row[6])
    labels = row[7].strip().split('|')
    empty_list = []
    for label in labels:
        if not ('http:' in label.lower() or 'www:' in label.lower()):
            empty_list.append(label)
    text_labels.append(empty_list)


def tokenizer(text):
    text = re.sub('<[^>]*>', '', text)
    emoticons = re.findall('(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    text = re.sub('[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    text = [w for w in text.split() if w not in stop]
    tokenized = [porter.stem(w) for w in text]
    return tokenized  # return the stemmed, stopword-filtered tokens


# dialect='excel'
def stream_docs(path):
    training_file_object = codecs.open(path, 'r', 'utf-8')
    wr1 = csv.reader(training_file_object, dialect='excel', quotechar=quote_char, quoting=csv.QUOTE_ALL, delimiter=sep)
    print(wr1.next())
    for row in wr1:
        text, label = row[6], row[7]
        labels = label.split('|')
        empty_list = []
        for label in labels:
            if not ('http:' in label.lower() or 'www:' in label.lower()):
                empty_list.append(label)
        yield text, empty_list


def get_minibatch(doc_stream, size):
    docs, y = [], []
    for _ in range(size):
        text, label = next(doc_stream)
        docs.append(text)
        y.append(label)
    return docs, y


from sklearn.feature_extraction.text import HashingVectorizer

vect = HashingVectorizer(decode_error='ignore',
                         n_features=2 ** 10,
                         preprocessor=None,
                         lowercase=True,
                         tokenizer=tokenizer,
                         non_negative=True, )


clf = MultinomialNB()
doc_stream = stream_docs(path='file_training_combined.csv')





merged = list(itertools.chain(*text_labels))
my_set = set(merged)

class_label_list = list(my_set)
all_class_labels = np.array(class_label_list)
mlb = MultiLabelBinarizer(all_class_labels)

X_test_text, y_test = get_minibatch(doc_stream, 1000)

X_test = vect.transform(X_test_text)

classes = np.array([0, 1])
tick = time.time()
accuracy = 0
total_fit_time = 0
n_train_pos = 0
for _ in range(45):
    X_train, y_train = get_minibatch(doc_stream, size=1000)
    X_train_matrix = vect.fit_transform(X_train)
    y_train = mlb.fit_transform(y_train)
    print X_train_matrix.shape, ' ', y_train.shape
    clf.partial_fit(X_train_matrix.toarray(), y_train, classes=all_class_labels)
    total_fit_time += time.time() - tick
    n_train = X_train_matrix.shape[0]
    n_train_pos += sum(y_train)
    tick = time.time()

predicted = clf.predict(X_test)
all_labels = predicted


for item, labels in zip(X_test_text, all_labels):  # pair each prediction with the test document it was made for
    print '%s => %s' % (item, labels)
    output_file_object.write('%s => %s' % (item, labels) + '\n')

With only 250MB there is really no reason to go out of core, or do you have less than 250MB of RAM? To get the top-k predictions, you can use predict_proba or decision_function to find out how likely each label is.
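
A minimal sketch of what the answer suggests, assuming you are willing to replace MultinomialNB with one binary SGDClassifier per label (binary relevance): each SGDClassifier supports partial_fit and, with logistic loss, predict_proba, so the labels can be ranked by probability. The tiny in-memory dataset, label names, and variable names below are illustrative stand-ins for the question's minibatches and MultiLabelBinarizer output, not taken from the original code:

# Sketch: binary relevance with incremental SGDClassifiers, then top-5 labels by predict_proba.
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Stand-ins for the streamed documents and the binarized label matrix.
train_texts = ["camping gear and rv accessories",
               "woodcraft home decor projects"]
label_names = np.array(["/Top/Shopping/Crafts/Woodcraft/Decorative",
                        "/Top/Shopping/Crafts/Woodcraft/HomeDecor",
                        "/Top/Recreation/Outdoors/Camping"])
Y_train = np.array([[0, 0, 1],   # one row per document,
                    [1, 1, 0]])  # one column per label

# alternate_sign=False keeps the hashed features non-negative
# (older scikit-learn releases spelled this non_negative=True).
vect = HashingVectorizer(n_features=2 ** 10, alternate_sign=False)
X_train = vect.transform(train_texts)

# One incremental binary classifier per label; loss="log_loss" enables predict_proba
# (on older scikit-learn the same loss is spelled loss="log").
clfs = [SGDClassifier(loss="log_loss", random_state=0) for _ in label_names]
for j, clf in enumerate(clfs):
    # partial_fit can be called once per minibatch; classes must be given on the first call.
    clf.partial_fit(X_train, Y_train[:, j], classes=[0, 1])

# Rank the labels of a new document by P(label = 1) and keep the top 5.
X_new = vect.transform(["rv camping information"])
scores = np.array([clf.predict_proba(X_new)[0, 1] for clf in clfs])
top_5 = label_names[np.argsort(scores)[::-1][:5]]
print(top_5)

Newer scikit-learn versions also expose partial_fit on OneVsRestClassifier itself, which would let you keep that wrapper instead of managing the per-label classifiers by hand; for hinge-loss models without predict_proba, ranking the decision_function scores works the same way.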
