Merge numeric and text features for category classification

I'm trying to classify product items in order to predict their category based on the product title and their base price.

An example (product title, price, category):

['notebook sony vaio vgn-z770td dockstation', 3000.0, u'MLA54559']

Previously I was only using product title for the prediction task but I'd like to include the price to see if the accuracy improves.

The problem with my code is that I can't merge the text and numeric features. I've been reading some questions here on SO, and this is my code excerpt:

import numpy as np
from scipy import sparse
from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# extracting features from text
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform([e[0] for e in training_set])
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

# extracting numerical features
X_train_price = np.array([e[1] for e in training_set])

X = sparse.hstack([X_train_tfidf, X_train_price])  # this is where the problem begins

clf = svm.LinearSVC().fit(X, [e[2] for e in training_set])

I try to merge the data types with sparse.hstack but I get the following error:

ValueError: blocks[0,:] has incompatible row dimensions

I guess the problem lies in X_train_price (a 1-d array of prices), but I don't know how to format it so that sparse.hstack works.

These are the shapes of both arrays:

>>> X_train_tfidf.shape
(65845, 23136)
>>> X_train_price.shape
(65845,)

It looks to me like this should be as simple as stacking the arrays. If scikit-learn follows the conventions I'm familiar with, then each row in X_train_tfidf is a training datapoint, and there are a total of 65845 points. So you just have to do an hstack -- as you said you tried to do.

However, you need to make sure the dimensions are compatible! In vanilla numpy you get this error otherwise:

>>> a = numpy.arange(15).reshape(5, 3)
>>> b = numpy.arange(15, 20)
>>> numpy.hstack((a, b))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/
        Extras/lib/python/numpy/core/shape_base.py", line 270, in hstack
    return _nx.concatenate(map(atleast_1d,tup),1)
ValueError: arrays must have same number of dimensions

Reshape b to have the correct dimensions -- noting that a 1-d array of shape (5,) is totally different from a 2-d array of shape (5, 1).

>>> b
array([15, 16, 17, 18, 19])
>>> b.reshape(5, 1)
array([[15],
       [16],
       [17],
       [18],
       [19]])
>>> numpy.hstack((a, b.reshape(5, 1)))
array([[ 0,  1,  2, 15],
       [ 3,  4,  5, 16],
       [ 6,  7,  8, 17],
       [ 9, 10, 11, 18],
       [12, 13, 14, 19]])
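The reshape does not have to hard-code the row count. As a minimal sketch (the names col_reshape, col_newaxis and stacked are only illustrative), either of the following turns a 1-d array into a single column:

```python
import numpy as np

a = np.arange(15).reshape(5, 3)
b = np.arange(15, 20)          # shape (5,)

# Two equivalent ways to get shape (5, 1) from shape (5,)
col_reshape = b.reshape(-1, 1)   # -1 lets numpy infer the row count
col_newaxis = b[:, np.newaxis]   # add a trailing axis of length 1

# Both columns now have 2 dimensions, so hstack accepts them
stacked = np.hstack((a, col_reshape))
print(col_reshape.shape, stacked.shape)
```

Using -1 means the same line keeps working if the number of training rows changes.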

So in your case, you want an array of shape (65845, 1) instead of (65845,). I might be missing something because you are using sparse matrices. Nonetheless, the principle ought to be the same. I can't tell from the code above which sparse format you're using, so I just picked one to test:

>>> a = scipy.sparse.lil_matrix(numpy.arange(15).reshape(5, 3))
>>> scipy.sparse.hstack((a, b.reshape(5, 1))).toarray()
array([[ 0,  1,  2, 15],
       [ 3,  4,  5, 16],
       [ 6,  7,  8, 17],
       [ 9, 10, 11, 18],
       [12, 13, 14, 19]])
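
Applied to the variables in the question, the same fix might look like the sketch below. The data here is made up: a small random sparse matrix stands in for the real X_train_tfidf, and four invented prices stand in for X_train_price, so only the shapes mirror the original problem.

```python
import numpy as np
from scipy import sparse

# Stand-in for X_train_tfidf: 4 rows (items) by 6 text features, sparse.
X_train_tfidf = sparse.random(4, 6, density=0.5, format="csr", random_state=0)

# Stand-in for the prices: a 1-d array of shape (4,), one price per item.
X_train_price = np.array([3000.0, 150.0, 89.9, 1200.0])

# Reshape to a column of shape (4, 1) so the row dimensions match.
price_column = X_train_price.reshape(-1, 1)

# Stack the price column to the right of the text features,
# then convert to CSR, which supports row slicing.
X = sparse.hstack([X_train_tfidf, price_column]).tocsr()

print(X.shape)  # (4, 7)
```

The resulting X has one extra column holding the price and can be passed to the classifier's fit call in place of the original X.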
