Extract features from a text file and train a classifier on them

Question

I need to organize some data from a text file into features for a classifier. I have 3 features to train on and I'm having trouble understanding the correct format for the feature variable.

from sklearn import tree
import os
import re

os.chdir(r"C:\ig_automation")
metrics_to_train = open('metrics_to_train.txt', 'r')
labels_to_train = open('labels_to_train.txt', 'r')
validation_metrics = open('validation_metrics.txt', 'r')
validation_labels = open('validation_labels.txt', 'r')

clf = tree.DecisionTreeClassifier()
features = metrics_to_train.read().replace("\n","").replace("   "," ").split(" ")
print(features)

Output:

['1434.0', '4000000.0', '33.0', '82.0', '39.0', '219.0', '634.0', '5506.0', '58.0', '106.0', '783.0', '332.0', '222.0', '413.0', '197.0', '112.0'......

The data is as follows: feature 1 is the number of posts (position 0 = 1434), feature 2 is the number of followers (position 1 = 4000000), feature 3 is the number of follows (position 2 = 33), and the pattern repeats until the last value of the list.

I have to train the classifier with these features and get one label.

In case there is a problem with how I've imported the data, here are some lines from the text file:

1434.0   4000000.0   33.0   
82.0   39.0   219.0   
634.0   5506.0   58.0   
106.0   783.0   332.0   
222.0   413.0   197.0

I'm kind of new to ML, so I would really appreciate some advice. Thanks!

Answer 1

You need to transpose the feature matrix.

The reason is that scikit-learn estimators expect the input matrix X to have one row per subject (sample) and one column per feature (variable).

The documentation describes X as an array of shape (n_samples, n_features).
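To make the expected layout concrete, here is a minimal sketch built from the five rows quoted in the question. The labels in `y` are invented placeholders (the question does not show the label file), so treat them as hypothetical.

```python
import numpy as np

# Five accounts (rows) with three features each (columns):
# posts, followers, follows. Values are the rows quoted in the question.
X = np.array([
    [1434.0, 4000000.0, 33.0],
    [82.0, 39.0, 219.0],
    [634.0, 5506.0, 58.0],
    [106.0, 783.0, 332.0],
    [222.0, 413.0, 197.0],
])

# Hypothetical labels, one per row; the real ones would come from
# labels_to_train.txt.
y = [1, 0, 1, 0, 0]

n_samples, n_features = X.shape
print(n_samples, n_features)  # 5 3

# If the data had been stored the other way round (features as rows),
# transposing would swap the two axes.
print(X.T.shape)  # (3, 5)
```

The number of labels has to match the number of rows in X, not the number of columns.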

So transpose the data; NumPy makes this quick:

import numpy as np

features = np.array(features)
X = features.T

clf.fit(X,....)
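One caveat: the list printed in the question is flat (one-dimensional) and holds strings, and `.T` leaves a one-dimensional array unchanged. A minimal sketch of one way to reach the three-column layout, assuming every sample contributes exactly three consecutive values, is to convert to float and reshape. The labels and the account passed to `predict` are hypothetical placeholders, not data from the question.

```python
import numpy as np
from sklearn import tree

# Flat list of strings, as printed in the question (first five samples only).
features = ['1434.0', '4000000.0', '33.0', '82.0', '39.0', '219.0',
            '634.0', '5506.0', '58.0', '106.0', '783.0', '332.0',
            '222.0', '413.0', '197.0']

# Convert to floats, skipping any empty strings the split may have produced.
values = [float(v) for v in features if v]
flat = np.array(values)

# Transposing a 1-D array is a no-op, so the shape is still (15,).
assert flat.T.shape == (15,)

# Group every three consecutive values into one row: (n_samples, 3).
X = flat.reshape(-1, 3)

# Hypothetical labels, one per row of X.
y = [1, 0, 1, 0, 0]

clf = tree.DecisionTreeClassifier(random_state=0)
clf.fit(X, y)

# Predict the label for a new, made-up account (posts, followers, follows).
prediction = clf.predict([[500.0, 1200.0, 150.0]])
print(X.shape, prediction.shape)
```

Since the file is already whitespace-separated with one sample per line, `np.loadtxt('metrics_to_train.txt')` would likely return the two-dimensional array directly and avoid the manual string handling.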

1 answer

solution1
0 ACCEPTED 2018-05-22 14:05:57