I need to organize some data from a text file into features for a classifier. I have 3 features to train on, and I'm having trouble understanding the correct format for the feature variable.
from sklearn import tree
import os
import re
os.chdir(r"C:\ig_automation")
metrics_to_train = open('metrics_to_train.txt', 'r')
labels_to_train = open('labels_to_train.txt', 'r')
validation_metrics = open('validation_metrics.txt', 'r')
validation_labels = open('validation_labels.txt', 'r')
clf = tree.DecisionTreeClassifier()
features = metrics_to_train.read().replace("\n", " ").split(" ")
print(features)
Output:
['1434.0', '4000000.0', '33.0', '82.0', '39.0', '219.0', '634.0', '5506.0', '58.0', '106.0', '783.0', '332.0', '222.0', '413.0', '197.0', '112.0'......
The data is laid out as follows: feature 1 is the number of posts (position 0 = 1434), feature 2 is the number of followers (position 1 = 4000000), feature 3 is the number of follows (position 2 = 33), and the pattern repeats until the last value of the list.
I have to train the classifier with these features and get one label.
In case there is a problem with how I've imported the data, here are some lines from the text file:
1434.0 4000000.0 33.0
82.0 39.0 219.0
634.0 5506.0 58.0
106.0 783.0 332.0
222.0 413.0 197.0
I'm kind of new to ML, so I would really appreciate some advice. Thanks!
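You need to reshape the feature list into a 2-D matrix.
The reason is that all scikit-learn functions expect a matrix X as input, where the rows are the subjects (samples) and the columns are the features (variables).
From the documentation, the X passed to fit must be array-like of shape (n_samples, n_features).
The list printed above is flat and holds strings, so transposing it would change nothing. Since every 3 consecutive values belong to one sample, convert the values to floats and reshape them using numpy as a fast way:
import numpy as np
features = np.array(features, dtype=float)
X = features.reshape(-1, 3)  # one row per sample, 3 feature columns
clf.fit(X,....)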
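The layout described in the question (three consecutive values per sample) can also be handled without numpy. The following is only a minimal sketch: the helper name `to_rows` and its `n_features` parameter are made up for illustration, and it assumes every token in the flat list is a valid number.

```python
def to_rows(flat_values, n_features=3):
    """Group a flat sequence of numeric strings into rows of n_features floats."""
    if len(flat_values) % n_features != 0:
        raise ValueError("number of values is not a multiple of n_features")
    numbers = [float(v) for v in flat_values]
    # Slice the flat list into consecutive chunks, one chunk per sample.
    return [numbers[i:i + n_features] for i in range(0, len(numbers), n_features)]


# First six values from the output printed in the question.
flat = ['1434.0', '4000000.0', '33.0', '82.0', '39.0', '219.0']
X = to_rows(flat)
print(X)  # [[1434.0, 4000000.0, 33.0], [82.0, 39.0, 219.0]]
```

A list of rows like this has one entry per sample, so it should be usable as the X argument of clf.fit together with a label list of the same length.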