
Extract features from a text file and train a classifier on them

I need to organize data from a text file into features for a classifier. I have 3 features to train on, and I'm having trouble understanding the correct format for the feature variable.

from sklearn import tree
import os
import re

os.chdir(r"C:\ig_automation")
metrics_to_train = open('metrics_to_train.txt', 'r')
labels_to_train = open('labels_to_train.txt', 'r')
validation_metrics = open('validation_metrics.txt', 'r')
validation_labels = open('validation_labels.txt', 'r')

clf = tree.DecisionTreeClassifier()
features = metrics_to_train.read().replace("\n","").replace("   "," ").split(" ")
print(features)

Output:

['1434.0', '4000000.0', '33.0', '82.0', '39.0', '219.0', '634.0', '5506.0', '58.0', '106.0', '783.0', '332.0', '222.0', '413.0', '197.0', '112.0'......

The data is laid out as follows: feature 1 is the number of posts (position 0 = 1434), feature 2 is the number of followers (position 1 = 4000000), and feature 3 is the number of follows (position 2 = 33). This pattern repeats until the last value of the list.

I have to train the classifier with these features and get one label.

Also, in case there is a problem with how I've imported the data, here are some lines from the text file:

1434.0   4000000.0   33.0   
82.0   39.0   219.0   
634.0   5506.0   58.0   
106.0   783.0   332.0   
222.0   413.0   197.0   

I'm fairly new to ML, so I would really appreciate some advice. Thanks!

You need to transpose the feature matrix.

The reason is that scikit-learn estimators expect an input matrix X whose rows are the samples (subjects) and whose columns are the features (variables).
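As a minimal sketch of that layout, the first three rows of the question's file form a matrix of shape (3, 3): one row per account and one column per metric. The labels in `y` are hypothetical, made up only so that `fit` has something to train on.

```python
import numpy as np
from sklearn import tree

# Rows are samples (accounts); columns are features:
# number of posts, followers, number of follows.
X = np.array([
    [1434.0, 4000000.0, 33.0],
    [82.0, 39.0, 219.0],
    [634.0, 5506.0, 58.0],
])

# Hypothetical labels, one per row of X (not taken from the question).
y = np.array([1, 0, 1])

clf = tree.DecisionTreeClassifier()
clf.fit(X, y)

print(X.shape)  # (n_samples, n_features)
```

The number of rows in `X` has to match the number of labels in `y`; the number of columns is the number of features.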

From the documentation:

[image from the scikit-learn documentation]

So, a quick way to transpose the data is with NumPy:

import numpy as np

features = np.array(features)
X = features.T

clf.fit(X,....)
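Note that `.T` leaves a 1-D array unchanged, so for the flat list printed in the question a reshape is needed to get one row per sample. A minimal sketch, assuming the values always come in groups of three (posts, followers, follows); the short list below stands in for the full contents of the file:

```python
import numpy as np

# Flat list of strings, as printed in the question (first six values only).
features = ['1434.0', '4000000.0', '33.0', '82.0', '39.0', '219.0']

# Drop any empty strings left over from splitting, then convert to floats.
values = np.array([v for v in features if v], dtype=float)

# Transposing a 1-D array is a no-op: the shape stays (6,).
assert values.T.shape == values.shape

# Group every three consecutive values into one row (one sample).
X = values.reshape(-1, 3)
print(X.shape)  # (2, 3)
```

`X` can then be passed to `clf.fit` together with one label per row. For a whitespace-separated file like the one shown, `np.loadtxt('metrics_to_train.txt')` should also return this 2-D layout directly.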
