
I would like to use a feature set (vector) for my data in Python for my machine learning algorithm. How can I do it?

I have data in the following form

   Class           Feature set list
   classlabel1 -    [size,time]      example:[6780.3,350.00]
   classlabel2 -    [size,time]
   classlabel3 -    [size,time]
   classlabel4 -    [size,time]

How do I save this data in an Excel sheet, and how can I train the model using this feature set? Currently I am working on an SVM classifier.

I have tried saving the feature set list in a dataframe and saving this dataframe to a csv file. But the size and time are getting split into two different columns.

The dataframe is getting saved in csv file in the following way:

col 0    col1        col2
62309   396.5099154  label1

I would like to train and test on the feature vector [size,time] combined. Is it possible, and is this the right way? If it is possible, how can I do it?

Since size and time are different features, you should keep them in two separate columns so that your model can assign a separate weight to each of them, i.e.

# data.csv
size      time      label
6780.3    350.00    classlabel1
...

If you want to transform the data you have into the format above, you could load it with pandas.read_excel and use ast.literal_eval to turn each string such as "[6780.3, 350.00]" into a Python list object.

import pandas as pd
import ast

df = pd.read_excel("data.xlsx")
# Parse each "[size, time]" string into a Python list (one parse per row)
size_time = [ast.literal_eval(x) for x in df["Feature set list"]]

size = [x[0] for x in size_time]
time = [x[1] for x in size_time]
label = df["Class"]

new_df = pd.DataFrame({"size":size, "time":time, "label":label})
# This will result in the DataFrame below.
#   size  time        label
# 6780.3 350.0  classlabel1

# Save DataFrame to csv
# index=False keeps the row index from being written as an extra column
new_df.to_csv("data_fix.csv", index=False)

# Use it
x = new_df.drop("label", axis=1)
y = new_df.label

# Further data preparation, such as splitting the dataset
# into train and test sets, etc.
...
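For anyone who wants to try the conversion without an Excel file, here is a minimal sketch of the same idea on an in-memory DataFrame. The rows are made-up stand-ins for the sheet, and it assumes every cell in "Feature set list" is a string holding exactly two numbers.

```python
import ast

import pandas as pd

# Hypothetical stand-in for the sheet: each feature list is stored as a string.
df = pd.DataFrame({
    "Class": ["classlabel1", "classlabel2"],
    "Feature set list": ["[6780.3, 350.00]", "[5120.0, 410.25]"],
})

# Parse each string into a real list, then spread the lists over two columns.
parsed = df["Feature set list"].apply(ast.literal_eval)
features = pd.DataFrame(parsed.tolist(), columns=["size", "time"], index=df.index)

# Attach the class label as a third column.
new_df = features.assign(label=df["Class"])
print(new_df)
```

If the column already holds real lists rather than strings, the ast.literal_eval step can be dropped and the lists passed to pd.DataFrame directly.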

Hope this helps

First, responding to your question:

I would like to train and test on the feature vector [size,time] combined. Is it possible, and is this the right way? If it is possible, how can I do it?

Combining the two into a single value is not the right thing to do. They are on different scales (if they are what their names suggest), and merging them would lose the information that each one provides. For any supervised ML algorithm they are two independent features, so I would suggest treating them as two separate features rather than combining them into one.
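To make this concrete: with two separate columns, each row of the feature matrix already is the feature vector [size, time], so nothing needs to be merged. The sketch below uses made-up numbers and, because the two features are on different scales, puts a StandardScaler in front of the SVM. The scaler is my own addition here, not something the question requires.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Made-up samples: each row is one feature vector [size, time].
X = np.array([
    [6780.3, 350.0],
    [7010.0, 365.5],
    [6500.8, 340.2],
    [1200.5, 90.0],
    [1350.0, 110.4],
    [1100.2, 85.7],
])
y = np.array(["label1", "label1", "label1", "label2", "label2", "label2"])

# Scale both columns to comparable ranges, then fit the SVM on the scaled data.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma="scale"))
model.fit(X, y)

# A new sample is passed in the same [size, time] layout.
prediction = model.predict([[6900.0, 355.0]])
print(prediction)
```

The classifier sees both values of every row together, which is what training on the "combined" vector amounts to in practice.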

Now let's move on to the next part:

How do I save this data in an Excel sheet, and how can I train the model using this feature set? Currently I am working on an SVM classifier.

  1. Storing data: In my opinion you can store the data in whichever format you want, but I would prefer CSV because it is convenient and the file loads quickly.

sample_data.csv

size,time,class_label
100,150,label1
200,250,label2
240,180,label1
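A file in this layout can be produced from a DataFrame with to_csv. The following is only a sketch: it writes to an in-memory buffer so that it runs anywhere, and the buffer stands in for a real path such as "sample_data.csv".

```python
import io

import pandas as pd

# Made-up rows in (size, time, class_label) order.
rows = [(100, 150, "label1"), (200, 250, "label2"), (240, 180, "label1")]
data = pd.DataFrame(rows, columns=["size", "time", "class_label"])

# index=False keeps the row index out of the file.
buffer = io.StringIO()  # stands in for a path such as "sample_data.csv"
data.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back gives the same three columns.
loaded = pd.read_csv(io.StringIO(csv_text))
```

If an actual Excel file is required, DataFrame.to_excel can be used in the same way, provided an Excel engine such as openpyxl is installed.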

Below is the code for reading the data from the CSV file and training an SVM:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load the data. error_bad_lines / warn_bad_lines were removed in pandas 2.0;
# malformed lines raise an error by default.
data = pd.read_csv("sample_data.csv")

# Separate the label column (dependent) from the feature columns (independent)
Y = data["class_label"].values
X = data.drop("class_label", axis=1).values

# encode the class column values
label_encoded_Y = preprocessing.LabelEncoder().fit_transform(list(Y))

# split training and testing data
x_train, x_test, y_train, y_test = train_test_split(
    X, label_encoded_Y, train_size=0.8, test_size=0.2)

# Now use whichever training algorithm you want
clf = SVC(gamma='auto')
clf.fit(x_train, y_train)

# Using the predictor
y_pred = clf.predict(x_test)

# Check accuracy on the held-out test set
print(accuracy_score(y_test, y_pred))
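One detail worth noting: because the labels were encoded as integers, the predictions come back as integers too. The code above does not keep a reference to the LabelEncoder, so here is a small sketch (with made-up data) of how holding on to the encoder lets you map predictions back to the original class names.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC

# Made-up training data: each row is [size, time].
X = np.array([[100, 150], [120, 160], [240, 180], [260, 200]])
labels = ["label1", "label1", "label2", "label2"]

# Keep the encoder object so predictions can be decoded later.
encoder = LabelEncoder()
y = encoder.fit_transform(labels)  # class names -> integers

clf = SVC(gamma="auto")
clf.fit(X, y)

# Predictions are integers; inverse_transform turns them back into class names.
y_pred = clf.predict(X)
decoded = encoder.inverse_transform(y_pred)
print(decoded)
```

SVC also accepts string labels directly, so the encoding step is optional for this classifier.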
