
How to control the number of features [machine learning]?

I am writing machine learning code (classification) to classify between two classes. I started by capturing one feature from all my images.

for example: (note: 1 & 0 are for labeling) class A=[(4295046.0, 1), (4998220.0, 1), (4565017.0, 1), (4078291.0, 1), (4350411.0, 1), (4434050.0, 1), (4201831.0, 1), (4203570.0, 1), (4197025.0, 1), (4110781.0, 1), (4080568.0, 1), (4276499.0, 1), (4363551.0, 1), (4241573.0, 1), (4455070.0, 1), (5682823.0, 1), (5572122.0, 1), (5382890.0, 1), (5217487.0, 1), (4714908.0, 1), (4697137.0, 1), (4057898.0, 1), (4143981.0, 1), (3899129.0, 1), (3830584.0, 1), (3557377.0, 1), (3125518.0, 1), (3197039.0, 1), (3109404.0, 1), (3024219.0, 1), (3066759.0, 1), (2726363.0, 1), (3507626.0, 1), .....etc]

class B=[(7179088.0, 0), (7144249.0, 0), (6806806.0, 0), (5080876.0, 0), (5170390.0, 0), (5694876.0, 0), (6210510.0, 0), (5376014.0, 0), (6472171.0, 0), (7112956.0, 0), (7356507.0, 0), (9180030.0, 0), (9183460.0, 0), (9212517.0, 0), (9055663.0, 0), (9053709.0, 0), (9103067.0, 0), (8889903.0, 0), (8328604.0, 0), (8475442.0, 0), (8499221.0, 0), (8752169.0, 0), (8779133.0, 0), (8756789.0, 0), (8990732.0, 0), (9027381.0, 0), (9090035.0, 0), (9343846.0, 0), (9518609.0, 0), (9435149.0, 0), (9365842.0, 0), (9395256.0, 0), (4381880.0, 0), (4749338.0, 0), (5296143.0, 0), (5478942.0, 0), (5610865.0, 0), (5514997.0, 0), (5381010.0, 0), (5090416.0, 0), (4663958.0, 0), (4804526.0, 0), (4743107.0, 0), (4898914.0, 0), (5018503.0, 0), (5778240.0, 0), (5741893.0, 0), (4632926.0, 0), (5208486.0, 0), (5633403.0, 0), (5699410.0, 0), (5748260.0, 0), (5869260.0, 0), ....etc]

data is A and B combined

x = [[each[0]] for each in data]
y = [each[1] for each in data]  # 1-D labels, as scikit-learn expects
print (len(x), len(y))

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42)
print (len(x_train), len(x_test))
print (len(y_train), len(y_test))

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)
clf.fit(x_train, y_train)

Question:

What do I need to change to add another feature? How should A and B look after adding the feature, and do I need to change this line

clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)

when using two features?

My guess:

class A=[(4295046.0, second_feature, 1), (4998220.0, second_feature, 1), (4565017.0, second_feature, 1), (4078291.0, second_feature, 1), (4350411.0, second_feature, 1), (4434050.0, second_feature, 1), ......] Is that right? Is there a better way?

This model doesn't need the number of features specified explicitly.
If the class is always the last element of each tuple in the data, you can do:

x = [list(each[:-1]) for each in data]
y = [each[-1] for each in data]

And just carry on the same from there.
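Here is a minimal sketch of what that looks like end to end. The second-feature values below are made up for illustration (the original question only shows one real feature), but the slicing, the split, and the classifier call are exactly the ones above — nothing about `RandomForestClassifier` changes when a feature is added:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Each tuple: feature 1, feature 2 (hypothetical values), then the label last.
class_a = [(4295046.0, 12.5, 1), (4998220.0, 13.1, 1), (4565017.0, 11.8, 1)]
class_b = [(7179088.0, 20.4, 0), (7144249.0, 19.7, 0), (6806806.0, 21.0, 0)]
data = class_a + class_b

# Everything except the last element is a feature; the last is the label.
x = [list(each[:-1]) for each in data]
y = [each[-1] for each in data]

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42)

# Same constructor as before -- the number of features is inferred from x.
clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)
clf.fit(x_train, y_train)
print(clf.n_features_in_)  # the 2 features were inferred from the data
```

The same slicing works unchanged for three or more features, as long as the label stays last in each tuple.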

The idea of a random forest is that you average many simple models. That means that no matter how many features you have, your trees should not be too deep. If you have many features and use many trees, you can try increasing the depth, but in general the trees in a random forest should be shallow. Experiment and try it out!

As an example:

https://medium.com/all-things-ai/in-depth-parameter-tuning-for-random-forest-d67bb7e920d

In that experiment there were over 900 data points and nine features. They tested max_depth values between 1 and 32, and from the results I would say around 5 was best. But this could differ depending on the data set and the features in question.
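One way to run that kind of experiment yourself is a simple cross-validated sweep over max_depth. The dataset below is synthetic (generated to roughly match the cited experiment's scale of ~900 points and nine features); on your own data you would pass your x and y instead:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Stand-in dataset roughly matching the cited experiment's scale.
X, y = make_classification(n_samples=900, n_features=9, random_state=0)

# Mean 5-fold cross-validation accuracy for each candidate depth.
scores = {}
for depth in [1, 2, 4, 8, 16, 32]:
    clf = RandomForestClassifier(n_estimators=100, max_depth=depth,
                                 random_state=0)
    scores[depth] = cross_val_score(clf, X, y, cv=5).mean()
    print(depth, round(scores[depth], 3))
```

Picking the depth with the best cross-validated score guards against the overfitting that very deep trees can cause.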
