
How to control the number of features [machine learning]?

I am writing this machine learning code (classification) to classify between two classes. I started by capturing one feature for all my images.

For example (note: the 1s and 0s are labels):

class A=[(4295046.0, 1), (4998220.0, 1), (4565017.0, 1), (4078291.0, 1), (4350411.0, 1), (4434050.0, 1), (4201831.0, 1), (4203570.0, 1), (4197025.0, 1), (4110781.0, 1), (4080568.0, 1), (4276499.0, 1), (4363551.0, 1), (4241573.0, 1), (4455070.0, 1), (5682823.0, 1), (5572122.0, 1), (5382890.0, 1), (5217487.0, 1), (4714908.0, 1), (4697137.0, 1), (4057898.0, 1), (4143981.0, 1), (3899129.0, 1), (3830584.0, 1), (3557377.0, 1), (3125518.0, 1), (3197039.0, 1), (3109404.0, 1), (3024219.0, 1), (3066759.0, 1), (2726363.0, 1), (3507626.0, 1), .....etc]

class B=[(7179088.0, 0), (7144249.0, 0), (6806806.0, 0), (5080876.0, 0), (5170390.0, 0), (5694876.0, 0), (6210510.0, 0), (5376014.0, 0), (6472171.0, 0), (7112956.0, 0), (7356507.0, 0), (9180030.0, 0), (9183460.0, 0), (9212517.0, 0), (9055663.0, 0), (9053709.0, 0), (9103067.0, 0), (8889903.0, 0), (8328604.0, 0), (8475442.0, 0), (8499221.0, 0), (8752169.0, 0), (8779133.0, 0), (8756789.0, 0), (8990732.0, 0), (9027381.0, 0), (9090035.0, 0), (9343846.0, 0), (9518609.0, 0), (9435149.0, 0), (9365842.0, 0), (9395256.0, 0), (4381880.0, 0), (4749338.0, 0), (5296143.0, 0), (5478942.0, 0), (5610865.0, 0), (5514997.0, 0), (5381010.0, 0), (5090416.0, 0), (4663958.0, 0), (4804526.0, 0), (4743107.0, 0), (4898914.0, 0), (5018503.0, 0), (5778240.0, 0), (5741893.0, 0), (4632926.0, 0), (5208486.0, 0), (5633403.0, 0), (5699410.0, 0), (5748260.0, 0), (5869260.0, 0), ....etc]

data is A and B combined:

x = [[each[0]] for each in data]
y = [[each[1]] for each in data]
print (len(x), len(y))

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, 
random_state=42)
print (len(x_train), len(x_test))
print (len(y_train), len(y_test))

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)
clf.fit(x_train, y_train)

Question:

What do I change to add another feature? How should A and B look when adding the feature, and do I change this line

clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)

when using two features?

My guess:

class A=[(4295046.0, second_feature, 1), (4998220.0, second_feature, 1), (4565017.0, second_feature, 1), (4078291.0, second_feature, 1), (4350411.0, second_feature, 1), (4434050.0, second_feature, 1), ......] — is that right? Is there a better way?

This model doesn't need the number of features explicitly.
If the class is always the last element in each tuple in the data, you can do:

# everything but the last element is a feature; the last element is the label
x = [list(each[:-1]) for each in data]
y = [each[-1] for each in data]

And just carry on the same way from there.
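For instance, with a hypothetical second feature added to each tuple (the second-feature values below are invented purely for illustration), the same slicing handles both columns without any other change:

```python
# Sketch: each sample is (feature_1, feature_2, label).
# The second-feature values here are made up for illustration.
data = [
    (4295046.0, 0.12, 1),
    (4998220.0, 0.34, 1),
    (7179088.0, 0.56, 0),
    (7144249.0, 0.78, 0),
]

# All elements except the last are features; the last is the label.
x = [list(each[:-1]) for each in data]
y = [each[-1] for each in data]

print(x)  # [[4295046.0, 0.12], [4998220.0, 0.34], [7179088.0, 0.56], [7144249.0, 0.78]]
print(y)  # [1, 1, 0, 0]
```

The same pattern works for three or more features, since the slice `each[:-1]` always takes everything up to the label.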

The idea of a random forest is that you average lots of simple models. That means no matter how many features you have, your trees should not be too deep. If you have lots of features and use lots of trees, you can try increasing the depth, but in general, for random forests the trees should be shallow. Experiment and try it out!

As an example:

https://medium.com/all-things-ai/in-depth-parameter-tuning-for-random-forest-d67bb7e920d

In that experiment there were over 900 data points and nine features. They tested values of max_depth between 1 and 32, and from the results I would say around 5 was best. But this could differ depending on the data set and features in question.
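One way to run that kind of experiment on your own data is a small cross-validated grid search over `max_depth`. A sketch below uses synthetic data in place of the real image features, and the candidate depths are just a starting range, not a recommendation:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the real data: 900 samples, 9 features,
# roughly mirroring the experiment in the linked article.
X, y = make_classification(n_samples=900, n_features=9, random_state=0)

# Cross-validate several candidate depths instead of fixing max_depth=2.
search = GridSearchCV(
    RandomForestClassifier(n_estimators=100, random_state=0),
    param_grid={"max_depth": [1, 2, 4, 8, 16, 32]},
    cv=5,
)
search.fit(X, y)
print(search.best_params_)
```

Whichever depth wins on your data, keep in mind the averaging argument above: shallow trees are usually enough.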
