Unexpected exception when combining random forest trees

Using the information described in this question, Combining random forest models in scikit learn, I have attempted to combine several random forest classifiers into a single classifier using Python 2.7.10 and sklearn 0.16.1, but in some cases I get this exception:

    Traceback (most recent call last):
      File "sktest.py", line 50, in <module>
        predict(rf)
      File "sktest.py", line 46, in predict
        Y = rf.predict(X)
      File "/python-2.7.10/lib/python2.7/site-packages/sklearn/ensemble/forest.py", line 462, in predict
        proba = self.predict_proba(X)
      File "/python-2.7.10/lib/python2.7/site-packages/sklearn/ensemble/forest.py", line 520, in predict_proba
        proba += all_proba[j]
    ValueError: non-broadcastable output operand with shape (39,1) doesn't match the broadcast shape (39,2)

The application creates a number of random forest classifiers on many processors and combines these objects into a single classifier available to all processors.

The test code that produces this exception is shown below. It creates 5 classifiers, each trained on a random number of samples with 10 features. If yfrac is changed to 0.5, the code does not raise the exception. Is this a valid method of combining classifier objects? The same exception is also raised when warm_start is used to add trees to an existing RandomForestClassifier by increasing n_estimators and fitting on more data (a sketch of that variant follows the code below).

from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import train_test_split
from numpy import zeros, random, logical_or, where, array

random.seed(1) 

def generate_rf(X_train, y_train, X_test, y_test, numTrees=50):
  rf = RandomForestClassifier(n_estimators=numTrees, n_jobs=-1)
  rf.fit(X_train, y_train)
  print "rf score ", rf.score(X_test, y_test)
  return rf

def combine_rfs(rf_a, rf_b):
  # Graft rf_b's fitted trees onto rf_a and update the tree count
  rf_a.estimators_ += rf_b.estimators_
  rf_a.n_estimators = len(rf_a.estimators_)
  return rf_a

def make_data(ndata, yfrac=0.5):
  # Random number of samples (10-100), ndata features each; a sample is
  # labelled 1 if any feature exceeds yfrac times that feature's range
  nx = int(random.uniform(10,100))

  X = zeros((nx,ndata))
  Y = zeros(nx)

  for n in range(ndata):
    rnA = random.random()*10**(random.random()*5)
    X[:,n] = random.uniform(-rnA,rnA, nx)
    Y = logical_or(Y,where(X[:,n] > yfrac*rnA, 1.,0.))

  return X, Y

def train(ntrain=5, ndata=10, test_frac=0.2, yfrac=0.5):
  rfs = []
  for u in range(ntrain):
    X, Y = make_data(ndata, yfrac=yfrac)

    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_frac)

    #Train the random forest and add to list
    rfs.append(generate_rf(X_train, Y_train, X_test, Y_test))

  # Combine the block classifiers into a single classifier
  return reduce(combine_rfs, rfs)

def predict(rf, ndata=10):
  X, Y = make_data(ndata)  # generated labels are unused; only X matters here
  Y = rf.predict(X)

if __name__ == "__main__":
  rf = train(yfrac = 0.42)
  predict(rf)
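
For reference, here is a minimal sketch of the warm_start variant mentioned above. It is an illustration under assumptions, not code from the question: it reuses the make_data() helper defined above, and whether each fit sees one or both classes depends on the random data, so the failure is intermittent.

rf = RandomForestClassifier(n_estimators=50, warm_start=True, n_jobs=-1)

X1, Y1 = make_data(10, yfrac=0.42)  # may contain only one class
rf.fit(X1, Y1)

rf.n_estimators += 50               # request 50 additional trees
X2, Y2 = make_data(10, yfrac=0.5)   # usually contains both classes
rf.fit(X2, Y2)                      # new trees may see a different class set

rf.predict(make_data(10)[0])        # same ValueError if class counts differ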

Your first RandomForest gets only positive cases, while the other RandomForests get both classes. As a result, their DecisionTree results are incompatible with each other: trees fit on a single class return per-class probability arrays of shape (n, 1), while trees fit on two classes return shape (n, 2), which is exactly the broadcast mismatch in the traceback. Run your code with this replacement train() function:

def train(ntrain=5, ndata=10, test_frac=0.2, yfrac=0.5):
  rfs = []
  for u in range(ntrain):
    X, Y = make_data(ndata, yfrac=yfrac)

    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_frac)

    # Fail fast if the training labels are single-class (all 0 or all 1)
    assert Y_train.sum() != 0
    assert Y_train.sum() != len(Y_train)
    #Train the random forest and add to list
    rfs.append(generate_rf(X_train, Y_train, X_test, Y_test))

  # Combine the block classifiers into a single classifier
  return reduce(combine_rfs, rfs)
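
As an extra safeguard (an illustration, not part of the original answer), the merge itself can be made to fail fast when the two forests disagree on the classes they were fit on:

from numpy import array_equal

def combine_rfs_checked(rf_a, rf_b):
  # Illustrative guard: refuse to merge forests fit on different class
  # sets, the condition behind the (39,1) vs (39,2) broadcast error
  assert array_equal(rf_a.classes_, rf_b.classes_), \
      "forests were fit on different classes and cannot be combined"
  rf_a.estimators_ += rf_b.estimators_
  rf_a.n_estimators = len(rf_a.estimators_)
  return rf_a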

Use a StratifiedShuffleSplit cross-validation generator rather than train_test_split, and check to make sure each RF gets both (all) classes in the training set.
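
A minimal sketch of that suggestion, using the sklearn 0.16-era cross_validation API (StratifiedShuffleSplit took y directly and yielded index arrays); the stratified_split helper name is illustrative:

from numpy import unique
from sklearn.cross_validation import StratifiedShuffleSplit

def stratified_split(X, Y, test_frac=0.2):
  # Guard against single-class data, which stratification alone cannot fix
  assert len(unique(Y)) > 1, "need samples of every class before splitting"
  sss = StratifiedShuffleSplit(Y, n_iter=1, test_size=test_frac)
  train_idx, test_idx = next(iter(sss))
  return X[train_idx], X[test_idx], Y[train_idx], Y[test_idx]

Dropping this helper in place of the train_test_split call in train() preserves the class proportions of Y in both halves of every split.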
