
Trying to understand Random forest for regression

I'm trying to understand the random forest for regression. I've read a lot about it already, but I still find it very hard to understand. What I do understand is this: the random forest averages the answers from multiple decision trees. Each decision tree is built using a different sample and a different subset of features. However, there are some things which I still don't quite understand.

  1. If I'm correct, a tree is built using a node splitting algorithm. Is it true that there are different algorithms possible for splitting nodes?
    • I've read for example about the Information Gain and Standard Deviation Reduction.
  2. Is it true that at each node of a decision tree, only one feature is considered?
  3. From what I've read, I understood that the decision tree fits the data in a piecewise linear fashion by minimizing the sum of squared errors. Is this correct? And so is each fitted piece in fact a "normal" (multidimensional) linear regression?
  4. How does a random forest make predictions? I understood that when a model is trained, you don't end up with values for the coefficients of the features (compared to, say, linear regression).

Hopefully someone can make this more clear!
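
Concretely, my current mental model is something like the minimal sketch below (using scikit-learn's DecisionTreeRegressor purely for illustration; the data set, number of trees, and feature-subset size are made up):

    # My understanding: each tree sees a bootstrap sample and a random subset of
    # features, and the forest averages the trees' predictions.
    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

    n_trees, n_sub_features = 50, 3
    trees, feature_subsets = [], []
    for _ in range(n_trees):
        rows = rng.integers(0, len(X), size=len(X))                         # bootstrap sample
        cols = rng.choice(X.shape[1], size=n_sub_features, replace=False)   # feature subset (drawn once per tree, as I understand it)
        trees.append(DecisionTreeRegressor().fit(X[rows][:, cols], y[rows]))
        feature_subsets.append(cols)

    # Forest prediction = average of the individual trees' predictions.
    forest_pred = np.mean([t.predict(X[:, cols]) for t, cols in zip(trees, feature_subsets)], axis=0)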

  1. Yes. Information Gain and Gini impurity are two common criteria for classification; for regression, however, a measure of variance is commonly used, e.g. the reduction in the mean sum of squares (MSE/variance reduction).
  2. A split is made on one variable; however, the candidate variables considered at each split are chosen at random, and the size of that random subset is controlled by the mtry argument (feature bagging).
  3. In a way: the fit is piecewise, but each piece is a constant (the mean of the training observations in that leaf) rather than a separate linear regression.
  4. Each tree makes its own prediction based on which leaf the new observation falls into; the overall prediction is the average over all the trees (see the sketch just below this list).
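
To make point 4 concrete, here is a minimal sketch assuming scikit-learn's RandomForestRegressor (its max_features parameter plays the role of mtry; the synthetic data and settings are made up):

    # The forest's prediction is just the mean of its individual trees' predictions.
    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor

    X, y = make_regression(n_samples=300, n_features=6, noise=5.0, random_state=1)

    # max_features=2: two candidate features are drawn at random for each split (the mtry analogue).
    rf = RandomForestRegressor(n_estimators=100, max_features=2, random_state=1).fit(X, y)

    forest_pred = rf.predict(X[:5])
    tree_preds = np.stack([tree.predict(X[:5]) for tree in rf.estimators_])  # one row per tree
    print(np.allclose(forest_pred, tree_preds.mean(axis=0)))  # True: the forest averages its trees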
  1. Yes, there are different node-splitting criteria (Gini, Information Gain, Entropy, etc.). The choice of criterion doesn't matter much in practice (you can show they essentially do the same thing on all non-pathological distributions and tend to generate roughly the same splits); it is less important than other hyperparameters such as minimum samples per node, class weights, etc.
  2. Not quite. Some (or all) of the candidate features are considered at each node during tree construction, but in the end each node splits on only one feature: the optimal one according to the splitting criterion, given the set of candidate features and split points exposed to it.
  3. Multiple things:
    • Minimizing the sum of squared errors does not guarantee that the output will have a normal distribution. Rather, it is the optimal loss function when the errors happen to be normally distributed, i.e. in that case it minimizes the expected output error. In general it behaves well as a loss function and is often preferable to MAE: SSE penalizes outliers more heavily and behaves 'smoothly'.
    • You can also use loss functions other than the sum of squared errors, such as RMSE, log loss, or MAE.
    • Conceptually, you can look at a tree or subtree as a poor man's piecewise approximation to a (continuous) regressor. There is an obvious tension: shallower trees give you coarse steps and discontinuities, while deeper trees tend to overfit. Essentially we construct only a rough approximation (over multiple variables), adding detail where the tree construction (i.e. the node-splitting criterion) tells us it is most needed (see the sketch after this list).
  4. To evaluate (make predictions with) a tree, for each input sample you just walk from the root node down to a leaf node, following the split condition at each internal node; the value stored in that leaf is the tree's prediction.
    • Yes, an RF doesn't have coefficients the way linear regression does. It does, however, have feature importances, which tell you (in aggregate) which features get used, and how often, across all the trees.
    • But beware of directly interpreting coefficients from linear regression, too; that has its own caveats (correlated predictors, etc.).
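
The following minimal sketch illustrates points 3 and 4 above, assuming scikit-learn (the toy data is invented for illustration): a single regression tree predicts a step function (piecewise constant, not sloped linear pieces), and the fitted forest exposes feature importances rather than coefficients.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(42)
    X = rng.uniform(0, 10, size=(300, 2))
    y = np.sin(X[:, 0]) + 0.1 * X[:, 1] + rng.normal(scale=0.2, size=300)  # feature 0 matters most

    rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

    # Point 3: sweep feature 0 over a grid while holding feature 1 fixed. A single
    # tree's prediction is a step function, not a set of sloped linear segments.
    grid = np.column_stack([np.linspace(0, 10, 500), np.full(500, 5.0)])
    tree_curve = rf.estimators_[0].predict(grid)
    print(len(np.unique(tree_curve)), "distinct levels over", len(grid), "grid points")
    # Averaging many such step functions (the forest) is still a step function,
    # just with much finer steps.

    # Point 4: no coefficients as in linear regression, but aggregate importances exist.
    print(rf.feature_importances_)  # feature 0 should dominate here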
