
Case of no examples left while constructing a decision tree

I was reading the section on decision trees (page 720) in Artificial Intelligence: A Modern Approach, 3rd edition. The book describes several cases that can occur after we split the training set (the examples) by choosing an attribute. One of the cases mentioned is:

If there are no examples left, it means that no example has been observed for this combination of attribute values, and we return a default value calculated from the plurality classification of all the examples that were used in constructing the node's parent.

I understand that by plurality classification they mean majority rule. But I am unable to understand the above case, i.e. when it could occur. Could someone give an example of a decision tree where this case arises?

Think of the problem as constructing a 2D table of occurrence counts, where the columns represent the feature or class to be considered and the rows represent particular configurations of the other variables.

For example:

X Y Z | class counts
------+-------------
1 1 1 | ...
1 1 2 | ...
1 1 3 | ...

The table represents the joint distribution of the training set.
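As a rough illustration of such a count table, here is a minimal Python sketch. The training set, the attribute domains, and the helper name `count_table` are all made up for this example; they are not from the book.

```python
from collections import Counter, defaultdict
from itertools import product

# Hypothetical training set: ((X, Y, Z), class_label)
examples = [
    ((1, 1, 1), "yes"),
    ((1, 1, 1), "no"),
    ((1, 1, 2), "yes"),
    ((1, 2, 1), "no"),
    ((2, 1, 3), "yes"),
]

# Assumed domains: X has 2 states, Y and Z have 3 states each
domains = [(1, 2), (1, 2, 3), (1, 2, 3)]

def count_table(examples):
    """Map each observed (X, Y, Z) configuration to a Counter of class labels."""
    table = defaultdict(Counter)
    for config, label in examples:
        table[config][label] += 1
    return dict(table)

table = count_table(examples)

# Every configuration that is possible but has no row in the table
all_configs = list(product(*domains))
unseen = [c for c in all_configs if c not in table]

for config in sorted(table):
    print(config, dict(table[config]))
print(len(unseen), "of", len(all_configs), "configurations were never observed")
```

Rows that never show up in `table` are exactly the configurations for which a tree branch would end up with no examples.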

A particular combination of X, Y and Z (say 1,3,1) may not have been seen during training. The more variables you have, the more likely you are to encounter unseen combinations. If you have 10 variables, each with two states, then there are 2^10 = 1024 possible configurations of those variables. If each variable has three states, then the number of configurations is 3^10 = 59,049, and so on.
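To make the "no examples left" case concrete, here is a small sketch of the recursion in Python. It is a simplification under stated assumptions, not the book's exact algorithm: attributes are split in the order given rather than ranked by importance, and the data, `domains`, and function names are hypothetical. The value Y=3 occurs in the training set, but never together with X=1, so the branch X=1, Y=3 receives an empty subset.

```python
from collections import Counter

def plurality_value(examples):
    """Most common class label among the examples."""
    return Counter(label for _, label in examples).most_common(1)[0][0]

def learn_tree(examples, attributes, domains, parent_examples):
    # Case: no examples left, so fall back to the parent's plurality class
    if not examples:
        return plurality_value(parent_examples)
    labels = {label for _, label in examples}
    # Case: all remaining examples agree on the class
    if len(labels) == 1:
        return labels.pop()
    # Case: no attributes left to split on
    if not attributes:
        return plurality_value(examples)
    # Assumption: take attributes in the given order (no importance ranking)
    attr, rest = attributes[0], attributes[1:]
    branches = {}
    for value in domains[attr]:
        subset = [(x, label) for x, label in examples if x[attr] == value]
        branches[value] = learn_tree(subset, rest, domains, examples)
    return (attr, branches)

# Hypothetical data: Y=3 is only ever observed together with X=2
examples = [
    ({"X": 1, "Y": 1}, "yes"),
    ({"X": 1, "Y": 2}, "no"),
    ({"X": 1, "Y": 2}, "no"),
    ({"X": 2, "Y": 3}, "yes"),
]
domains = {"X": [1, 2], "Y": [1, 2, 3]}

tree = learn_tree(examples, ["X", "Y"], domains, examples)
print(tree)
```

Under X=1 the tree still needs a branch for Y=3, because 3 is a legal value of Y, yet no training example has X=1 and Y=3. That leaf gets "no", the plurality class of the three X=1 examples used to build its parent node.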

Frankly, I would use 1/numberCols for any particular column with a missing row, as you don't really have any information about it. You could use 1/Sum(rows) for each column, but this may unnecessarily bias the result. It depends on the data.
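A minimal sketch of the uniform fallback, assuming that "numberCols" means the number of class columns in the table. The function name `class_distribution` and the optional `prior` argument are hypothetical; the `prior` path is an added alternative for comparison (for instance the class frequencies at the parent node, closer in spirit to the book's default), not something the suggestion above spells out.

```python
from collections import Counter

def class_distribution(counts, classes, prior=None):
    """Class probabilities for one row of the count table.

    counts:  Counter of class labels observed for this configuration
    classes: all class labels (the columns of the table)
    prior:   optional fallback distribution for unseen rows (assumption)
    """
    total = sum(counts.get(c, 0) for c in classes)
    if total > 0:
        # Row was observed: use its relative frequencies
        return {c: counts.get(c, 0) / total for c in classes}
    if prior is not None:
        # Row never observed: use the supplied fallback distribution
        return dict(prior)
    # Row never observed and nothing else known: 1/numberCols per class
    return {c: 1 / len(classes) for c in classes}

classes = ["yes", "no"]
seen = class_distribution(Counter(yes=3, no=1), classes)
unseen = class_distribution(Counter(), classes)
print(seen, unseen)
```

The uniform choice encodes "no information", while a prior taken from the rest of the data pulls unseen rows toward whatever is common elsewhere, which is the bias mentioned above.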

