Spark MLLib 2.0 Categorical Features in pipeline

I'm trying to build a decision tree based on log files. Some feature sets are large, containing thousands of unique values. I'm trying to use the new Pipeline and DataFrame idioms in Java. I've built a pipeline with one StringIndexer stage for each of the categorical feature columns. Then I use a VectorAssembler to create a features vector. The resultant data frame looks perfect to me after the VectorAssembler stage. My pipeline looks approximately like this:

StringIndexer -> StringIndexer -> StringIndexer -> VectorAssembler -> DecisionTreeClassifier

However, I get the following error:

DecisionTree requires maxBins (= 32) to be at least as large as the number of values in each categorical feature, but categorical feature 5 has 49 values. Considering remove this and other categorical features with a large number of values, or add more training examples.

I can resolve this issue by using a Normalizer, but then the resultant decision tree is unusable for my needs, as I need to generate a DSL decision tree with the original feature values. I can't manually set maxBins because the whole pipeline is executed together. I would like the resultant decision tree to have the StringIndexer-generated values (e.g. Feature 5 <= 132). Additionally, but less importantly, I'd like to be able to specify my own names for the features (e.g. 'domain' instead of 'Feature 5').

The DecisionTree algorithm takes a single maxBins value that bounds the number of candidate splits it considers for each feature. The default value is 32. maxBins must be greater than or equal to the largest number of categories in any categorical feature. Since your feature 5 has 49 distinct values, you need to increase maxBins to 49 or greater. maxBins is a parameter of the DecisionTreeClassifier stage itself, so it can be set on that stage (setMaxBins in Java) before the pipeline is fitted.
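As a rough illustration of the rule above, the following plain-Python sketch (no Spark involved) counts the distinct values in each categorical column and derives the smallest maxBins that would satisfy the check. The function name `required_max_bins`, the column names, and the sample rows are all hypothetical; only the default of 32 and the count of 49 come from the question.

```python
def required_max_bins(rows, categorical_columns, default_max_bins=32):
    """Return (max_bins, distinct_counts) for the given categorical columns.

    max_bins is the larger of the default and the biggest number of
    distinct values seen in any categorical column.
    """
    distinct_counts = {
        col: len({row[col] for row in rows}) for col in categorical_columns
    }
    max_bins = max([default_max_bins, *distinct_counts.values()])
    return max_bins, distinct_counts


# Hypothetical log rows: 49 distinct domains and 2 distinct HTTP methods.
rows = [
    {"domain": f"site{i % 49}.example", "method": ("GET", "POST")[i % 2]}
    for i in range(200)
]

max_bins, counts = required_max_bins(rows, ["domain", "method"])
print(max_bins)  # 49
print(counts)    # {'domain': 49, 'method': 2}
```

The resulting number is what you would then pass as the classifier's maxBins before fitting the pipeline.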

The DecisionTree algorithm has several hyperparameters, and tuning them to your data can improve accuracy. You can do this tuning using Spark's cross-validation framework, which automatically tests a grid of hyperparameters and chooses the best combination.

Here is an example in Python that tests three maxBins values, [49, 52, 55]:

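from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.tuning import ParamGridBuilder
dt = DecisionTreeClassifier(maxDepth=2, labelCol="indexed")
paramGrid = ParamGridBuilder().addGrid(dt.maxBins, [49, 52, 55]).addGrid(dt.maxDepth, [4, 6, 8]).addGrid(dt.impurity, ["entropy", "gini"]).build()
pipeline = Pipeline(stages=[labelIndexer, typeIndexer, assembler, dt])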
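To get a feel for how many models such a grid implies, here is a minimal plain-Python sketch that expands the same three value lists into every combination. It only mimics the idea of a parameter grid and is not Spark's implementation; `expand_grid` is a hypothetical helper name.

```python
from itertools import product

# The same candidate values as in the grid above.
grid = {
    "maxBins": [49, 52, 55],
    "maxDepth": [4, 6, 8],
    "impurity": ["entropy", "gini"],
}


def expand_grid(grid):
    """Return one dict per combination of the candidate values."""
    names = list(grid)
    value_lists = (grid[name] for name in names)
    return [dict(zip(names, values)) for values in product(*value_lists)]


combinations = expand_grid(grid)
print(len(combinations))  # 18
print(combinations[0])    # {'maxBins': 49, 'maxDepth': 4, 'impurity': 'entropy'}
```

With 3 x 3 x 2 = 18 combinations, each of which is trained once per cross-validation fold, the cost of tuning grows quickly as values are added to the grid.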
