
SparkException: Chi-square test expect factors

I have a dataset containing 42 features and 1 label. I want to apply the chi-square selector feature-selection method from the Spark ML library before running a decision tree for anomaly detection, but I get this error when applying the chi-square selector:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 17.0 failed 1 times, most recent failure: Lost task 0.0 in stage 17.0 (TID 45, localhost, executor driver): org.apache.spark.SparkException: Chi-square test expect factors (categorical values) but found more than 10000 distinct values in column 11.

Here is my source code:

from pyspark.ml.feature import ChiSqSelector

selector = ChiSqSelector(numTopFeatures=1, featuresCol="features",
                         outputCol="features2", labelCol="label")
result = selector.fit(dfa1).transform(dfa1)
result.show()

As the error message says, column 11 of your features vector contains more than 10,000 distinct values, so it looks continuous rather than categorical. The chi-square test can only handle features with up to 10,000 categories, and this limit is hard-coded in Spark, so you can't increase it:

  /**
   * Max number of categories when indexing labels and features
   */
  private[spark] val maxCategories: Int = 10000

In this case you can use VectorIndexer with setMaxCategories() set to a value below 10,000 to prepare your data, as in the sketch below. Whatever other preparation method you try, it will not work as long as any feature in the vector still has more than 10,000 distinct values.
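Here is a minimal sketch of that approach, assuming the same dfa1 DataFrame and column names as in the question; the value 100 for maxCategories is just an illustrative choice:

from pyspark.ml.feature import VectorIndexer, ChiSqSelector

# Index categorical features; any feature with more than maxCategories
# distinct values is left unchanged and treated as continuous.
indexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures",
                        maxCategories=100)
indexed = indexer.fit(dfa1).transform(dfa1)

# Run the selector on the indexed features instead of the raw ones.
selector = ChiSqSelector(numTopFeatures=1, featuresCol="indexedFeatures",
                         outputCol="features2", labelCol="label")
result = selector.fit(indexed).transform(indexed)
result.show()

Note that VectorIndexer leaves genuinely continuous features (such as column 11 here) unchanged, so those may still need to be discretized or dropped before the chi-square test can succeed.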
