Spark - StringIndexer Vs OneHotEncoderEstimator

I am learning Spark and found the code below in a tutorial. I understand that it one-hot encodes the DataFrame, but I don't understand why StringIndexer is used. Does StringIndexer have to be used together with OneHotEncoderEstimator?

val si = new StringIndexer()
      .setHandleInvalid("keep")
      .setInputCol(ProcuctTypeCol)
      .setOutputCol(ProcuctTypeSIOutCol)

val ohe = new OneHotEncoderEstimator()
      .setHandleInvalid("keep")
      .setInputCols(Array(si.getOutputCol))
      .setOutputCols(Array(ProductTypeOHEOutCol))

val pipeline = new Pipeline()
  .setStages(Array(si, ohe))

Thanks

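StringIndexer (SI) converts each string value to a numeric index, and OneHotEncoderEstimator (OHE) turns that index into a one-hot vector. If your column already holds integer category indices such as 1, 2, 3, you can apply OHE directly. If the values are strings such as A, B, C, you have to run SI first and then chain OHE after it, which is what the pipeline in the question does.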
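To make the two steps concrete, here is a minimal sketch in plain Scala with no Spark dependency. It is only an illustration of the idea, not Spark's implementation: the names `IndexThenOneHot`, `fitIndex` and `oneHot` are made up for this example, ties are broken alphabetically as an assumption, and the output is a dense vector with one slot per category. Spark's own encoder differs in details (for example, it produces sparse vectors and its `dropLast` option defaults to true).

```scala
object IndexThenOneHot {
  // Step 1 (the StringIndexer idea): map each distinct label to an index,
  // most frequent label first. Breaking ties alphabetically is an
  // assumption of this sketch.
  def fitIndex(labels: Seq[String]): Map[String, Int] =
    labels.groupBy(identity).toSeq
      .map { case (label, occurrences) => (label, occurrences.size) }
      .sortBy { case (label, count) => (-count, label) }
      .map(_._1)
      .zipWithIndex
      .toMap

  // Step 2 (the one-hot idea): turn an index into a vector with a single
  // 1.0 at that position. This only works on numbers, which is why the
  // strings have to be indexed first.
  def oneHot(index: Int, size: Int): Vector[Double] =
    Vector.tabulate(size)(i => if (i == index) 1.0 else 0.0)

  def main(args: Array[String]): Unit = {
    val labels = Seq("A", "B", "C", "A", "A", "B")
    val index = fitIndex(labels)
    labels.foreach { l =>
      println(s"$l -> ${index(l)} -> ${oneHot(index(l), index.size)}")
    }
  }
}
```

With this input, A is the most frequent label and gets index 0, B gets 1 and C gets 2, so the first printed line is `A -> 0 -> Vector(1.0, 0.0, 0.0)`.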
