How to use StringIndexer fit

I am trying to use StringIndexer to transform my categorical variables into numerical variables. To that end, I am following the example documented here: https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.feature.StringIndexer

I have the following code:

from pyspark import SparkContext, SparkConf
from pyspark import sql
from pyspark.ml.feature import StringIndexer

# Local Spark context with 1 GB of executor memory
conf = (SparkConf()
    .setMaster("local")
    .setAppName("My app")
    .set("spark.executor.memory", "1g"))
sc = SparkContext(conf=conf)
sqlContext = sql.SQLContext(sc)

# Toy DataFrame with a string-valued "category" column
df = sqlContext.createDataFrame(
    [(0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")],
    ["id", "category"])

# Fit the indexer on the data, then map each category to its numeric index
indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
indexed = indexer.fit(df).transform(df)
indexed.show()
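
For reference, the linked documentation shows this example producing frequency-ordered indices on a working installation ("a" is the most frequent category, so it maps to 0.0), so indexed.show() should print something like:

+---+--------+-------------+
| id|category|categoryIndex|
+---+--------+-------------+
|  0|       a|          0.0|
|  1|       b|          2.0|
|  2|       c|          1.0|
|  3|       a|          0.0|
|  4|       a|          0.0|
|  5|       c|          1.0|
+---+--------+-------------+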

On my setup, however, I get the following error instead:

py4j.protocol.Py4JJavaError: An error occurred while calling o38.fit.
: java.lang.IllegalArgumentException
    at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source)
    at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source)
    at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source)
    at org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:46)
    at org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:443)
    at org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:426)
    at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
    at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:103)
    at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:103)
    at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
    at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
    at scala.collection.mutable.HashMap$$anon$1.foreach(HashMap.scala:103)
    at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
    at org.apache.spark.util.FieldAccessFinder$$anon$3.visitMethodInsn(ClosureCleaner.scala:426)
    at org.apache.xbean.asm5.ClassReader.a(Unknown Source)
    at org.apache.xbean.asm5.ClassReader.b(Unknown Source)
    at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
    at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
    at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:257)
    at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:256)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:256)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:156)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:2294)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2068)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2094)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$countByKey$1.apply(PairRDDFunctions.scala:373)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$countByKey$1.apply(PairRDDFunctions.scala:373)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
    at org.apache.spark.rdd.PairRDDFunctions.countByKey(PairRDDFunctions.scala:372)
    at org.apache.spark.rdd.RDD$$anonfun$countByValue$1.apply(RDD.scala:1204)
    at org.apache.spark.rdd.RDD$$anonfun$countByValue$1.apply(RDD.scala:1204)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
    at org.apache.spark.rdd.RDD.countByValue(RDD.scala:1203)
    at org.apache.spark.ml.feature.StringIndexer.fit(StringIndexer.scala:113)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:564)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.base/java.lang.Thread.run(Thread.java:844)


Process finished with exit code 1

I was wondering if anyone knows what went wrong.

This is solved. The problem was that PySpark does not currently work with Java 9; it works only with Java 8.

I had to remove Java 9 and set my bash profile to use Java 8 instead.
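
For reference, here is a minimal sketch of one way to pin PySpark to Java 8 from Python itself, assuming a Java 8 JDK is already installed (the JAVA_HOME path below is a placeholder; substitute the actual location on your machine):

import os

# Placeholder path -- point this at your real Java 8 home. On macOS,
# `/usr/libexec/java_home -v 1.8` prints it; the permanent equivalent
# is an `export JAVA_HOME=...` line in ~/.bash_profile.
os.environ["JAVA_HOME"] = "/path/to/jdk1.8.0"

from pyspark import SparkContext, SparkConf

# The JVM is launched when the SparkContext is created, so JAVA_HOME
# must be set before this line runs.
conf = SparkConf().setMaster("local").setAppName("My app")
sc = SparkContext(conf=conf)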
