
Spark: error when reading dataframe

I am trying to read the first rows of a Spark DataFrame I created as below:

from pyspark.ml.feature import VectorAssembler

#read file
datasetDF = sqlContext.read.format('com.databricks.spark.csv').options(delimiter=';', header='true', inferschema='true').load(dataset)
#vectorize
ignore = ['value']
vecAssembler = VectorAssembler(inputCols=[x for x in datasetDF.columns if x not in ignore], outputCol="features")
#split training - test set
(split20DF, split80DF) = datasetDF.randomSplit([1.0, 4.0],seed)
testSetDF = split20DF.cache()
trainingSetDF = split80DF.cache()

print trainingSetDF.take(5)

However, if I run this code I get the following error (caused by the last line, print trainingSetDF.take(5)):

: org.apache.spark.SparkException: Job aborted due to stage failure: 
Task 0 in stage 3.0 failed 4 times, most recent failure: 
Lost task 0.3 in stage 3.0 (TID 7, 192.168.101.102): java.util.concurrent.ExecutionException: java.lang.Exception: failed to compile:
org.codehaus.janino.JaninoRuntimeException:
 Code of method "
(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I" of class
 **"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB**

I should add that this happens only when I have quite a lot of features (more than 256). What am I doing wrong?

Thanks, Florent

If you have an ID variable that you can use to join the data, I found a workaround for the randomSplit() error on wide data (if you don't have one, you can easily make one before splitting and still use the workaround; see the sketch after the code below). Make sure you use separate variable names for the split parts instead of reusing the same ones (I used train1/valid1), otherwise you'll get the same error, as I think it just creates a pointer to the same RDD. This is probably one of the stupidest bugs I've ever seen. Our data wasn't even that wide.

Y            = 'y'
ID_VAR       = 'ID'
DROPS        = [ID_VAR]

train = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('s3n://emr-related-files/train.csv')
test = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('s3n://emr-related-files/test.csv')
train.show()
print(train.count())

# split only the ID column, then join each half back to the full data
(train1, valid1) = train.select(ID_VAR).randomSplit([0.7, 0.3], seed=123)
# build 'valid' first, while 'train' still refers to the full DataFrame
valid = valid1.join(train, ID_VAR, 'inner')
train = train1.join(train, ID_VAR, 'inner')

train.show()
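
If your data doesn't already have an ID column, a minimal sketch of creating one before splitting could look like the following. This is an assumption on my part, not part of the original answer: it uses monotonically_increasing_id() from pyspark.sql.functions to generate unique IDs, and reuses the ID_VAR name from above; the column name itself is arbitrary.

from pyspark.sql.functions import monotonically_increasing_id

# assumed: 'train' was loaded as in the snippet above and has no ID column yet
train = train.withColumn(ID_VAR, monotonically_increasing_id())

# then split only the ID column and join back, as above
(train1, valid1) = train.select(ID_VAR).randomSplit([0.7, 0.3], seed=123)
valid = valid1.join(train, ID_VAR, 'inner')
train = train1.join(train, ID_VAR, 'inner')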


 