
Spark: error when reading dataframe

I am trying to read the first rows of a Spark DataFrame I created as below:

from pyspark.ml.feature import VectorAssembler

#read file
datasetDF = sqlContext.read.format('com.databricks.spark.csv').options(delimiter=';', header='true', inferschema='true').load(dataset)
#vectorize
ignore = ['value']
vecAssembler = VectorAssembler(inputCols=[x for x in datasetDF.columns if x not in ignore], outputCol="features")
#split training - test set
(split20DF, split80DF) = datasetDF.randomSplit([1.0, 4.0],seed)
testSetDF = split20DF.cache()
trainingSetDF = split80DF.cache()

print trainingSetDF.take(5)

However, if I run this code I get the following error (caused by the last line, print trainingSetDF.take(5)):

: org.apache.spark.SparkException: Job aborted due to stage failure: 
Task 0 in stage 3.0 failed 4 times, most recent failure: 
Lost task 0.3 in stage 3.0 (TID 7, 192.168.101.102): java.util.concurrent.ExecutionException: java.lang.Exception: failed to compile:
org.codehaus.janino.JaninoRuntimeException:
 Code of method "
(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I" of class
 **"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB**

I should add that this happens only when I have quite a lot of features (more than 256). What am I doing wrong?

Thanks, Florent

If you have an ID variable that you can use to join the data, I found a workaround for the randomSplit() error on wide data (if you don't have one, you can easily make one before splitting and still use the workaround; see the sketch after the code below). Make sure you use separate variable names for the split parts instead of reusing the same ones (I used train1/valid1), otherwise you'll get the same error, as I think it just creates a pointer to the same RDD. This is probably one of the stupidest bugs I've ever seen. Our data wasn't even that wide.

Y            = 'y'
ID_VAR       = 'ID'
DROPS        = [ID_VAR]

train = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('s3n://emr-related-files/train.csv')
test = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('s3n://emr-related-files/test.csv')
train.show()
print(train.count())

# split only the ID column, then join each half back to the full data
(train1, valid1) = train.select(ID_VAR).randomSplit([0.7, 0.3], seed=123)
# build 'valid' first, while 'train' still refers to the full DataFrame
valid = valid1.join(train, ID_VAR, 'inner')
train = train1.join(train, ID_VAR, 'inner')

train.show()
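
If your data doesn't already have an ID column, a minimal sketch of creating one before splitting could look like the following. This is an assumption on my part, not part of the original answer: it uses monotonically_increasing_id() from pyspark.sql.functions to generate unique IDs, and reuses the ID_VAR name from above; the column name itself is arbitrary.

from pyspark.sql.functions import monotonically_increasing_id

# assumed: 'train' was loaded as in the snippet above and has no ID column yet
train = train.withColumn(ID_VAR, monotonically_increasing_id())

# then split only the ID column and join back, as above
(train1, valid1) = train.select(ID_VAR).randomSplit([0.7, 0.3], seed=123)
valid = valid1.join(train, ID_VAR, 'inner')
train = train1.join(train, ID_VAR, 'inner')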


 