
PySpark: converting RDD to DataFrame with nulls

I am using PySpark (1.6) and elasticsearch-hadoop (5.1.1). I am getting my data from Elasticsearch into an RDD via:

es_rdd = sc.newAPIHadoopRDD(                                               
    inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",          
    keyClass="org.apache.hadoop.io.NullWritable",                          
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",     
    conf=es_read_conf)

Here es_read_conf is just a dictionary of settings for my ES cluster, and sc is the SparkContext object. This works fine and I get the RDD objects as expected.
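For reference, a minimal sketch of what such a configuration might look like; the keys ("es.nodes", "es.port", "es.resource") are standard elasticsearch-hadoop settings, but the values below are placeholders:

es_read_conf = {
    "es.nodes": "localhost",           # ES host(s)
    "es.port": "9200",                 # ES REST port
    "es.resource": "my_index/my_type"  # index/type to read from
}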

I'd like to convert this to a DataFrame using

df = es_rdd.toDF()

but I get the error:

ValueError: Some of types cannot be determined by the first 100 rows, please try again with sampling

Giving the toDF method a sampling ratio (sampleRatio) results in the same error. From what I understand, this occurs because PySpark is unable to determine the type of each field. I know that there are fields in my Elasticsearch cluster that are all null.
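The failure is easy to reproduce on a plain RDD. In this sketch (with made-up column names), column b is None in every row, so the sampler has nothing from which to infer a type:

from pyspark.sql import Row

rdd = sc.parallelize([Row(a=1, b=None), Row(a=2, b=None)])
df = rdd.toDF()  # raises the same ValueError: b is null in every sampled row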

What is the best way to convert this to a DataFrame?

The best way is to tell Spark the types of the data you are converting. Please see the documentation for createDataFrame, in particular the fifth example (the one with StructType).
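A minimal sketch of that approach, assuming hypothetical field names title and views and a SQLContext named sqlContext. Since es_rdd yields (id, document-dict) pairs, the values are first mapped into tuples matching the schema:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Declare the schema explicitly; nullable=True tolerates all-null columns.
schema = StructType([
    StructField("title", StringType(), True),
    StructField("views", IntegerType(), True)
])

# Extract each document's fields in schema order, then build the DataFrame
# with the explicit schema so Spark never has to sample for type inference.
rows = es_rdd.map(lambda kv: (kv[1].get("title"), kv[1].get("views")))
df = sqlContext.createDataFrame(rows, schema)

With an explicit schema, the sampling step is skipped entirely, so fields that are null in every document no longer trigger the ValueError.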
