
PySpark: converting RDD to DataFrame with nulls

I am using PySpark (1.6) and elasticsearch-hadoop (5.1.1). I am getting my data from Elasticsearch into an RDD via:

es_rdd = sc.newAPIHadoopRDD(                                               
    inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",          
    keyClass="org.apache.hadoop.io.NullWritable",                          
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",     
    conf=es_read_conf)

Here es_read_conf is just a dictionary of settings for my ES cluster, and sc is the SparkContext object. This works fine and I get the RDD objects as expected.
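For reference, a minimal sketch of what such a configuration might look like; the keys ("es.nodes", "es.port", "es.resource") are standard elasticsearch-hadoop settings, but the values below are placeholders:

es_read_conf = {
    "es.nodes": "localhost",           # ES host(s)
    "es.port": "9200",                 # ES REST port
    "es.resource": "my_index/my_type"  # index/type to read from
}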

I'd like to convert this to a DataFrame using

df = es_rdd.toDF()

but I get the error:

ValueError: Some of types cannot be determined by the first 100 rows, please try again with sampling

Giving the toDF method a sampling ratio (sampleRatio) results in the same error. From what I understand, this occurs because PySpark is unable to determine the type of each field. I know that there are fields in my Elasticsearch cluster that are all null.
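The failure is easy to reproduce on a plain RDD. In this sketch (with made-up column names), column b is None in every row, so the sampler has nothing from which to infer a type:

from pyspark.sql import Row

rdd = sc.parallelize([Row(a=1, b=None), Row(a=2, b=None)])
df = rdd.toDF()  # raises the same ValueError: b is null in every sampled row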

What is the best way to convert this to a DataFrame?

The best way is to tell Spark the types of the data you are converting. Please see the documentation for createDataFrame, in particular the fifth example (the one with StructType).
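A minimal sketch of that approach, assuming hypothetical field names title and views and a SQLContext named sqlContext. Since es_rdd yields (id, document-dict) pairs, the values are first mapped into tuples matching the schema:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Declare the schema explicitly; nullable=True tolerates all-null columns.
schema = StructType([
    StructField("title", StringType(), True),
    StructField("views", IntegerType(), True)
])

# Extract each document's fields in schema order, then build the DataFrame
# with the explicit schema so Spark never has to sample for type inference.
rows = es_rdd.map(lambda kv: (kv[1].get("title"), kv[1].get("views")))
df = sqlContext.createDataFrame(rows, schema)

With an explicit schema, the sampling step is skipped entirely, so fields that are null in every document no longer trigger the ValueError.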
