
Pyspark - Data set to null when converting rdd to dataframe

With PySpark I'm trying to convert an RDD of nested dicts into a DataFrame, but I'm losing the data in some fields, which are set to null.

Here's the code:

from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext()
sqlContext = SQLContext(sc)

def convert_to_row(d):
    return Row(**d)

df2 = sc.parallelize([{"id": "14yy74hwogxoyl2l3v", "geoloc": {"country": {"geoname_id": 3017382, "iso_code": "FR", "name": "France"}}}]).map(convert_to_row).toDF()
df2.printSchema()
df2.show()
df2.toJSON().saveAsTextFile("/tmp/json.test")

When I have a look at /tmp/json.test, here's the content (after manual indentation):

{
    "geoloc": {
        "country": {
            "name": null,
            "iso_code": null,
            "geoname_id": 3017382
        }
    },
    "id": "14yy74hwogxoyl2l3v"
}

iso_code and name have been converted to null.

Can anyone help me with it? I can't understand it.

I'm using Python 2.7 and Spark 2.0.0.

Thanks!

This happens because you don't use Row correctly. The Row constructor is not recursive and operates only on the top-level fields. When you take a look at the schema:

root
 |-- geoloc: map (nullable = true)
 |    |-- key: string
 |    |-- value: map (valueContainsNull = true)
 |    |    |-- key: string
 |    |    |-- value: long (valueContainsNull = true)
 |-- id: string (nullable = true)

you'll see that geoloc is represented as map<string,map<string,long>>. Because Spark infers a single value type for the inner map and geoname_id is a long, the string values of iso_code and name cannot be stored in it and end up as null. A correct representation of the structure would use nested Rows:

Row(
    id="14yy74hwogxoyl2l3v", 
    geoloc=Row(
        country=Row(geoname_id=3017382, iso_code="FR", name="France")))

while what you pass is equivalent to:

Row(
    geoloc={'country':
        {'geoname_id': 3017382, 'iso_code': 'FR', 'name': 'France'}},
    id='14yy74hwogxoyl2l3v')
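
To confirm the difference (a quick sketch, reusing the sc defined earlier), building a DataFrame from the nested-Row version yields struct fields instead of maps, and the string values survive:

from pyspark.sql import Row

row = Row(
    id="14yy74hwogxoyl2l3v",
    geoloc=Row(country=Row(geoname_id=3017382, iso_code="FR", name="France")))

# nested Rows are inferred as struct fields (kwargs are sorted alphabetically)
sc.parallelize([row]).toDF().printSchema()
# root
#  |-- geoloc: struct (nullable = true)
#  |    |-- country: struct (nullable = true)
#  |    |    |-- geoname_id: long (nullable = true)
#  |    |    |-- iso_code: string (nullable = true)
#  |    |    |-- name: string (nullable = true)
#  |-- id: string (nullable = true)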

Since a correct implementation would have to cover a number of border cases, it makes more sense to use an intermediate JSON representation and the Spark JSON data source. If you do want to roll your own conversion anyway, a rough sketch is shown below.
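For illustration only, a minimal recursive converter (a hypothetical helper dict_to_row; it handles nested dicts but not lists of dicts, None values, or ragged schemas) could look like this:

def dict_to_row(d):
    # recursively replace nested dicts with Rows, pass everything else through
    return Row(**{k: dict_to_row(v) if isinstance(v, dict) else v
                  for k, v in d.items()})

df3 = sc.parallelize([{"id": "14yy74hwogxoyl2l3v",
                       "geoloc": {"country": {"geoname_id": 3017382,
                                              "iso_code": "FR",
                                              "name": "France"}}}]) \
        .map(dict_to_row).toDF()
df3.printSchema()  # geoloc is now a struct, and iso_code/name keep their values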

Following the explanation already provided by @user6910411 (and saving me the time to do it myself), the remedy (i.e. the intermediate JSON representation) is to use read.json instead of toDF and your function:

spark.version
# u'2.0.2'

jsonRDD = sc.parallelize([{"id": "14yy74hwogxoyl2l3v", "geoloc": {"country": {"geoname_id": 3017382, "iso_code": "FR", "name": "France"}}}])

df = spark.read.json(jsonRDD)
df.collect()
# result:
[Row(geoloc=Row(country=Row(geoname_id=3017382, iso_code=u'FR', name=u'France')), id=u'14yy74hwogxoyl2l3v')]

# just to have a look at what will be saved:
df.toJSON().collect()
# result:
[u'{"geoloc":{"country":{"geoname_id":3017382,"iso_code":"FR","name":"France"}},"id":"14yy74hwogxoyl2l3v"}']

df.toJSON().saveAsTextFile("/tmp/json.test")
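
If you prefer to be explicit about the serialization instead of passing raw Python dicts to read.json, a variant (a sketch using the standard json module) turns each record into a proper JSON string first:

import json

# serialize each dict to a JSON string, then let Spark parse it
df_from_strings = spark.read.json(jsonRDD.map(json.dumps))
df_from_strings.collect()
# same result as df above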

For comparison, here is how your own df2 looks:

df2.collect()
# result:
[Row(geoloc={u'country': {u'geoname_id': 3017382, u'iso_code': None, u'name': None}}, id=u'14yy74hwogxoyl2l3v')]

df2.toJSON().collect()
# result:
[u'{"geoloc":{"country":{"name":null,"iso_code":null,"geoname_id":3017382}},"id":"14yy74hwogxoyl2l3v"}']
