PySpark: explode a column into a new DataFrame

Question

I have a PySpark DataFrame with this schema:

 |-- doc_id: string (nullable = true)     
 |-- msp_contracts: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _VALUE: string (nullable = true)
 |    |    |-- _el1: string (nullable = true)
 |    |    |-- _el2: long (nullable = true)
 |    |    |-- _el3: string (nullable = true)
 |    |    |-- _el4: string (nullable = true)
 |    |    |-- _el5: string (nullable = true)

How do I get this DataFrame:

|-- doc_id: string (nullable = true)
|-- _el1: string (nullable = true)
|-- _el3: string (nullable = true)
|-- _el4: string (nullable = true)
|-- _el5: string (nullable = true)

I tried this in a select:

explode('msp_contracts').select(
 col(u'msp_contracts.element._el1'),
 col(u'msp_contracts.element._el2')
)

but I get this error:

'Column' object is not callable

Answer 1

After explode('msp_contracts'), Spark adds a column named col that holds the result of the explode (if no alias is provided).

df.select("doc_id",explode("msp_contracts")).show()
#+------+---+
#|doc_id|col|
#+------+---+
#|     1|[1]|
#+------+---+

Use col to select _el1. Try df.select("doc_id",explode("msp_contracts")).select("doc_id",col(u"col._el1")).show()

Example:

from pyspark.sql.functions import col, explode

jsn='{"doc_id":1,"msp_contracts":[{"_el1":1}]}'
df=spark.read.json(sc.parallelize([(jsn)]))

#schema
#root
# |-- doc_id: long (nullable = true)
# |-- msp_contracts: array (nullable = true)
# |    |-- element: struct (containsNull = true)
# |    |    |-- _el1: long (nullable = true)

df.withColumn("msp_contracts",explode(col("msp_contracts"))).\
select("doc_id","msp_contracts._el1").show()
#+------+----+
#|doc_id|_el1|
#+------+----+
#|     1|   1|
#+------+----+

UPDATE:

df.select("doc_id",explode("msp_contracts")).\
select("doc_id","col._el1").\
show()
#or
df.select("doc_id",explode("msp_contracts")).\
select("doc_id",col(u"col._el1")).\
show()
#+------+----+
#|doc_id|_el1|
#+------+----+
#|     1|   1|
#+------+----+

Answer 2

This works for me:

df.select("doc_id",explode("msp_contracts")).\
   select("doc_id","col._el1")

With an alias and custom column names:

# now_f() is the answerer's own helper and is not shown here
df.select(
        'doc_id',
        explode('msp_contracts').alias("msp_contracts")
        )\
        .select(
            'doc_id',
            # field names follow the schema in the question: _el1, _el2
            col('msp_contracts._el1').alias('last_period_44fz_customer'),
            col('msp_contracts._el2').alias('last_period_44fz_customer_inn')
        )\
        .withColumn("load_dtm", now_f())

2 answers

Answer 1: 2, accepted, 2020-04-23 17:59:19
Answer 2: 0, 2020-04-24 06:08:48
