Create an array from a dataframe containing two columns

I have a dataframe with the following schema:

    root
     |-- _id: long (nullable = true)
     |-- data: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- k: string (nullable = true)
     |    |    |-- v: string (nullable = true)
     |-- c: string (nullable = true)

Output of df.show(5):

    _id | data                                                               | c
    ----+--------------------------------------------------------------------+---
    1   | [[key1,key2,key3,key4,key5],[value1,value2,value3,value4,value5]]  | c1
    2   | [[key1,key3,key2,key6],[value11,value31,value2,value61]]           | c2
    3   | [[key7,key1,key3,key8,key9],[value7,value1,value3,value8,value91]] | c3
    4   | [[key3,key2,key4,key5,key10],[value32,value23,value43,value10]]    | c4
    5   | [[key1,key2,key4,key10],[value1,value23,value42,value101]]         | c1
    ...

I want to know whether it is possible to get the following result, and how I should proceed:

    _id | key1    | key2    | key3    | key4    | key5   | key6   | key7   | key8   | key9    | key10    | ...
    ----+---------+---------+---------+---------+--------+--------+--------+--------+---------+----------+
    1   | value1  | value2  | value3  | value4  | value5 |        |        |        |         |          |
    2   | value11 | value2  | value31 |         |        | value6 |        |        |         |          |
    3   | value1  |         | value3  |         |        |        | value7 | value8 | value91 |          |
    4   |         | value23 | value32 | value43 |        |        |        |        | value10 |          |
    5   | value1  | value23 |         | value42 |        |        |        |        |         | value101 |
    ...

I tried to use explode but did not get the result I wanted. I also tried to construct an array from the first two columns, but that seems difficult.

You need to map this dataframe to one in which each row holds only the values, then create a new dataframe with the appropriate column names.

This should point you in the right direction:

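    # Assumes each data cell is [[keys...], [values...]] and every row has the same keys in the same order.
    rows = df.select("data").collect()
    column_names = rows[0][0][0]              # key list of the first row
    values = [row[0][1] for row in rows]      # value list of each row
    data_par = sc.parallelize(values)
    new_df = spark.createDataFrame(data_par, column_names, samplingRatio=0.1)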
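
The snippet above takes its column names from the first row only, so it lines up only when every row carries the same keys in the same order, which is not the case in the sample data. A rough sketch of another way to get the wide layout, which may not be what the answer had in mind, is to explode the array and pivot on the key. It assumes data really is an array of k/v structs as the printed schema says, and that keys are unique within a row. The function names pivot_rows and pivot_with_spark are made up for illustration. pivot_rows models the reshape in plain Python so the logic can be checked without a cluster.

```python
def pivot_rows(rows):
    """Pure-Python model of the reshape; rows are (_id, [(k, v), ...]) tuples."""
    # Collect column names in first-seen order across all rows.
    columns = []
    for _id, pairs in rows:
        for k, _ in pairs:
            if k not in columns:
                columns.append(k)
    # One output row per _id; keys missing from a row become None.
    table = []
    for _id, pairs in rows:
        lookup = dict(pairs)
        table.append([_id] + [lookup.get(k) for k in columns])
    return ["_id"] + columns, table


def pivot_with_spark(df):
    """Same reshape on a Spark DataFrame where data is array<struct<k, v>>."""
    # Imported lazily so the pure-Python part runs without PySpark installed.
    from pyspark.sql import functions as F

    # One row per (k, v) pair, still tagged with its _id.
    exploded = df.select("_id", F.explode("data").alias("kv"))
    return (
        exploded.select("_id", F.col("kv.k").alias("k"), F.col("kv.v").alias("v"))
        .groupBy("_id")
        .pivot("k")          # one column per distinct key
        .agg(F.first("v"))   # keys absent from a row come out as null
    )


if __name__ == "__main__":
    sample = [
        (1, [("key1", "value1"), ("key2", "value2")]),
        (2, [("key1", "value11"), ("key3", "value31")]),
    ]
    header, table = pivot_rows(sample)
    print(header)
    for row in table:
        print(row)
```

With a real dataframe, pivot_with_spark(df) would be the call to try. The c column is dropped in this sketch and would have to be joined back on _id if it is needed.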
