
spark convert multiple rows into single row with multiple collections

I am looking for ideas on how to solve the scenario below. My use case is in Java Spark, but I am looking for ideas irrespective of language, as I have run out of ideas.

I have unstructured data as below:

98480|PERSON|TOM|GREER|1982|12|27
98480|PHONE|CELL|732|201|6789
98480|PHONE|HOME|732|123|9876
98480|ADDR|RES|102|JFK BLVD|PISCATAWAY|NJ|08854
98480|ADDR|OFF|211|EXCHANGE PL|JERSEY CITY|NJ|07302
98481|PERSON|LIN|JASSOY|1976|09|15
98481|PHONE|CELL|908|398|3389
98481|PHONE|HOME|917|363|2647
98481|ADDR|RES|111|JOURNAL SQ|JERSEY CITY|NJ|07704
98481|ADDR|OFF|365|DOWNTOWN NEWYORK|NEWYORK CITY|NY|10001

I am trying to convert them into a single row per personId: the person data plus a set of phone records and a set of addr records, something like below:

+--------+------+---------+--------+----+-----+---+--------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------+
|personId|type  |firstName|lastName|year|month|day|Phone                                                         |addr                                                                                                               |
+--------+------+---------+--------+----+-----+---+--------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------+
|98481   |PERSON|LIN      |JASSOY  |1976|09   |15 |[[PHONE, HOME, 917, 363, 2647], [PHONE, CELL, 908, 398, 3389]]|[[ADDR, OFF, 365, DOWNTOWN NEWYORK, NEWYORK CITY, NY, 10001], [ADDR, RES, 111, JOURNAL SQ, JERSEY CITY, NJ, 07704]]|
|98480   |PERSON|TOM      |GREER   |1982|12   |27 |[[PHONE, HOME, 732, 123, 9876], [PHONE, CELL, 732, 201, 6789]]|[[ADDR, RES, 102, JFK BLVD, PISCATAWAY, NJ, 08854], [ADDR, OFF, 211, EXCHANGE PL, JERSEY CITY, NJ, 07302]]         |
+--------+------+---------+--------+----+-----+---+--------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------+

using the code below:

Dataset<Row> dataset = groupedDataset
                .agg(collect_set(struct(phoneRow.col("type").as("collType"), phoneRow.col("phoneType").as("phoneType"),
                        phoneRow.col("areaCode").as("areaCode"), phoneRow.col("phoneMiddle").as("phoneMiddle"),
                        phoneRow.col("ext").as("ext"), addressRow.col("type").as("collType"),
                        addressRow.col("addrType").as("addrType"), addressRow.col("addr1").as("rowType"),
                        addressRow.col("addr2").as("addr2"), addressRow.col("city").as("city"),
                        addressRow.col("state").as("state"), addressRow.col("zipCode").as("zipCode"))).as("addrPhone"));

The output is as below, but not the format I am looking for:

+--------+------+---------+--------+----+-----+---+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|personId|type  |firstName|lastName|year|month|day|addrPhone                                                                                                                                                                                                                                                                                                                                                 |
+--------+------+---------+--------+----+-----+---+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|98481   |PERSON|LIN      |JASSOY  |1976|09   |15 |[[PHONE, HOME, 917, 363, 2647, ADDR, OFF, 365, DOWNTOWN NEWYORK, NEWYORK CITY, NY, 10001], [PHONE, HOME, 917, 363, 2647, ADDR, RES, 111, JOURNAL SQ, JERSEY CITY, NJ, 07704], [PHONE, CELL, 908, 398, 3389, ADDR, RES, 111, JOURNAL SQ, JERSEY CITY, NJ, 07704], [PHONE, CELL, 908, 398, 3389, ADDR, OFF, 365, DOWNTOWN NEWYORK, NEWYORK CITY, NY, 10001]]|
|98480   |PERSON|TOM      |GREER   |1982|12   |27 |[[PHONE, HOME, 732, 123, 9876, ADDR, RES, 102, JFK BLVD, PISCATAWAY, NJ, 08854], [PHONE, CELL, 732, 201, 6789, ADDR, RES, 102, JFK BLVD, PISCATAWAY, NJ, 08854], [PHONE, CELL, 732, 201, 6789, ADDR, OFF, 211, EXCHANGE PL, JERSEY CITY, NJ, 07302], [PHONE, HOME, 732, 123, 9876, ADDR, OFF, 211, EXCHANGE PL, JERSEY CITY, NJ, 07302]]                  |
+--------+------+---------+--------+----+-----+---+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I am looking for ideas to fix the above issue. (Each person ends up with 2 phones × 2 addresses = 4 combined structs, which is the cross product of the underlying phone-address join that the single collect_set runs over.)

Update: I was able to get the output as expected, but I am not sure how efficient it is; it looks like a lot of boilerplate code with many joins and dataframes. This is sample data I am playing with to understand Spark, but the real data I will be working with involves a lot of complex transformations, and this code does not look efficient.

Here is the updated code:

// one set of phone structs per person
Dataset<Row> groupedPhoneDataSet = groupedDataset.agg(collect_set(struct(phoneRow.col("type").as("phColType"),
        phoneRow.col("phoneType").as("phoneType"), phoneRow.col("areaCode").as("areaCode"),
        phoneRow.col("phoneMiddle").as("phoneMiddle"), phoneRow.col("ext").as("ext"))).as("phoneRec"));

// one set of address structs per person
Dataset<Row> groupedAddrDataSet = groupedDataset
        .agg(collect_set(struct(addressRow.col("type").as("addrColType"),
                addressRow.col("addrType").as("addrType"), addressRow.col("addr1").as("addr1"),
                addressRow.col("addr2").as("addr2"), addressRow.col("city").as("city"),
                addressRow.col("state").as("state"), addressRow.col("zipCode").as("zipCode"))).as("addrRec"));

// join the two aggregates back on personId
Dataset<Row> finalDataSet = groupedAddrDataSet
        .join(groupedPhoneDataSet,
                groupedAddrDataSet.col("personId").equalTo(groupedPhoneDataSet.col("personId")))
        .select(groupedPhoneDataSet.col("personId"), groupedPhoneDataSet.col("type"),
                groupedPhoneDataSet.col("firstName"), groupedPhoneDataSet.col("lastName"),
                groupedPhoneDataSet.col("year"), groupedPhoneDataSet.col("month"),
                groupedPhoneDataSet.col("day"), col("phoneRec"), col("addrRec"));

Here is the output I got:

+--------+------+---------+--------+----+-----+---+--------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------+
|personId|type  |firstName|lastName|year|month|day|phoneRec                                                      |addrRec                                                                                                            |
+--------+------+---------+--------+----+-----+---+--------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------+
|98481   |PERSON|LIN      |JASSOY  |1976|09   |15 |[[PHONE, CELL, 908, 398, 3389], [PHONE, HOME, 917, 363, 2647]]|[[ADDR, RES, 111, JOURNAL SQ, JERSEY CITY, NJ, 07704], [ADDR, OFF, 365, DOWNTOWN NEWYORK, NEWYORK CITY, NY, 10001]]|
|98480   |PERSON|TOM      |GREER   |1982|12   |27 |[[PHONE, CELL, 732, 201, 6789], [PHONE, HOME, 732, 123, 9876]]|[[ADDR, OFF, 211, EXCHANGE PL, JERSEY CITY, NJ, 07302], [ADDR, RES, 102, JFK BLVD, PISCATAWAY, NJ, 08854]]         |
+--------+------+---------+--------+----+-----+---+--------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------+

Is there a way I can do it without creating so many dataframes?

If you are okay with creating multiple data frames, then split each type of record into a different data frame, group each by personId, and join all three data frames on personId.

Find below the code that I tried; let me know if it solves your problem.

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.functions.{col, collect_list, struct}

    object Test {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
        conf.setAppName("Leads Processing Job").setMaster("local[1]")
        val sparkContext = new org.apache.spark.SparkContext(conf)
        val sqlContext = new org.apache.spark.sql.hive.HiveContext(sparkContext)
        // NOTE: with no explicit schema, the CSV reader takes the column count
        // from the first row (a 7-field PERSON record), so the 8th ADDR field
        // (zipCode, _c7) is dropped -- visible in the output below; supplying
        // an explicit 8-column schema should keep it
        val df = sqlContext.read.option("delimiter","|").format("csv").load("data.csv")
        df.printSchema()

        // person attributes, one row per personId
        val df_person = df.where("_c1 = 'PERSON'")
          .select(col("_c0").as("personId"),col("_c1").as("type")
            ,col("_c2").as("firstName"),col("_c3").as("lastName")
            ,col("_c4").as("year"),col("_c5").as("month")
            ,col("_c6").as("day"))

        val df_address = df.where("_c1 = 'ADDR'")
        val df_phone = df.where("_c1 = 'PHONE'")

        // one list of address structs per personId
        val df_addr_f = df_address
          .withColumn("addr",struct(col("_c1"),col("_c2")
            ,col("_c3"),col("_c4"),col("_c5"),col("_c6")))
          .groupBy(col("_c0").as("personId")).agg(collect_list(col("addr")).as("addr"))

        // one list of phone structs per personId
        val df_phone_f = df_phone.groupBy(col("_c0").as("personId")).agg(collect_list(struct(col("_c1"),col("_c2")
          ,col("_c3"),col("_c4"),col("_c5"))).as("Phone"))

        // stitch the three data frames back together on personId
        val final_df = df_person.join(df_addr_f,"personId").join(df_phone_f,"personId")

        final_df.show(false)
      }
    }

It produces the below output (note that zipCode is missing from the addr structs, for the reason flagged in the comment above):

+--------+------+---------+--------+----+-----+---+-----------------------------------------------------------------------------------------------------+--------------------------------------------------------------+
|personId|type  |firstName|lastName|year|month|day|addr                                                                                                 |Phone                                                         |
+--------+------+---------+--------+----+-----+---+-----------------------------------------------------------------------------------------------------+--------------------------------------------------------------+
|98480   |PERSON|TOM      |GREER   |1982|12   |27 |[[ADDR, RES, 102, JFK BLVD, PISCATAWAY, NJ], [ADDR, OFF, 211, EXCHANGE PL, JERSEY CITY, NJ]]         |[[PHONE, CELL, 732, 201, 6789], [PHONE, HOME, 732, 123, 9876]]|
|98481   |PERSON|LIN      |JASSOY  |1976|09   |15 |[[ADDR, RES, 111, JOURNAL SQ, JERSEY CITY, NJ], [ADDR, OFF, 365, DOWNTOWN NEWYORK, NEWYORK CITY, NY]]|[[PHONE, CELL, 908, 398, 3389], [PHONE, HOME, 917, 363, 2647]]|
+--------+------+---------+--------+----+-----+---+-----------------------------------------------------------------------------------------------------+--------------------------------------------------------------+

IIUC, you can read your data in line-mode, do some data manipulation, and then use collect_list or collect_set to get the desired result:

from pyspark.sql.functions import expr, substring_index

# read the files into dataframe with a single column named `value`
df = spark.read.text('/path/to/file/')

Split the lines into two columns: personId (the 1st field) and an ArrayType column data (the rest of the fields):

df1 = df.withColumn('personId', substring_index('value', '|', 1)) \
    .selectExpr('personId', 'split(substr(value, length(personId)+2), "[|]") as data')    
#+--------+--------------------+
#|personId|                data|
#+--------+--------------------+
#|   98480|[PERSON, TOM, GRE...|
#|   98480|[PHONE, CELL, 732...|
#|   98480|[PHONE, HOME, 732...|
#|   98480|[ADDR, RES, 102, ...|
#|   98480|[ADDR, OFF, 211, ...|
#|   98481|[PERSON, LIN, JAS...|
#|   98481|[PHONE, CELL, 908...|
#|   98481|[PHONE, HOME, 917...|
#|   98481|[ADDR, RES, 111, ...|
#|   98481|[ADDR, OFF, 365, ...|
#+--------+--------------------+

Use groupby + collect_list (or collect_set). Notice that collect_list/collect_set will skip items having NULL values. Below we use collect_list to create 3 ArrayType columns based on the value of data[0]:

(1) If data[0] == PHONE or ADDR, convert data into a struct; the result will be an array of structs.

(2) If data[0] == PERSON, keep data as an ArrayType column, take the first item (named d1) from the resulting array of arrays, and then use selectExpr to convert this array d1 into 6 separate columns.

df1.groupby('personId') \
    .agg(
      expr("collect_list(IF(data[0] = 'PERSON', data, NULL))[0] as d1"),
      expr("""
        collect_list(
          IF(data[0] = 'PHONE'
          , (data[0] as phColType,
             data[1] as phoneType,
             data[2] as areaCode,
             data[3] as phoneMiddle,
             data[4] as ext)
          , NULL)
        ) AS Phone"""),
      expr("""
        collect_list(
          IF(data[0] = 'ADDR'
          , (data[0] as addrColType,
             data[1] as addrType,
             data[2] as addr1,
             data[3] as addr2,
             data[4] as city,
             data[5] as state,
             data[6] as zipCode)
          , NULL)
        ) AS Addr""")
    ).selectExpr(
      'personId',
      'd1[0] as type',
      'd1[1] as firstName',
      'd1[2] as lastName',
      'd1[3] as year',
      'd1[4] as month',
      'd1[5] as day',
      'Phone',
      'Addr'
    ).show(truncate=False)

The result (both Phone and Addr are arrays of structs):

+--------+------+---------+--------+----+-----+---+--------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------+
|personId|type  |firstName|lastName|year|month|day|Phone                                                         |Addr                                                                                                               |
+--------+------+---------+--------+----+-----+---+--------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------+
|98481   |PERSON|LIN      |JASSOY  |1976|09   |15 |[[PHONE, CELL, 908, 398, 3389], [PHONE, HOME, 917, 363, 2647]]|[[ADDR, RES, 111, JOURNAL SQ, JERSEY CITY, NJ, 07704], [ADDR, OFF, 365, DOWNTOWN NEWYORK, NEWYORK CITY, NY, 10001]]|
|98480   |PERSON|TOM      |GREER   |1982|12   |27 |[[PHONE, CELL, 732, 201, 6789], [PHONE, HOME, 732, 123, 9876]]|[[ADDR, RES, 102, JFK BLVD, PISCATAWAY, NJ, 08854], [ADDR, OFF, 211, EXCHANGE PL, JERSEY CITY, NJ, 07302]]         |
+--------+------+---------+--------+----+-----+---+--------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------+
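
Since the question's use case is Java Spark, below is a rough Java sketch of the same single-pass idea: one groupBy with a conditional collect_list per record type, so no per-type dataframes and no joins. This is only a sketch under the same assumptions as the PySpark version above; the class name, master setting, and input path are placeholders:

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.expr;
import static org.apache.spark.sql.functions.substring_index;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SinglePassGrouping {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("single-pass grouping").master("local[1]").getOrCreate();

        // read each line into a single `value` column, then split it into
        // personId (1st field) and an array column `data` (remaining fields)
        Dataset<Row> df1 = spark.read().text("/path/to/file/")
                .withColumn("personId", substring_index(col("value"), "|", 1))
                .selectExpr("personId",
                        "split(substr(value, length(personId)+2), '[|]') as data");

        // one groupBy; collect_list drops the NULLs produced for rows
        // whose data[0] does not match the record type being collected
        Dataset<Row> result = df1.groupBy("personId").agg(
                expr("collect_list(IF(data[0] = 'PERSON', data, NULL))[0]").as("d1"),
                expr("collect_list(IF(data[0] = 'PHONE',"
                        + " (data[0] as phColType, data[1] as phoneType, data[2] as areaCode,"
                        + " data[3] as phoneMiddle, data[4] as ext), NULL))").as("Phone"),
                expr("collect_list(IF(data[0] = 'ADDR',"
                        + " (data[0] as addrColType, data[1] as addrType, data[2] as addr1,"
                        + " data[3] as addr2, data[4] as city, data[5] as state,"
                        + " data[6] as zipCode), NULL))").as("Addr"))
                .selectExpr("personId", "d1[0] as type", "d1[1] as firstName",
                        "d1[2] as lastName", "d1[3] as year", "d1[4] as month",
                        "d1[5] as day", "Phone", "Addr");

        result.show(false);
    }
}

Everything happens in a single aggregation over one dataframe, which avoids the per-type splits and the two joins from the question's update.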
