
How to split a Spark dataframe column of ArrayType(StructType) into multiple columns in PySpark?

I am reading XML using Databricks spark-xml with the schema below. The subelement X_PAT can occur more than once; to handle this I have used ArrayType(StructType). The next transformation is to create multiple columns out of this single column.

<root_tag>
   <id>fff9</id>
   <X1000>
      <X_PAT>
         <X_PAT01>IC</X_PAT01>
         <X_PAT02>EDISUPPORT</X_PAT02>
         <X_PAT03>TE</X_PAT03>
      </X_PAT>
      <X_PAT>
         <X_PAT01>IC1</X_PAT01>
         <X_PAT02>EDISUPPORT1</X_PAT02>
         <X_PAT03>TE1</X_PAT03>
      </X_PAT>
   </X1000>
</root_tag>
from pyspark.sql import SparkSession
from pyspark.sql.types import *

jar_path = "/Users/nsrinivas/com.databricks_spark-xml_2.10-0.4.1.jar"

spark = SparkSession.builder.appName("Spark - XML read").master("local[*]") \
    .config("spark.jars", jar_path) \
    .config("spark.executor.extraClassPath", jar_path) \
    .config("spark.executor.extraLibrary", jar_path) \
    .config("spark.driver.extraClassPath", jar_path) \
    .getOrCreate()

xml_schema = StructType()
xml_schema.add("id", StringType(), True)
x1000 = StructType([
    StructField("X_PAT",
                ArrayType(StructType([
                    StructField("X_PAT01", StringType()),
                    StructField("X_PAT02", StringType()),
                    StructField("X_PAT03", StringType())]))),
])
xml_schema.add("X1000", x1000, True)

df = spark.read.format("xml").option("rowTag", "root_tag").option("valueTag", False) \
    .load("root_tag.xml", schema=xml_schema)

df.select("id", "X1000.X_PAT").show(truncate=False)
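For reference, a sketch of what df.printSchema() should print for the schema defined above (derived from the schema definition, not captured from a run):

root
 |-- id: string (nullable = true)
 |-- X1000: struct (nullable = true)
 |    |-- X_PAT: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- X_PAT01: string (nullable = true)
 |    |    |    |-- X_PAT02: string (nullable = true)
 |    |    |    |-- X_PAT03: string (nullable = true)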

I get the output below:

+------------+--------------------------------------------+
|id          |X_PAT                                       |
+------------+--------------------------------------------+
|fff9        |[[IC1, SUPPORT1, TE1], [IC2, SUPPORT2, TE2]]|
+------------+--------------------------------------------+

but I want X_PAT to be flattened into multiple columns like below; then I will rename the columns.

+----+-------+--------+-------+-------+--------+-------+
|id  |X_PAT01|X_PAT02 |X_PAT03|X_PAT01|X_PAT02 |X_PAT03|
+----+-------+--------+-------+-------+--------+-------+
|fff9|IC1    |SUPPORT1|TE1    |IC2    |SUPPORT2|TE2    |
+----+-------+--------+-------+-------+--------+-------+

Then I would rename the new columns as below:

id|XPAT_1_01|XPAT_1_02|XPAT_1_03|XPAT_2_01|XPAT_2_02|XPAT_2_03|

I tried using X1000.X_PAT.* but it throws the error below:

pyspark.sql.utils.AnalysisException: 'Can only star expand struct data types. Attribute: ArrayBuffer(L_1000A, S_PER);'
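Star expansion (.*) works only on a StructType column, and X1000.X_PAT is an ArrayType(StructType), which is why it fails here. A minimal sketch (against the df read above) of indexing into the array first, after which the resulting struct can be star-expanded:

from pyspark.sql.functions import col

# Indexing into the array yields a struct, which *can* be star-expanded:
df.select("id", col("X1000.X_PAT").getItem(0).alias("first_pat")) \
  .select("id", "first_pat.*") \
  .show(truncate=False)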

Any ideas please?

Try this:

from pyspark.sql import functions as F

df = spark.createDataFrame(
    [('1', [['IC1', 'SUPPORT1', 'TE1'], ['IC2', 'SUPPORT2', 'TE2']]),
     ('2', [['IC1', 'SUPPORT1', 'TE1'], ['IC2', 'SUPPORT2', 'TE2']])],
    ['id', 'X_PAT01'])

df.show(truncate=False):

+---+--------------------------------------------+
|id |X_PAT01                                     |
+---+--------------------------------------------+
|1  |[[IC1, SUPPORT1, TE1], [IC2, SUPPORT2, TE2]]|
|2  |[[IC1, SUPPORT1, TE1], [IC2, SUPPORT2, TE2]]|
+---+--------------------------------------------+

Define a function to parse the data:

def create_column(df):
    # Collect the nested array from the first row only; every row in the
    # result is stamped with these same literal values.
    data = df.select('X_PAT01').collect()[0][0]
    for each_list in range(len(data)):
        for each_item in range(len(data[each_list])):
            # One new literal column per string, named X_PAT_<list>_0<item>
            df = df.withColumn('X_PAT_' + str(each_list) + '_0' + str(each_item),
                               F.lit(data[each_list][each_item]))
    return df

Calling it:

df = create_column(df)

Output (df.show(truncate=False)):

+---+--------------------------------------------+----------+----------+----------+----------+----------+----------+
|id |X_PAT01                                     |X_PAT_0_00|X_PAT_0_01|X_PAT_0_02|X_PAT_1_00|X_PAT_1_01|X_PAT_1_02|
+---+--------------------------------------------+----------+----------+----------+----------+----------+----------+
|1  |[[IC1, SUPPORT1, TE1], [IC2, SUPPORT2, TE2]]|IC1       |SUPPORT1  |TE1       |IC2       |SUPPORT2  |TE2       |
|2  |[[IC1, SUPPORT1, TE1], [IC2, SUPPORT2, TE2]]|IC1       |SUPPORT1  |TE1       |IC2       |SUPPORT2  |TE2       |
+---+--------------------------------------------+----------+----------+----------+----------+----------+----------+

This is a simple approach to horizontally explode the array elements, as per your requirement:

from pyspark.sql.functions import col

df2 = (df1
       .select('id',
               *(col('X_PAT')
                 .getItem(i)  # Fetch the i-th nested array element
                 .getItem(j)  # Fetch the j-th string within that element
                 .alias(f'X_PAT_{i+1}_{str(j+1).zfill(2)}')  # Format the column alias
                 for i in range(2)  # outer loop over nested elements
                 for j in range(3)  # inner loop over strings
                 )
               )
       )

Input vs Output:

Input(df1):

+----+--------------------------------------------+
|id  |X_PAT                                       |
+----+--------------------------------------------+
|fff9|[[IC1, SUPPORT1, TE1], [IC2, SUPPORT2, TE2]]|
+----+--------------------------------------------+

Output(df2):

+----+----------+----------+----------+----------+----------+----------+
|  id|X_PAT_1_01|X_PAT_1_02|X_PAT_1_03|X_PAT_2_01|X_PAT_2_02|X_PAT_2_03|
+----+----------+----------+----------+----------+----------+----------+
|fff9|       IC1|  SUPPORT1|       TE1|       IC2|  SUPPORT2|       TE2|
+----+----------+----------+----------+----------+----------+----------+

Although this involves for loops, the operations are performed directly on the dataframe (without collecting or converting to an RDD), so you should not encounter any issues.
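If the array sizes vary or are not known in advance, a possible variant is to derive the loop bounds from the data instead of hard-coding range(2) and range(3). A sketch, assuming every row's X_PAT has the same shape as the first row's:

from pyspark.sql.functions import col

# Assumption: all rows share the first row's shape.
first = df1.select('X_PAT').first()[0]
n_outer = len(first)      # number of nested elements (2 in this example)
n_inner = len(first[0])   # strings per nested element (3 in this example)

df2 = df1.select(
    'id',
    *(col('X_PAT').getItem(i).getItem(j)
        .alias(f'X_PAT_{i+1}_{str(j+1).zfill(2)}')
      for i in range(n_outer)
      for j in range(n_inner)))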
