How to split a spark dataframe column of ArrayType(StructType) to multiple columns in pyspark?
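I am reading XML using Databricks spark-xml with the schema below. The child element X_PAT can occur more than once, so to handle this I used ArrayType(StructType). The next transformation is to create multiple columns from this one column.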
<root_tag>
  <id>fff9</id>
  <X1000>
    <X_PAT>
      <X_PAT01>IC</X_PAT01>
      <X_PAT02>EDISUPPORT</X_PAT02>
      <X_PAT03>TE</X_PAT03>
    </X_PAT>
    <X_PAT>
      <X_PAT01>IC1</X_PAT01>
      <X_PAT02>EDISUPPORT1</X_PAT02>
      <X_PAT03>TE1</X_PAT03>
    </X_PAT>
  </X1000>
</root_tag>
from pyspark.sql import SparkSession
from pyspark.sql.types import *

jar_path = "/Users/nsrinivas/com.databricks_spark-xml_2.10-0.4.1.jar"
spark = SparkSession.builder.appName("Spark - XML read").master("local[*]") \
    .config("spark.jars", jar_path) \
    .config("spark.executor.extraClassPath", jar_path) \
    .config("spark.executor.extraLibrary", jar_path) \
    .config("spark.driver.extraClassPath", jar_path) \
    .getOrCreate()

xml_schema = StructType()
xml_schema.add("id", StringType(), True)
x1000 = StructType([
    StructField("X_PAT",
                ArrayType(StructType([
                    StructField("X_PAT01", StringType()),
                    StructField("X_PAT02", StringType()),
                    StructField("X_PAT03", StringType())]))),
])
xml_schema.add("X1000", x1000, True)

df = spark.read.format("xml").option("rowTag", "root_tag").option("valueTag", False) \
    .load("root_tag.xml", schema=xml_schema)
df.select("id", "X1000.X_PAT").show(truncate=False)
I get the following output:
+----+--------------------------------------------+
|id  |X_PAT                                       |
+----+--------------------------------------------+
|fff9|[[IC1, SUPPORT1, TE1], [IC2, SUPPORT2, TE2]]|
+----+--------------------------------------------+
But I want X_PAT to be flattened into multiple columns, as shown below, and then I will rename those columns.
+----+-------+--------+-------+-------+--------+-------+
|id  |X_PAT01|X_PAT02 |X_PAT03|X_PAT01|X_PAT02 |X_PAT03|
+----+-------+--------+-------+-------+--------+-------+
|fff9|IC1    |SUPPORT1|TE1    |IC2    |SUPPORT2|TE2    |
+----+-------+--------+-------+-------+--------+-------+
Then I will rename the new columns as follows:
id|XPAT_1_01|XPAT_1_02|XPAT_1_03|XPAT_2_01|XPAT_2_02|XPAT_2_03|
I tried using X1000.X_PAT.*
but it throws the error pyspark.sql.utils.AnalysisException: 'Can only star expand struct data types. Attribute: ArrayBuffer(L_1000A, S_PER);'
Any ideas?
Try this:
df = spark.createDataFrame([('1',[['IC1', 'SUPPORT1', 'TE1'],['IC2', 'SUPPORT2', 'TE2']]),('2',[['IC1', 'SUPPORT1', 'TE1'],['IC2','SUPPORT2', 'TE2']])],['id','X_PAT01'])
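Define a function to parse the data:
from pyspark.sql import functions as F

def create_column(df):
    # Look at the first row only to learn the array dimensions
    data = df.select('X_PAT01').collect()[0][0]
    for each_list in range(len(data)):
        for each_item in range(len(data[each_list])):
            # Index into the column (not F.lit of the first row) so each row keeps its own values
            df = df.withColumn('X_PAT_' + str(each_list) + '_0' + str(each_item),
                               F.col('X_PAT01')[each_list][each_item])
    return df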
Call it:
df = create_column(df)
Here is a simple way to expand the array elements horizontally, as per your requirement:
from pyspark.sql.functions import col

df2 = (df1
       .select('id',
               *(col('X_PAT')
                 .getItem(i)  # Fetch the nested array elements
                 .getItem(j)  # Fetch the individual string elements from each nested array element
                 .alias(f'X_PAT_{i+1}_{str(j+1).zfill(2)}')  # Format the column alias
                 for i in range(2)  # outer loop
                 for j in range(3)  # inner loop
                 )
               )
       )
Input and Output:
Input(df1):
+----+--------------------------------------------+
|id  |X_PAT                                       |
+----+--------------------------------------------+
|fff9|[[IC1, SUPPORT1, TE1], [IC2, SUPPORT2, TE2]]|
+----+--------------------------------------------+
Output(df2):
+----+----------+----------+----------+----------+----------+----------+
|  id|X_PAT_1_01|X_PAT_1_02|X_PAT_1_03|X_PAT_2_01|X_PAT_2_02|X_PAT_2_03|
+----+----------+----------+----------+----------+----------+----------+
|fff9|       IC1|  SUPPORT1|       TE1|       IC2|  SUPPORT2|       TE2|
+----+----------+----------+----------+----------+----------+----------+
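Although this involves for loops, they only build the list of column expressions on the driver. The selection itself runs directly on the DataFrame (no collecting or converting to an RDD), so you should not run into any problems.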
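The loop bounds above are hard-coded as range(2) and range(3). As a rough sketch only, the same idea can be parameterised so that the aliases match the XPAT_1_01 style asked for in the question. The names flattened_aliases, flatten_x_pat and prefix are hypothetical helpers introduced here, not part of PySpark. The sketch assumes the question's schema (X1000.X_PAT as an array of structs) and that every field name ends in a two-digit suffix such as "01".

```python
def flattened_aliases(n_elements, field_names, prefix="XPAT"):
    """Return (array_index, field_name, alias) triples in output-column order."""
    triples = []
    for i in range(n_elements):
        for name in field_names:
            suffix = name[-2:]  # assumption: "X_PAT01" ends in the two digits "01"
            triples.append((i, name, f"{prefix}_{i + 1}_{suffix}"))
    return triples


def flatten_x_pat(df, n_elements, field_names):
    """Select id plus one column per (array element, struct field) pair."""
    # Imported here so flattened_aliases can be used without Spark installed.
    from pyspark.sql.functions import col

    exprs = [
        # getItem picks the i-th struct from the array, getField picks one field by name
        col("X1000.X_PAT").getItem(i).getField(name).alias(alias)
        for i, name, alias in flattened_aliases(n_elements, field_names)
    ]
    return df.select("id", *exprs)
```

For the question's data this would be called as flatten_x_pat(df, 2, ["X_PAT01", "X_PAT02", "X_PAT03"]). If the number of X_PAT elements is not known in advance, one option is to compute it first with an aggregation over size("X1000.X_PAT"), at the cost of an extra pass over the data. Rows with fewer elements than n_elements would be expected to yield nulls for the missing positions under the default non-ANSI settings, though that behaviour is worth verifying on your Spark version.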