Creating new columns from an XML field in a PySpark DataFrame

Question

Within my DataFrame object I have a column, Foos, that holds XML. For example:

<?xml version="1.0" encoding="utf-8"?> <foos> <foo id="123" X="58" Y="M" /> <foos id="456" X="29" Y="M" /> <foos id="789" X="44" Y="F" /> </foos>

Each <foo> has id, X and Y attributes, and I want to create a column for each of them.

How can I parse the XML so that I can create a new column for each attribute? Does this require a UDF per attribute, or is it possible to extract all three into separate columns in one function?

So far I get an error with:

parsed = (lambda x: ET.fromstring(x).find('X').text)
udf = udf(parsed)
parsed_df = df.withColumn("X Column", udf("Foos"))

Answer 1

As mck suggested, the XML doesn't look correct. You can install the Maven package com.databricks:spark-xml_2.11:0.10.0 and read an XML file directly using spark.read:

df = spark.read \
    .format("com.databricks.spark.xml") \
    .option("rowTag", "foos") \
    .load("/FileStore/tables/test.xml")
df.show(truncate=False)

This is what I get from the XML you provided. You might need to look into the XML file:

+--------------+--------------------------------+
|foo           |foos                            |
+--------------+--------------------------------+
|[, 58, M, 123]|[[, 29, M, 456], [, 44, F, 789]]|
+--------------+--------------------------------+
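If the XML has to stay in a DataFrame column rather than a file, all three attributes can be extracted in one function instead of one UDF per attribute. The following is only a minimal sketch and not part of the answer above: extract_foo_attrs and add_foo_columns are hypothetical names. It reads id, X and Y with .get() because they are attributes; in the question's lambda, find('X') looks for a child element named X, returns None, and .text then fails. It also walks every child of the root, so the <foos id=...> children in the sample are picked up as well.

```python
import xml.etree.ElementTree as ET


def extract_foo_attrs(xml_string):
    """Return one (id, X, Y) tuple per child element of the XML root."""
    if xml_string is None:
        return []
    root = ET.fromstring(xml_string)
    # id, X and Y are attributes, so use .get() rather than .find(...).text.
    return [(child.get("id"), child.get("X"), child.get("Y")) for child in root]


def add_foo_columns(df, xml_col="Foos"):
    """Hypothetical helper: one row per child element, one column per attribute."""
    # Imported here so the parsing function above can be tried without PySpark.
    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, StringType, StructField, StructType

    # The UDF returns an array of structs, so one call yields all three values.
    schema = ArrayType(StructType([
        StructField("id", StringType()),
        StructField("X", StringType()),
        StructField("Y", StringType()),
    ]))
    parse_udf = F.udf(extract_foo_attrs, schema)
    return (
        df.withColumn("foo", F.explode(parse_udf(F.col(xml_col))))
          .select("*", "foo.id", "foo.X", "foo.Y")
          .drop("foo")
    )
```

Calling add_foo_columns(df) should give one row per element, with id, X and Y as string columns; cast X afterwards if a numeric type is needed. Note that explode drops rows whose XML has no child elements.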

Solution 1
0 2020-11-06 08:11:51
