
Create new columns from XML field within PySpark DataFrame

Within my DataFrame object I have a column Foos, for example:

<?xml version="1.0" encoding="utf-8"?>
<foos>
    <foo id="123" X="58" Y="M" />
    <foos id="456" X="29" Y="M" />
    <foos id="789" X="44" Y="F" />
</foos>

Each <foo> has an id, X and Y attribute, and I want to create a column for each of them.

How can I parse the XML such that I can create new columns for each attribute? Does this require a UDF for each attribute, or is it possible to extract all three into separate columns in one function?

So far I receive an error with:

parsed = (lambda x: ET.fromstring(x).find('X').text)
udf = udf(parsed)
parsed_df = df.withColumn("X Column", udf("Foos"))

As mck suggested, the XML does not look correct. You can install the Maven package com.databricks:spark-xml_2.11:0.10.0 and read the XML file directly using spark.read:

df = spark.read \
    .format("com.databricks.spark.xml") \
    .option("rowTag", "foos") \
    .load("/FileStore/tables/test.xml")
df.show(truncate=False)

This is what I get from the XML you provided; you might need to check the XML file:

+--------------+--------------------------------+
|foo           |foos                            |
+--------------+--------------------------------+
|[, 58, M, 123]|[[, 29, M, 456], [, 44, F, 789]]|
+--------------+--------------------------------+
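For the UDF route asked about in the question, a minimal sketch in plain Python may help. It assumes the XML arrives as a string, and that id, X and Y are attributes of the child elements, which is why `find('X')` returns None (it looks for a child element named X) and `.text` then fails. The function name `extract_foo_attributes` is hypothetical, and iterating over every child of the root regardless of tag name is an assumption made because the sample mixes <foo> and <foos> children.

```python
import xml.etree.ElementTree as ET


def extract_foo_attributes(xml_string):
    """Return one (id, X, Y) tuple per child element of the root.

    Hypothetical helper: it reads attributes with .get(), because
    id, X and Y are attributes rather than child elements.
    """
    if xml_string is None:
        return []
    root = ET.fromstring(xml_string)
    # Iterate over all direct children, whatever their tag name
    # (assumption: the sample mixes <foo> and <foos> children).
    return [(child.get("id"), child.get("X"), child.get("Y")) for child in root]


# The sample from the question, unchanged.
sample = (
    '<?xml version="1.0" encoding="utf-8"?> <foos> '
    '<foo id="123" X="58" Y="M" /> '
    '<foos id="456" X="29" Y="M" /> '
    '<foos id="789" X="44" Y="F" /> </foos>'
)

rows = extract_foo_attributes(sample)
for row in rows:
    print(row)
```

A single function like this could probably be wrapped in one pyspark.sql.functions.udf with an array-of-struct return type, so that all three attributes come out of one call and can then be expanded into separate columns. That wiring is not shown here and would need to be checked against your Spark version.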

