Create new columns from XML field within PySpark DataFrame
Within my DataFrame object I have a column Foos, for example:
<?xml version="1.0" encoding="utf-8"?> <foos> <foo id="123" X="58" Y="M" /> <foos id="456" X="29" Y="M" /> <foos id="789" X="44" Y="F" /> </foos>
Each <foo> has an id, X, and Y attribute, and I want to create a column for each.
How can I parse the XML so that I can create new columns for each attribute? Does this require a UDF for each attribute, or is it possible to extract all three into separate columns in one function?
So far I receive an error with:
parsed = (lambda x: ET.fromstring(x).find('X').text)
udf = udf(parsed)
parsed_df = df.withColumn("X Column", udf("Foos"))
As mck suggested, the XML doesn't look correct. You can install the Maven package com.databricks:spark-xml_2.11:0.10.0 and read the XML file directly using spark.read:
df = spark.read \
.format("com.databricks.spark.xml") \
.option("rowTag", "foos") \
.load("/FileStore/tables/test.xml")
df.show(truncate=False)
+--------------+--------------------------------+
|foo |foos |
+--------------+--------------------------------+
|[, 58, M, 123]|[[, 29, M, 456], [, 44, F, 789]]|
+--------------+--------------------------------+
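On the UDF route raised in the question: find('X') searches for a child element named X, whereas X is an attribute here, so it returns None and .text fails. A minimal sketch of a single function that pulls all three attributes at once, using only the standard-library ElementTree, might look like the following. The names parse_foos and add_foo_columns are hypothetical, and the Spark wiring assumes the Foos column holds the XML as a string; it has not been run against the asker's data.

```python
import xml.etree.ElementTree as ET


def parse_foos(xml_string):
    """Return one (id, X, Y) tuple per child element of the root."""
    if xml_string is None:
        return []
    # Encode to bytes so the parser honours the encoding declaration.
    root = ET.fromstring(xml_string.encode("utf-8"))
    # Iterate over every child regardless of tag name, because the sample
    # mixes <foo> and <foos> children; attributes are read with .get().
    return [(child.get("id"), child.get("X"), child.get("Y")) for child in root]


def add_foo_columns(df, column="Foos"):
    """Hypothetical helper: explode the parsed tuples into id, X, Y columns."""
    # Imported lazily so parse_foos can be tried without Spark installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, StringType, StructField, StructType

    schema = ArrayType(StructType([
        StructField("id", StringType()),
        StructField("X", StringType()),
        StructField("Y", StringType()),
    ]))
    # One UDF returns all three attributes together as an array of structs.
    parse_udf = F.udf(parse_foos, schema)
    exploded = df.withColumn("foo", F.explode(parse_udf(F.col(column))))
    return exploded.select("*", "foo.id", "foo.X", "foo.Y").drop("foo")
```

Calling add_foo_columns(df) would yield one row per child element. The attribute values come back as strings, so X would need a cast if a numeric column is wanted.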