How do I parse xml documents in Palantir Foundry?
I have a set of .xml documents that I want to parse.
I previously tried to parse them using methods that take the file contents and dump them into a single cell, but I've noticed this doesn't work well in practice: I'm seeing slower and slower run times, often with a single task taking tens of hours to run.
My first transform takes the .xml contents and puts them into a single cell, and a second transform takes this string and uses Python's xml library to parse the string into a document. From this document I can then extract properties and return a DataFrame.
I'm using a UDF to map the string contents to the fields I want.
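A sketch of that pattern, to make the setup concrete (extract_field1 and the contents column are illustrative names, not my actual code):

import xml.etree.ElementTree as ET

from pyspark.sql import functions as F
from pyspark.sql import types as T

# One row per file: the "contents" column holds the entire file as a string,
# and the UDF re-parses it row by row inside Python worker processes.
@F.udf(returnType=T.StringType())
def extract_field1(xml_string):
    root = ET.fromstring(xml_string)
    node = root.find("field1")
    return node.text if node is not None else None

# parsed = df.withColumn("field1", extract_field1(F.col("contents")))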
How can I make this faster / work better with large .xml files?
For this problem, we're going to combine a couple of different techniques to make this code both testable and highly scalable.
When parsing raw files, you have a couple of options you can consider. In our case, we can use the Databricks spark-xml parser to great effect.
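To preview what that looks like (the full transform follows below), reading a file with spark-xml is a one-liner once the library is on the classpath; the rowTag option names the XML element that becomes one output row:

df = spark_session.read.format('xml').options(rowTag="tag").load("/path/to/file.xml")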
In general, you should also avoid using the .udf method, as it is likely being used in place of good functionality already available in the Spark API. UDFs are not as performant as native methods and should be used only when no other option is available. A good example of UDFs covering up hidden problems is string manipulation of column contents; while you technically can use a UDF to do things like splitting and trimming strings, these operations already exist in the Spark API and will be orders of magnitude faster than your own code, as in the sketch below.
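For instance, on a hypothetical column named raw, the native functions handle common string cleanups directly:

from pyspark.sql import functions as F

# Native column expressions run inside the JVM, with no Python round-trip per row
cleaned = df.withColumn("trimmed", F.trim(F.col("raw")))
cleaned = cleaned.withColumn("parts", F.split(F.col("raw"), ","))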
Our design is going to use the following: the Databricks spark-xml parser for the heavy lifting, plus a test-driven setup so we can iterate on the parsing logic quickly.
First, we need to add the .jar to our spark_session available inside Transforms. Thanks to recent improvements, the condaJars argument, when configured, will allow you to use the .jar in both Preview/Test and at full build time. Previously, this would have required a full build, but not so now.
We need to go to our transforms-python/build.gradle file and add 2 blocks of config:

1. the pytest plugin
2. the condaJars argument, declaring the .jar dependency

My /transforms-python/build.gradle now looks like the following:
buildscript {
    repositories {
        // some other things
    }

    dependencies {
        classpath "com.palantir.transforms.python:lang-python-gradle-plugin:${transformsLangPythonPluginVersion}"
    }
}

apply plugin: 'com.palantir.transforms.lang.python'
apply plugin: 'com.palantir.transforms.lang.python-defaults'

dependencies {
    condaJars "com.databricks:spark-xml_2.13:0.14.0"
}
// Apply the testing plugin
apply plugin: 'com.palantir.transforms.lang.pytest-defaults'
// ... some other awesome features you should enable
After applying this config, you'll want to restart your Code Assist session by clicking on the bottom ribbon and hitting Refresh.
After refreshing Code Assist, we now have low-level functionality available to parse our .xml files; now we need to test it! If we adopt the same style of test-driven development as here, we end up with /transforms-python/src/myproject/datasets/xml_parse_transform.py with the following contents:
from transforms.api import transform, Output, Input
from transforms.verbs.dataframes import union_many


def read_files(spark_session, paths):
    # Parse each .xml file into its own DataFrame using the spark-xml reader,
    # where rowTag names the element that becomes one output row
    parsed_dfs = []
    for file_name in paths:
        parsed_df = spark_session.read.format('xml').options(rowTag="tag").load(file_name)
        parsed_dfs += [parsed_df]
    # Combine the per-file DataFrames with a wide union
    output_df = union_many(*parsed_dfs, how="wide")
    return output_df


@transform(
    the_output=Output("my.awesome.output"),
    the_input=Input("my.awesome.input"),
)
def my_compute_function(the_input, the_output, ctx):
    session = ctx.spark_session
    input_filesystem = the_input.filesystem()
    hadoop_path = input_filesystem.hadoop_path
    # ls() yields FileStatus entries whose .path is relative to the dataset root
    files = [hadoop_path + "/" + status.path for status in input_filesystem.ls(glob="**/*.xml")]
    output_df = read_files(session, files)
    the_output.write_dataframe(output_df)
... an example file /transforms-python/test/myproject/datasets/sample.xml with contents:
<tag>
  <field1>
    my_value
  </field1>
</tag>
And a test file /transforms-python/test/myproject/datasets/test_xml_parse_transform.py:
from myproject.datasets import xml_parse_transform
from pkg_resources import resource_filename


def test_parse_xml(spark_session):
    # Resolve the sample.xml that sits next to this test module
    file_path = resource_filename(__name__, "sample.xml")
    parsed_df = xml_parse_transform.read_files(spark_session, [file_path])
    assert parsed_df.count() == 1
    assert set(parsed_df.columns) == {"field1"}
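As an aside: if you ever need a similar wide-style union without transforms.verbs, a minimal native-PySpark sketch (assuming Spark 3.1+ for allowMissingColumns) is:

from functools import reduce


def union_wide(*dfs):
    # unionByName matches columns by name and null-fills any column
    # missing from one side (allowMissingColumns requires Spark 3.1+)
    return reduce(lambda a, b: a.unionByName(b, allowMissingColumns=True), dfs)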
We now have:

- an .xml parser that is highly scalable
- a test we can use to quickly iterate on the parsing logic

Cheers