How do I parse xml documents in Palantir Foundry?

I have a set of .xml documents that I want to parse.

I previously tried to parse them using methods that take the file contents and dump them into a single cell. However, I've noticed this doesn't work in practice: I'm seeing slower and slower run times, often with one task taking tens of hours to run.

My first transform takes the .xml contents and puts them into a single cell, and a second transform takes this string and uses Python's xml library to parse it into a document. From this document I can then extract properties and return a DataFrame.

I'm using a UDF to map the string contents to the fields I want, roughly as sketched below.
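For reference, a minimal sketch of this UDF-based pattern (the column name xml_contents, the field field1, and the input DataFrame df are all hypothetical):

import xml.etree.ElementTree as ET

from pyspark.sql import functions as F, types as T


def parse_xml_string(xml_string):
    # Parse an entire document held in a single cell and pull out one field.
    root = ET.fromstring(xml_string)
    return root.findtext("field1")


# Wrapping the parser in a UDF forces every document through Python row by row.
parse_udf = F.udf(parse_xml_string, T.StringType())

# df is assumed to have one row per file, with the raw contents in "xml_contents".
parsed_df = df.withColumn("field1", parse_udf(F.col("xml_contents")))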

How can I make this faster / work better with large .xml files?

For this problem, we're going to combine a couple of different techniques to make this code both testable and highly scalable.

Theory

When parsing raw files, you have a couple of options you can consider:

  1. ❌ You can write your own parser to read bytes from files and convert them into data Spark can understand.
    • This is highly discouraged due to the engineering time required and the unscalable architecture. It doesn't take advantage of distributed compute, since you must bring the entire raw file to your parsing method before you can use it. This is not an effective use of your resources.
  2. ⚠ You can use a parser library not made for Spark, such as the XML Python library mentioned in the question.
    • While this is less difficult than writing your own parser, it still does not take advantage of distributed computation in Spark. It is easier to get something running, but it will eventually hit a performance limit because it does not use the low-level Spark functionality only exposed when writing a Spark library.
  3. ✅ You can use a Spark-native raw file parser.
    • This is the preferred option in all cases, as it takes advantage of low-level Spark functionality and doesn't require you to write your own code. If a low-level Spark parser exists, you should use it.

In our case, we can use the Databricks parser to great effect.
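As a rough illustration (the path and the rowTag value are placeholders), the Databricks spark-xml package registers an "xml" data source, so a read looks something like this:

# spark is the active SparkSession; "tag" is whatever element marks one record.
parsed_df = spark.read.format("xml").options(rowTag="tag").load("/path/to/file.xml")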

In general, you should also avoid the .udf method, since it is often used in place of functionality already available in the Spark API. UDFs are not as performant as native methods and should be used only when no other option is available.

A good example of UDFs covering up hidden problems is string manipulation of column contents; while you technically can use a UDF to split and trim strings, these operations already exist in the Spark API and will be orders of magnitude faster than your own code, as in the short sketch below.
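For example, a minimal sketch of the split/trim case using the built-in column functions (raw_df and its columns are hypothetical):

from pyspark.sql import functions as F

# Native column expressions run inside Spark; no per-row Python round trip.
cleaned_df = (
    raw_df
    .withColumn("trimmed", F.trim(F.col("raw_text")))
    .withColumn("parts", F.split(F.col("raw_text"), ","))
)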

Design

Our design is going to use the following:

  1. Low-level, Spark-optimized file parsing done via the Databricks XML Parser
  2. Test-driven raw file parsing, as explained here

Wire the Parser

First, we need to add the .jar to the spark_session available inside Transforms. Thanks to recent improvements, this argument, when configured, will let you use the .jar both in Preview/Test and at full build time. Previously, this would have required a full build, but not anymore.

We need to go to our transforms-python/build.gradle file and add two blocks of config:

  1. Enable the pytest plugin
  2. Enable the condaJars argument and declare the .jar dependency

My /transforms-python/build.gradle now looks like the following:

buildscript {
    repositories {
       // some other things
    }

    dependencies {
        classpath "com.palantir.transforms.python:lang-python-gradle-plugin:${transformsLangPythonPluginVersion}"
    }
}

apply plugin: 'com.palantir.transforms.lang.python'
apply plugin: 'com.palantir.transforms.lang.python-defaults'

dependencies {
    condaJars "com.databricks:spark-xml_2.13:0.14.0"
}

// Apply the testing plugin
apply plugin: 'com.palantir.transforms.lang.pytest-defaults'

// ... some other awesome features you should enable

After applying this config, you'll want to restart your Code Assist session by clicking on the bottom ribbon and hitting Refresh.


After refreshing Code Assist, we now have the low-level functionality available to parse our .xml files; now we need to test it!

Testing the Parser

If we adopt the same style of test-driven development as here, we end up with /transforms-python/src/myproject/datasets/xml_parse_transform.py with the following contents:

from transforms.api import transform, Output, Input
from transforms.verbs.dataframes import union_many


def read_files(spark_session, paths):
    # Read each .xml file with the spark-xml data source and union the
    # per-file DataFrames into a single output DataFrame.
    parsed_dfs = []
    for file_name in paths:
        parsed_df = spark_session.read.format('xml').options(rowTag="tag").load(file_name)
        parsed_dfs += [parsed_df]
    output_df = union_many(*parsed_dfs, how="wide")
    return output_df


@transform(
    the_output=Output("my.awesome.output"),
    the_input=Input("my.awesome.input"),
)
def my_compute_function(the_input, the_output, ctx):
    session = ctx.spark_session
    input_filesystem = the_input.filesystem()
    hadoop_path = input_filesystem.hadoop_path
    # Build fully qualified paths for every .xml file in the input dataset
    files = [hadoop_path + "/" + file_name.path for file_name in input_filesystem.ls("**/*.xml")]
    output_df = read_files(session, files)
    the_output.write_dataframe(output_df)

... an example file /transforms-python/test/myproject/datasets/sample.xml with contents:

<tag>
<field1>
my_value
</field1>
</tag>

And a test file /transforms-python/test/myproject/datasets/test_xml_parse_transform.py:

from myproject.datasets import xml_parse_transform
from pkg_resources import resource_filename


def test_parse_xml(spark_session):
    file_path = resource_filename(__name__, "sample.xml")
    parsed_df = xml_parse_transform.read_files(spark_session, [file_path])
    assert parsed_df.count() == 1
    assert set(parsed_df.columns) == {"field1"}

We now have:

  1. A distributed-compute, low-level .xml parser that is highly scalable
  2. A test-driven setup that we can quickly iterate on to get our exact functionality right

Cheers
