
Palantir Foundry code workbook, export individual XMLs from dataset

I have a dataset with an XML column, and I am trying to export each XML as its own file, with the filename taken from another column, using a code workbook.


I filtered the rows I want using the code below:

def prepare_input(xml_with_debug):
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    filter_column = "key"
    filter_value = "test_key"
    df_filtered = xml_with_debug.filter(F.col(filter_column) == filter_value)

    approx_number_of_rows = 1
    sample_percent = float(approx_number_of_rows) / df_filtered.count()

    df_sampled = df_filtered.sample(False, sample_percent, seed=0)

    important_columns = ["key", "xml"]

    # StringType lives in pyspark.sql.types, not pyspark.sql.functions
    return df_sampled.select([F.col(c).cast(StringType()).alias(c) for c in important_columns])

It works up to this point. For the last part I tried the following in a Python task, but it complained about the parameters (I must have set it up wrongly). And even if it worked, I think it would still produce a single file.

from transforms.api import transform, Input, Output

@transform(
    output=Output("/path/to/python_csv"),
    my_input=Input("/path/to/input")
)
def my_compute_function(output, my_input):
    output.write_dataframe(my_input.dataframe().coalesce(1), output_format="csv", options={"header": "true"})

I am trying to set it up in the GUI like below:


My question, I guess, is: what should the code be in the last Python task (write_file) after prepare_input, so that I extract the individual XMLs (and, if possible, zip them into a single file for download)?

You can access the output dataset's filesystem and write files into it in whatever format you want.
The documentation for that can be found here: https://www.palantir.com/docs/foundry/code-workbook/transforms-unstructured/#writing-files
(If you want to do it from a code repository it's very similar: https://www.palantir.com/docs/foundry/transforms-python/unstructured-files/#writing-files )

By doing that you can create multiple different files, or you can create a single zip file and write it into a dataset.
