
In Palantir Foundry, how do I parse a very large csv file without OOMing the driver or executor?

Similar to "How do I parse large compressed csv files in Foundry?", but without the file being compressed: I have a system-generated (>10 GB) csv file that needs to be parsed as a Foundry dataset.

A dataset this size normally causes the driver to OOM, so how can I parse this file?

Using the filesystem, you can read the file and yield rows one at a time, splitting each line on the separator (a comma in this case).

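from pyspark.sql import Row

# raw_dataset is the transform input that holds the raw csv file
fs = raw_dataset.filesystem()

def process_file(fl):
    # fl is one row of fs.files(); open that file (data_pull.csv here) as text
    with fs.open(fl.path, "r") as f:
        # The first line is the header; use it to build a Row class
        header = [x.strip() for x in f.readline().split(",")]
        Log = Row(*header)
        # Stream the remaining lines so the whole file is never held in memory
        for line in f:
            yield Log(*line.rstrip("\r\n").split(","))

# One record per file in the dataset; flatMap parses each file lazily
rdd = fs.files().rdd
rdd = rdd.flatMap(process_file)
df = rdd.toDF()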
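
One caveat with splitting on "," is that a quoted field containing a comma gets cut into two columns. If the file may contain such fields, the row-wise step could be sketched with Python's standard csv module instead. This is only an illustration: the function name iter_csv_records is hypothetical, and it assumes the file has a header line and no quoted fields that span multiple lines.

```python
import csv
import io


def iter_csv_records(lines):
    """Lazily yield one dict per data row, keyed by the stripped header names.

    `lines` is any iterable of text lines, such as an open file handle.
    Nothing beyond the current line is held in memory.
    """
    reader = csv.reader(lines)
    header = next(reader, None)
    if header is None:
        return  # empty input: nothing to yield
    header = [name.strip() for name in header]
    for record in reader:
        yield dict(zip(header, record))


# Small in-memory stand-in for the real file (hypothetical sample data)
sample = io.StringIO('id, name ,note\n1,Ada,"hello, world"\n2,Bob,plain\n')
for row in iter_csv_records(sample):
    print(row)
```

Inside process_file, the handle returned by fs.open could be passed to a generator like this in place of the manual split, with each record then turned into a Row before it is yielded.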

