
In Palantir Foundry, how do I parse a very large csv file without OOMing the driver or executor?

Similar to How do I parse large compressed csv files in Foundry?, but here the file is uncompressed: a system-generated (>10GB) csv file that needs to be parsed into a Foundry Dataset.

A dataset this size normally causes the driver to OOM, so how can I parse this file?

Using the filesystem, you can read the file line by line and yield one row at a time, splitting each line on the separator ("," in this case). Because the rows are yielded from a generator rather than collected into a list, only one line is held in memory at a time, so neither the driver nor the executors need to hold the whole file.

from pyspark.sql import Row

df = raw_dataset
fs = df.filesystem()

def process_file(file_status):
    # Stream the file line by line; only one row is in memory at a time.
    with fs.open("data_pull.csv", "r") as f:
        # First line is the header; use it to build a Row factory.
        header = [x.strip() for x in f.readline().split(",")]
        Log = Row(*header)
        for line in f:
            # Strip the trailing newline before splitting into fields.
            yield Log(*line.rstrip("\n").split(","))

rdd = fs.files().rdd
rdd = rdd.flatMap(process_file)
df = rdd.toDF()
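Outside of Foundry's filesystem API, the same row-wise streaming idea can be sketched in plain Python with the standard csv module. The file handle, sample data, and column names below are hypothetical stand-ins for the >10GB file; the point is that the generator yields one parsed row at a time instead of materializing the whole file:

```python
import csv
import io
from collections import namedtuple

def stream_rows(fileobj):
    """Yield one parsed row at a time so the full file never sits in memory."""
    reader = csv.reader(fileobj)
    # First record is the header; build a lightweight row type from it.
    header = [h.strip() for h in next(reader)]
    RowType = namedtuple("RowType", header)
    for record in reader:
        yield RowType(*record)

# Hypothetical in-memory sample standing in for the large file on disk.
sample = io.StringIO("id,name,value\n1,alpha,10\n2,beta,20\n")
rows = list(stream_rows(sample))
print(rows[0].name)  # prints: alpha
```

The csv module also handles quoted fields containing commas, which a bare line.split(",") (as in the Spark snippet above) does not, so it is the safer choice when the data may contain embedded separators.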
