I've looked into my job and have identified that I do indeed have a skewed task. How do I determine what the actual value is inside this task that is ...
I get the following error when my Spark job fails: **"org.apache.spark.shuffle.FetchFailedException: The relative remote executor(Id: 21), which maint ...
I'm having serious difficulty understanding why I cannot run a transform that, after many minutes (sometimes hours) of waiting, returns the error "S ...
I see in my repository it's warning me about using union and instead I should use unionByName. Aren't these the same thing? Why would I care which one ...
I'm working on exporting data from Foundry datasets in parquet format using various Magritte export tasks to an ABFS system (but the same issue occurs ...
I have some PySpark code I'm writing where I want to execute joins and other operations, but I want to log when this phase is successfully completed. ...
I'm noticing my code repo is warning me that using withColumn in a for/while loop is an antipattern. Why is this not recommended? Isn't this a normal ...
I've read the docs in Foundry for what the differences are between the two, but I'm wondering in what circumstances I would want to apply the STATIC_A ...
I have a dataset I want to repartition evenly into 10 buckets per unique value of a column, and I want to size this result into a large number of part ...
I want to run df.count() on my DataFrame, but I know my total dataset size is pretty large. Does this run the risk of materializing the data back to t ...
My Foundry transform produces a different amount of data on different runs, but I want a similar number of rows in each file. I can use DataFr ...
I have a set of .xml documents that I want to parse. I have previously tried to parse them using methods that take the file contents and dump them in ...
I want to parse a series of .csv files using spark.read.csv, but I want to include the row number of each line inside the file. I know that Spark typ ...
I'd like to test different inputs to a PySpark regex to see if they fail/succeed before running a build. Is there a way to test this in Foundry before ...
I want to Hive-partition my dataset, but I don't quite know how to ensure the file counts in the splits are sane. I know I should roughly aim for file ...
I want to take an arbitrary set of schemas and combine them into a single dataset that can be unpivoted later. What is the most stable way to do this? ...
I have a data connection source that creates two datasets: Dataset X (Snapshot) Dataset Y (Incremental) The two datasets pull from the same s ...
I have a pipeline setup in my Foundry instance that is using incremental computation but for some reason isn't doing what I expect. Namely, I want to ...
I have a large gzipped csv file (.csv.gz) uploaded to a dataset that's about 14GB in size and 40GB when uncompressed. Is there a way to decompress, rea ...
I notice when I run the same code as my example over here but with a union or unionByName or unionAll instead of the join, my query planning takes sig ...