I wanted to collect samples based on age, with a condition on the Failure status. I am interested in serial numbers that are 3 days old. However, I don't need hea ...
I have a large data frame, consisting of 400+ columns and 14,000+ records, that I need to clean. I have written Python code to do this, but due to th ...
I am having trouble creating a Pandas UDF that performs a calculation on a pd.Series based on a value in the same row of the underlying Spark Dat ...
Let's say I have a PySpark DataFrame:

| Column A | Column B |
| -------- | -------- |
| val1     | val1B    |
| null     | val2B    |
| val2     | null     |

...
In pyspark, I'm trying to replace multiple text values in a column with the values present in the columns whose names appear in the calc co ...
I'm working with a PySpark Pandas DataFrame that looks similar to this: The total dataset is quite a bit larger (approx. 55 million rows), so this exa ...
I am currently working on a project in Databricks with approximately 6 GiB of data in a single table, so you can imagine that the run-time on a tabl ...
I have two files: one is file1.csv and the other is file2.csv. I have put file1's data in one dataframe, and when the second file, file2.csv, arrives, then ...
I want the difference between two date columns in the number of days. In a pandas DataFrame, the difference between two "datetime64" columns returns the number ...
I'm trying to read a table on Databricks into a DataFrame using pyspark.pandas.read_table and receive the following error: The table was created ...
I'm trying to translate the pandas code below to PySpark, but I'm having trouble with these two points: But is there an index in a Spark DataFrame? H ...
Any idea how to write this in PySpark? I have two PySpark DataFrames that I'm trying to union. However, there is one value that I want to update based ...
I want to use pandas_udf in PySpark for certain transformations and calculations on a column. And it seems that a pandas UDF can't be written exactly as n ...
I would like to run a UDF on a Pandas on Spark dataframe. I thought it would be easy, but I'm having a tough time figuring it out. For example, consider my psd ...
I originally used the code below to work with a standard pandas df, and switched to a pyspark pandas df once the data grew. I've been unable to make this groupb ...
Context: I am using pyspark.pandas in a Databricks Jupyter notebook. What I have tested: I do not get any error if: I run my code on 300 rows of ...
Hi, I am trying to iterate over a pyspark data frame without using spark_df.collect(), and I am trying the foreach and map methods. Is there any other way to it ...
This is what I wrote, but I actually want the function to take this list, convert every df into a pandas df, then convert it to csv and save i ...
I want to create a row based on a column. For example, I have the following data frame. I want to convert it to the following, where the altern ...
Let's say that these are my data: The problem is that sometimes there is more than one Product_Number when it should be unique. What I am trying ...