Replace value with value from column above in Pyspark

Question

I have a databasa that comes from an Excel spreadsheet that was formatted in such a way that, when converted to CSV, the country column and the year column didn't quite align. Like so:

+----+----+
|   C|   Y|
+----+----+
|  BR|1995|
|null|1997|
|null|1999|
|null|2001|
|null|2003|
|null|2005|
|  NO|1995|
|null|1997|
|null|1999|
|null|2001|
|null|2003|
|null|2005|
|  RO|1995|
|null|1997|
|null|1999|
|null|2001|
|null|2003|
|null|2005|
|  AC|1995|
|null|1997|
+----+----+

In theory, the solution is simple, I should just verify if a value in the C column is null, and if it is, replace it with the value from the column above. My question is, how do I do that?

Answer 1

You should be able to do this with Pandas. Here's an example:

import pandas as pd

df = pd.read_csv('sample.csv')

for i in range(len(df)):
    if (pd.isnull(df.loc[i, 'C'])):
        df.loc[i, 'C'] = df.loc[i-1, 'C']

df.to_csv('new_sample.csv', index=False)

Replace 'sample.csv' with your actual csv file path.

What this is doing is iterating through a dataframe of your csv file and checking if the row within the C column is null or not. If it is, then it will set the value to the previous row's value. Since there is an initial row with the country's name that precedes the first instance of a null value, this should effectively fill the null rows with the correct country name.

Replace value with value from column above in Pyspark

Question

1 answers

solution1
0 ACCPTED 2022-07-13 00:36:47

Replace value with value from column above in Pyspark

Question

1 answers

solution1 0 ACCPTED 2022-07-13 00:36:47

solution1
0 ACCPTED 2022-07-13 00:36:47