简体   繁体   中英

Replace value with value from column above in Pyspark

I have a databasa that comes from an Excel spreadsheet that was formatted in such a way that, when converted to CSV, the country column and the year column didn't quite align. Like so:

+----+----+
|   C|   Y|
+----+----+
|  BR|1995|
|null|1997|
|null|1999|
|null|2001|
|null|2003|
|null|2005|
|  NO|1995|
|null|1997|
|null|1999|
|null|2001|
|null|2003|
|null|2005|
|  RO|1995|
|null|1997|
|null|1999|
|null|2001|
|null|2003|
|null|2005|
|  AC|1995|
|null|1997|
+----+----+

In theory, the solution is simple, I should just verify if a value in the C column is null, and if it is, replace it with the value from the column above. My question is, how do I do that?

You should be able to do this with Pandas. Here's an example:

import pandas as pd

df = pd.read_csv('sample.csv')

for i in range(len(df)):
    if (pd.isnull(df.loc[i, 'C'])):
        df.loc[i, 'C'] = df.loc[i-1, 'C']

df.to_csv('new_sample.csv', index=False)

Replace 'sample.csv' with your actual csv file path.

What this is doing is iterating through a dataframe of your csv file and checking if the row within the C column is null or not. If it is, then it will set the value to the previous row's value. Since there is an initial row with the country's name that precedes the first instance of a null value, this should effectively fill the null rows with the correct country name.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM