
Perform pandas preprocessing operations on a Spark dataframe

I have a rather large CSV, so I am using AWS EMR to read the data into a Spark dataframe and perform some operations on it. I have a pandas function that does some simple preprocessing:

import numpy as np


def clean_census_data(df):
    """
    This function cleans the dataframe and drops columns that contain 70% NaN values
    """
    # Replace the string 'None' with np.nan
    df = df.replace('None', np.nan)
    # Replace the sentinel value used for missing data
    df = df.replace(-666666666.0, np.nan)

    # Keep only the columns with less than 70% NaN values
    df = df.loc[:, df.isnull().mean() < .7]

    return df

I want to apply this function to a Spark dataframe, but the APIs are not the same. I am not familiar with Spark, and while these operations are simple in pandas, it is not obvious to me how to perform them in Spark. I know I can convert the Spark dataframe to pandas, but that does not seem very efficient.
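What I am trying to avoid is something like the following (a sketch only; spark_df and spark stand for the Spark dataframe and session from my EMR job), since it pulls the whole dataset onto the driver:

pdf = spark_df.toPandas()                    # collects every row onto the driver
pdf = clean_census_data(pdf)                 # the pandas function above
spark_df_clean = spark.createDataFrame(pdf)  # back to a Spark dataframe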

First answer, so please be kind. This function should work with pyspark dataframes instead of pandas dataframes, and should give you similar results:

from pyspark.sql.functions import col, count, isnan, lit, when


def clean_census_data(df):
    """
    This function cleans the dataframe and drops columns where more than
    70% of the values are NaN, null or zero
    """
    # Replace the string 'None' with null
    df = df.replace('None', None)

    # Replace the sentinel value with null
    df = df.replace(-666666666.0, None)

    # Drop columns where more than 70% of the values are NaN, null or zero
    # (count(lit(1)) counts every row, so the ratio is taken over the whole column)
    selection_dict = df.select(
        [(count(when(isnan(c) | col(c).isNull() | (col(c).cast('int') == 0), c))
          / count(lit(1)) > .7).alias(c)
         for c in df.columns]
    ).first().asDict()
    columns_to_remove = [name for name, is_selected in selection_dict.items() if is_selected]
    df = df.drop(*columns_to_remove)

    return df

Attention: The resulting dataframe contains None instead of np.nan.
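For reference, calling it would look something like this (a sketch; the S3 path is only a placeholder for wherever your CSV lives):

df_raw = spark.read.csv('s3://your-bucket/census.csv', header=True, inferSchema=True)
df_clean = clean_census_data(df_raw)
df_clean.printSchema()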

Native Spark functions can do such an aggregation for every column.
The following dataframe contains, for each column, the fraction of values that are null, NaN or zero.

df2 = df1.select(
    [(F.count(F.when(F.isnan(c) | F.isnull(c) | (F.col(c) == 0), c))
     / F.count(F.lit(1))).alias(c) 
     for c in df1.columns]
)

For example:

from pyspark.sql import functions as F
df1 = spark.createDataFrame(
    [(1000, 0, None),
     (None, 2, None),
     (None, 3, 2222),
     (None, 4, 2233),
     (None, 5, 2244)],
    ['c1', 'c2', 'c3'])

df2 = df1.select(
    [(F.count(F.when(F.isnan(c) | F.isnull(c) | (F.col(c) == 0), c))
     / F.count(F.lit(1))).alias(c) 
     for c in df1.columns]
)
df2.show()
# +---+---+---+
# | c1| c2| c3|
# +---+---+---+
# |0.8|0.2|0.4|
# +---+---+---+

What remains is just selecting the columns from df1:

df = df1.select([c for c in df1.columns if df2.head()[c] < .7])
df.show()
# +---+----+
# | c2|  c3|
# +---+----+
# |  0|null|
# |  2|null|
# |  3|2222|
# |  4|2233|
# |  5|2244|
# +---+----+
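One small refinement: df2.head() above is evaluated once for every column inside the list comprehension. Collecting the row of fractions a single time and reusing it avoids the repeated jobs (a sketch using the same df1 and df2):

ratios = df2.head().asDict()
df = df1.select([c for c, r in ratios.items() if r < .7])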

The fraction is calculated based on the following condition; change it according to your needs:
F.isnan(c) | F.isnull(c) | (F.col(c) == 0)
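For instance, to also treat the sentinel value -666666666.0 from the question as missing, and to stop counting zeros, the same aggregation could be written as (a sketch, not tested against your data):

df2 = df1.select(
    [(F.count(F.when(F.isnan(c) | F.isnull(c) | (F.col(c) == -666666666.0), c))
     / F.count(F.lit(1))).alias(c)
     for c in df1.columns]
)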

This would replace nulls (None) with np.nan:
df.fillna(np.nan)

This would replace a specified value with np.nan:
df.replace(-666666666, np.nan)
