
Handling Parquet Files in Python

I am trying to handle Parquet tables from Hive in Python and am facing some data type issues. For example, if I have a field in my Hive Parquet table declared as decimal(10,2), I get a junk value when I try to read the file in Python. Please give some input on this.
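To make the problem concrete, here is one way to inspect what a Python reader actually sees (a minimal sketch using pyarrow; the file path is hypothetical and stands in for one of the Hive table's underlying Parquet files):

import pyarrow.parquet as pq

# Hypothetical path standing in for one of the Hive table's Parquet files
table = pq.read_table('/path/to/table/part-00000.parquet')

# A decimal(10,2) column should appear as decimal128(10, 2) in the schema;
# if the reader reports it as raw fixed-length bytes instead, that would
# explain the junk values
print(table.schema)

# pyarrow converts decimal columns to Python Decimal objects in pandas
df = table.to_pandas()
print(df.dtypes)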

I thought this might help a bit, although it isn't a proper answer. I have this method in my PySpark job that runs before I write to Parquet; it converts decimals to floats so they read OK in pandas DataFrames. In this case I am shrinking the types, but you get the idea:

from pyspark.sql import functions as F

def shrink_types(df):
    """Reduce data size by shrinking the types"""

    # Loop through the (column name, type) tuples and downcast each
    # double or decimal column to float
    for column_name, column_type in df.dtypes:
        if column_type == 'double' or 'decimal' in column_type:
            df = df.withColumn(
                column_name,
                F.col(column_name).cast('float')
            )

    return df

Then I call it via:

equities_df = shrink_types(equities_df)

# Save the DataFrame so the transformations actually execute (Spark is lazy)
equities_df.write.mode('overwrite').parquet(
    path='s3://bucket/path/dataset.parquet',
)
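As a quick sanity check, the written dataset can be read back on the Python side to confirm the former decimal columns now arrive as plain floats (a sketch assuming pandas with a Parquet engine such as pyarrow, plus s3fs for the S3 path):

import pandas as pd

# Read the dataset the Spark job wrote; columns cast to 'float' in Spark
# should now show up as float32 instead of Decimal objects
check_df = pd.read_parquet('s3://bucket/path/dataset.parquet')
print(check_df.dtypes)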
