
Handling Parquet Files in Python

I am trying to handle Parquet tables from Hive in Python and am facing some data type issues. For example, if I have a field in my Hive Parquet table declared as decimal(10,2), I get a junk value when I try to read the file in Python. Please give some input on this.
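To make the problem concrete, here is one way to inspect what a Python reader actually sees (a minimal sketch using pyarrow; the file path is hypothetical and stands in for one of the Hive table's underlying Parquet files):

import pyarrow.parquet as pq

# Hypothetical path standing in for one of the Hive table's Parquet files
table = pq.read_table('/path/to/table/part-00000.parquet')

# A decimal(10,2) column should appear as decimal128(10, 2) in the schema;
# if the reader reports it as raw fixed-length bytes instead, that would
# explain the junk values
print(table.schema)

# pyarrow converts decimal columns to Python Decimal objects in pandas
df = table.to_pandas()
print(df.dtypes)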

I thought this might help a bit, although it isn't a proper answer. I have this method in my PySpark job that runs before I write to Parquet; it converts decimals to floats so they read OK in pandas DataFrames. In this case I am shrinking the types, but you get the idea:

from pyspark.sql import functions as F

def shrink_types(df):
    """Reduce data size by shrinking the types"""

    # Loop through the (column name, type) tuples and downcast each
    # double or decimal column to float
    for column_name, column_type in df.dtypes:
        if column_type == 'double' or 'decimal' in column_type:
            df = df.withColumn(
                column_name,
                F.col(column_name).cast('float')
            )

    return df

Then I call it via:

equities_df = shrink_types(equities_df)

# Save the DataFrame so the transformations actually execute (Spark is lazy)
equities_df.write.mode('overwrite').parquet(
    path='s3://bucket/path/dataset.parquet',
)
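As a quick sanity check, the written dataset can be read back on the Python side to confirm the former decimal columns now arrive as plain floats (a sketch assuming pandas with a Parquet engine such as pyarrow, plus s3fs for the S3 path):

import pandas as pd

# Read the dataset the Spark job wrote; columns cast to 'float' in Spark
# should now show up as float32 instead of Decimal objects
check_df = pd.read_parquet('s3://bucket/path/dataset.parquet')
print(check_df.dtypes)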
