I am trying to read Hive Parquet tables in Python and am running into data type issues. For example, if a field in my Hive Parquet table is declared as decimal(10,2), it gives a junk value when I try to read the file in Python. Please give some input on this.
I thought this might help a bit, although it isn't a complete answer. I use this method in my PySpark job before writing to Parquet to convert decimal columns to floats so that they read correctly into pandas DataFrames. In this case I am shrinking the types, but you get the idea:
from pyspark.sql import functions as F

def shrink_types(df):
    """Reduce data size by casting double/decimal columns to float"""
    # Loop through the (column name, type string) tuples and downcast
    for t in df.dtypes:
        column_name = t[0]
        column_type = t[1]
        if column_type == 'double' or 'decimal' in column_type:
            df = df.withColumn(
                column_name,
                F.col(column_name).cast('float')
            )
    return df
Then I call it via:
equities_df = shrink_types(equities_df)
# Save and restore so it actually runs
equities_df.write.mode('overwrite').parquet(
path='s3://bucket/path/dataset.parquet',
)