I have a large parquet file where the data in one of the columns is sorted. A very simplified example is below. I am interested in querying the las ...
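For a file sorted on one column, the row-group min/max statistics in the footer let you read only the trailing groups instead of the whole file. A minimal pyarrow sketch, assuming a hypothetical sorted.parquet whose first column is the sorted one and a placeholder cutoff; it also assumes the writer recorded statistics (most do by default):

    import pyarrow.parquet as pq

    pf = pq.ParquetFile("sorted.parquet")  # hypothetical path

    # The column is sorted, so group-level max statistics identify which
    # trailing row groups can contain values past the cutoff; skip the rest.
    cutoff = 1000  # hypothetical threshold on the sorted column
    keep = [
        i for i in range(pf.num_row_groups)
        if pf.metadata.row_group(i).column(0).statistics.max >= cutoff
    ]
    tables = [pf.read_row_group(i) for i in keep]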
I'm creating a table in Athena and specifying the format as PARQUET; however, the file extension is not being recognized in S3. The type is displayed as ...
I got this issue when I ingested/wrote data to a FeatureSet (part of the MLRun FeatureStore) and then read the data via PySpark (it seems to be invalid parq ...
I am using the Apache Arrow golang library to read parquet. Non-repeated columns seem straightforward, but how can I read a repeated field? ...
Say I have two datasets stored as parquets that I want to combine. I can read them in, rbind them, then spit them back out into a parquet, like so: # ...
When I am working with CSV, I can provide custom schema while reading a file, and the benefits I receive are as follows (along with the contrast with ...
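For the Spark case, a short PySpark sketch of supplying a schema to the CSV reader, which skips the inference pass and enforces types up front; file names and columns are placeholders. Parquet, by contrast, carries its schema in the file footer, so no schema argument is needed:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    spark = SparkSession.builder.getOrCreate()

    # Explicit schema: no inference scan, and bad types fail fast.
    schema = StructType([
        StructField("id", IntegerType(), True),
        StructField("name", StringType(), True),
    ])
    df_csv = spark.read.schema(schema).csv("data.csv", header=True)

    # Parquet is self-describing; the schema travels with the file.
    df_parquet = spark.read.parquet("data.parquet")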
Screenshot from source and destination. While we are writing into a parquet file using Spark/Scala, DST (daylight saving time) times are auto-converting ...
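Spark normalizes timestamps to the session time zone when writing parquet, which is a common source of DST shifts between source and destination. One common mitigation, sketched here in PySpark, is pinning the session time zone to UTC; whether this fits depends on how the timestamps were produced upstream:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Pin the session time zone so writes are not shifted by local DST rules.
    spark.conf.set("spark.sql.session.timeZone", "UTC")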
I am facing a problem in Azure Databricks. In my notebook I am executing a simple write command with partitioning: And I see something like th ...
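For reference, a minimal partitioned write in PySpark; the column name and output path are placeholders. Each distinct value of the partition column becomes its own subdirectory, which is likely what the notebook output is showing:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("2024-01-01", 1), ("2024-01-02", 2)], ["date", "value"]
    )

    # One subdirectory (date=.../) per distinct value of the partition column.
    df.write.mode("overwrite").partitionBy("date").parquet("/tmp/out")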
Given a ParquetFile object (docs) I am able to retrieve data at row group / column chunk level either with read_row_group or with the metadata attribu ...
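Since the question mentions read_row_group and the metadata attribute, here is a small pyarrow sketch of working at that granularity: reading one column of one row group at a time and inspecting the matching column-chunk metadata. The file and column names are hypothetical:

    import pyarrow.parquet as pq

    pf = pq.ParquetFile("example.parquet")  # hypothetical path

    for i in range(pf.num_row_groups):
        # Read just one column of one row group, not the whole file.
        chunk = pf.read_row_group(i, columns=["value"])
        meta = pf.metadata.row_group(i).column(0)
        print(i, chunk.num_rows, meta.total_compressed_size)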
I have a df something like below:
Filename  col1  col2
file1     1     1
file1     ...
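Assuming the goal is one output file per distinct Filename value (the question is truncated, so that is a guess), a plain pandas sketch:

    import pandas as pd

    df = pd.DataFrame({
        "Filename": ["file1", "file1", "file2"],
        "col1": [1, 2, 3],
        "col2": [1, 2, 3],
    })

    # One parquet file per distinct Filename value; dropping the grouping
    # column is optional and just avoids storing it redundantly.
    for name, group in df.groupby("Filename"):
        group.drop(columns="Filename").to_parquet(f"{name}.parquet", index=False)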
When I use pd.read_parquet to read a parquet file, this error is displayed. My code: Error: I want to convert this file to csv: https://d37ci ...
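The basic parquet-to-CSV conversion is two calls; pinning the engine explicitly (pyarrow or fastparquet are the two pandas backends) often helps narrow down where a read error comes from. The file names here are placeholders:

    import pandas as pd

    # Selecting the engine explicitly makes read failures easier to attribute.
    df = pd.read_parquet("trips.parquet", engine="pyarrow")
    df.to_csv("trips.csv", index=False)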
I have the following code: The output is: Just curious, why did the Pandas dataframe ignore the __null_dask_index__ column name? Or is __null_dask_index ...
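For context: when Dask writes a frame with an unnamed index, it stores the index under the name __null_dask_index__ and records that fact in the pandas metadata in the parquet footer; pandas reads that metadata and restores the column as the index rather than surfacing it as a data column. A small sketch of that round trip:

    import pandas as pd
    import dask.dataframe as dd

    pdf = pd.DataFrame({"a": [1, 2, 3]})
    dd.from_pandas(pdf, npartitions=1).to_parquet("out_parquet")

    # pandas maps __null_dask_index__ back onto the index via the
    # footer metadata, so it never appears among the columns.
    print(pd.read_parquet("out_parquet").columns)  # Index(['a'], ...)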
When I load my parquet file into a Polars DataFrame, it takes about 5.5 GB of RAM. Polars is great compared to other options I have tried. However, Po ...
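One way to cut peak RAM is to avoid eagerly materializing the whole file: scan lazily, push filters down, and collect with the streaming engine so data is processed in batches. A sketch with a hypothetical path and predicate; note the flag spelling has shifted across Polars releases:

    import polars as pl

    result = (
        pl.scan_parquet("big.parquet")      # hypothetical path
          .filter(pl.col("value") > 0)      # hypothetical predicate
          .collect(streaming=True)          # engine="streaming" in newer Polars
    )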
Unable to resolve this after looking at a bunch of similar answers. Only the last line of the DataFrame is being written to my CSV. I need the whol ...
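A common cause of "only the last line survives" is calling to_csv inside a loop with the default mode="w", which overwrites the file on every iteration. A sketch assuming that is what is happening here; appending with a header only on the first write fixes it:

    import os
    import pandas as pd

    for i in range(3):
        row = pd.DataFrame({"i": [i]})
        # mode="a" appends instead of overwriting; emit the header only once.
        row.to_csv(
            "out.csv",
            mode="a",
            header=not os.path.exists("out.csv"),
            index=False,
        )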
I'm creating an external table in Azure Databricks on top of the ADLS parquet files using the syntax below. create table if not exists <table_name> ...
I've been working on a project where I store IoT data in an S3 bucket and batch it using AWS Kinesis Firehose; I have a Lambda functio ...
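A common shape for such a Lambda is to read the Firehose-batched objects back from S3 and re-emit them as parquet. A sketch using awswrangler and the standard S3-trigger event layout; the bucket layout, JSON-lines format, and output prefix are all assumptions:

    import awswrangler as wr

    def handler(event, context):
        # Standard S3 put-event shape; adjust if Firehose delivers differently.
        bucket = event["Records"][0]["s3"]["bucket"]["name"]
        key = event["Records"][0]["s3"]["object"]["key"]

        # Assumes Firehose wrote newline-delimited JSON records.
        df = wr.s3.read_json(f"s3://{bucket}/{key}", lines=True)
        wr.s3.to_parquet(df, path=f"s3://{bucket}/parquet/", dataset=True)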
While researching writing files to Parquet in Java, I came across org.apache.parquet.hadoop.ParquetWriter and org.apache.parquet.avro.AvroParque ...
Looking for something like this: Save Dataframe to csv directly to s3 Python. The API shows these arguments: https://pola-rs.github.io/polars/py-pola ...
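Since write_parquet accepts a file-like object, one route is to hand it an s3fs handle so the bytes stream straight to S3; newer Polars releases can often take an s3:// path directly as well. The bucket and key below are placeholders:

    import polars as pl
    import s3fs

    df = pl.DataFrame({"a": [1, 2, 3]})

    # write_parquet accepts any binary file-like object, so an fsspec/s3fs
    # handle streams the parquet bytes directly to S3.
    fs = s3fs.S3FileSystem()
    with fs.open("s3://my-bucket/out.parquet", "wb") as f:
        df.write_parquet(f)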
I have multiple part-folders, each containing parquet files (example given below). Now across the part-folders the schema can be different (either the num of c ...
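When only the column sets differ (not incompatible types), Spark's mergeSchema option unions the schemas across files, filling missing columns with nulls. A PySpark sketch with a placeholder path; type conflicts between parts would still need explicit casting:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # mergeSchema unions the column sets across all matched files; columns
    # absent from a given part come back as null for its rows.
    df = spark.read.option("mergeSchema", "true").parquet("/data/part-*")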