I'm new to Dask and I'm trying to append to a Parquet file, but my code consistently overwrites the contents of the file. Any ideas what I'm doing wrong here?
print("Write dataframe 1...")
df = pd.DataFrame({'DeptId': [1, 2, 3], 'DName': ['Accounting', 'Sales', 'Finance'], 'DeptNo': [100, 200, 300]})
df.set_index(['DeptId'], inplace=True)
ddf = dd.from_pandas(df, chunksize=1000)
print(ddf.head(3))
file_name = 'C:/Temp/xxx'
ddf.to_parquet(path=file_name, engine="pyarrow")
print("\nAppend dataframe 2...")
df2 = pd.DataFrame({'DeptId': [4, 5, 6], 'DName': ['Engineering', 'Support', 'Consulting'],
                    'DeptNo': [400, 500, 600]})
df2.set_index(['DeptId'], inplace=True)
ddf2 = dd.from_pandas(df2, chunksize=1000)
print(ddf2.head(3))
ddf2.to_parquet(path=file_name, engine="pyarrow", ignore_divisions=True, append=True, overwrite=False)
print("\nResulting parquet file...")
ddf3 = dd.read_parquet(path=file_name, engine="pyarrow")
print(ddf3.head())
The output is as follows...
DName DeptNo
DeptId
1 Accounting 100
2 Sales 200
3 Finance 300
DName DeptNo
DeptId
4 Engineering 400
5 Support 500
6 Consulting 600
DName DeptNo
DeptId
4 Engineering 400
5 Support 500
6 Consulting 600
python 3.8.8
dask 2020.3.1
pandas 1.2.3
pyarrow 3.0.0
Regards
MarkR
What happens is that each appended file is loaded as a separate partition, so when you run .head() it only looks for values in the first partition by default. In your case you want to see all the observations, so try one of the options below:
print(ddf3.head(npartitions=2)) # note this will show only first 5 rows
# or
print(ddf3.head(6, npartitions=2)) # this will show first 6 rows (all of the sample data)
# or
print(ddf3.compute()) # another way to see all of the sample data
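As a quick sanity check that the append actually succeeded (a minimal sketch, reusing the 'C:/Temp/xxx' path from the question), you can also inspect the partition count after reading the dataset back:

import dask.dataframe as dd

# Each to_parquet() call above wrote its rows as a separate partition,
# so the reloaded dataset should report two partitions.
ddf3 = dd.read_parquet(path='C:/Temp/xxx', engine="pyarrow")
print(ddf3.npartitions)  # expect 2: one partition per write
print(ddf3.compute())    # materialize all partitions into one pandas DataFrame

If npartitions is 2 and compute() shows all six rows, the data was appended, not overwritten.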