I'm new to Dask and I'm trying to append to a Parquet file, but my code consistently overwrites the contents of the file. Any ideas what I'm doing wrong here?
print("Write dataframe 1...")
df = pd.DataFrame({'DeptId': [1, 2, 3], 'DName': ['Accounting', 'Sales', 'Finance'], 'DeptNo': [100, 200, 300]})
df.set_index(['DeptId'], inplace=True)
ddf = dd.from_pandas(df, chunksize=1000)
print(ddf.head(3))
file_name = 'C:/Temp/xxx'
ddf.to_parquet(path=file_name, engine="pyarrow")
print("\nAppend dataframe 2...")
df2 = pd.DataFrame({'DeptId': [4, 5, 6], 'DName': ['Engineering', 'Support', 'Consulting'],
                    'DeptNo': [400, 500, 600]})
df2.set_index(['DeptId'], inplace=True)
ddf2 = dd.from_pandas(df2, chunksize=1000)
print(ddf2.head(3))
ddf2.to_parquet(path=file_name, engine="pyarrow", ignore_divisions=True, append=True, overwrite=False)
print("\nResulting parquet file...")
ddf3 = dd.read_parquet(path=file_name, engine="pyarrow")
print(ddf3.head())
The output is as follows...
DName DeptNo
DeptId
1 Accounting 100
2 Sales 200
3 Finance 300
DName DeptNo
DeptId
4 Engineering 400
5 Support 500
6 Consulting 600
DName DeptNo
DeptId
4 Engineering 400
5 Support 500
6 Consulting 600
python 3.8.8
dask 2020.3.1
pandas 1.2.3
pyarrow 3.0.0
Regards
MarkR
What happens is that each appended file is loaded as a separate partition, so when you run .head() it only looks for values in the first partition by default. In your case you want to see all the observations, so try one of the options below:
print(ddf3.head(npartitions=2)) # note this will show only first 5 rows
# or
print(ddf3.head(6, npartitions=2)) # this will show first 6 rows (all of the sample data)
# or
print(ddf3.compute()) # another way to see all of the sample data
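As a quick sanity check that the append actually succeeded (a minimal sketch, reusing the 'C:/Temp/xxx' path from the question), you can also inspect the partition count after reading the dataset back:

import dask.dataframe as dd

# Each to_parquet() call above wrote its rows as a separate partition,
# so the reloaded dataset should report two partitions.
ddf3 = dd.read_parquet(path='C:/Temp/xxx', engine="pyarrow")
print(ddf3.npartitions)  # expect 2: one partition per write
print(ddf3.compute())    # materialize all partitions into one pandas DataFrame

If npartitions is 2 and compute() shows all six rows, the data was appended, not overwritten.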