
Error Appending to Parquet File Using Dask Dataframe

I'm new to Dask and I'm trying to append to a Parquet file, but my code consistently overwrites the contents of the file instead.

Any ideas what I'm doing wrong here?

print("Write dataframe 1...")
df = pd.DataFrame({'DeptId': [1, 2, 3], 'DName': ['Accounting', 'Sales', 'Finance'], 'DeptNo': [100, 200, 300]})
df.set_index(['DeptId'], inplace=True)
ddf = dd.from_pandas(df, chunksize=1000)
print(ddf.head(3))
file_name = 'C:/Temp/xxx'
ddf.to_parquet(path=file_name, engine="pyarrow")

print("\nAppend dataframe 2...")
df2 = pd.DataFrame({'DeptId': [4, 5, 6], 'DName': ['Engineering', 'Support', 'Consulting'],
                    'DeptNo': [400, 500, 600]})
df2.set_index(['DeptId'], inplace=True)
ddf2 = dd.from_pandas(df2, chunksize=1000)
print(ddf2.head(3))
ddf2.to_parquet(path=file_name, engine="pyarrow", ignore_divisions=True, append=True, overwrite=False)

print("\nResulting parquet file...")
ddf3 = dd.read_parquet(path=file_name, engine="pyarrow")
print(ddf3.head()) 

The output is as follows...

Write dataframe 1...
             DName  DeptNo
DeptId                    
1       Accounting     100
2            Sales     200
3          Finance     300

Append dataframe 2...
              DName  DeptNo
DeptId                     
4       Engineering     400
5           Support     500
6        Consulting     600

Resulting parquet file...
              DName  DeptNo
DeptId                     
4       Engineering     400
5           Support     500
6        Consulting     600
I'm using these versions:
python  3.8.8
dask    2020.3.1
pandas  1.2.3
pyarrow 3.0.0

Regards

MarkR

What happens is that each parquet file is loaded as a separate partition, so when you run .head(), Dask looks for values in the first partition only. In your case, you want to see all the observations, so try one of the options below:

print(ddf3.head(npartitions=2))     # searches both partitions, but still shows only the first 5 rows

# or

print(ddf3.head(6, npartitions=2))  # shows the first 6 rows (all of the sample data)

# or

print(ddf3.compute())               # materializes the whole dataframe, another way to see all of the sample data
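
You can also confirm that the append really did add a second file by checking the partition count after reading the dataset back. A minimal sketch, assuming the same C:/Temp/xxx path as in the question:

import dask.dataframe as dd

# Read the appended dataset back; each parquet file becomes one partition.
ddf3 = dd.read_parquet(path='C:/Temp/xxx', engine="pyarrow")

print(ddf3.npartitions)  # 2 -- one partition per write, so the append worked

print(ddf3.compute())    # all six rows, drawn from both partitions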
