
How to read parquet files from remote HDFS in Python using Dask/pyarrow

How can I read parquet files from a remote HDFS (i.e., set up on a Linux server) using Dask or pyarrow in Python?

Also, please suggest better ways to do this if there are options other than these two.

I tried the following code:

from dask import dataframe as dd
df = dd.read_parquet('webhdfs://10.xxx.xx.xxx:xxxx/home/user/dir/sample.parquet',
    engine='pyarrow',
    storage_options={'host': '10.xxx.xx.xxx', 'port': xxxx, 'user': 'xxxxx'})
print(df)

The error is:

KeyError: "Collision between inferred and specified storage options:\n- 'host'\n- 'port'"

Looking at this post: https://github.com/dask/dask/issues/2757

Have you tried using 3 slashes?

df = dd.read_parquet('webhdfs:///10.xxx.xx.xxx:xxxx/home/user/dir/sample.parquet',
    engine='pyarrow',
    storage_options={'host': '10.xxx.xx.xxx', 'port': xxxx, 'user': 'xxxxx'})

You need to provide the host/port either in the URL or in the kwargs, not in both. The following should both work:

df = dd.read_parquet('webhdfs://10.xxx.xx.xxx:xxxx/home/user/dir/sample.parquet',
    engine='pyarrow', storage_options={'user': 'xxxxx'})

df = dd.read_parquet('webhdfs:///home/user/dir/sample.parquet',
    engine='pyarrow', storage_options={'host': '10.xxx.xx.xxx', 'port': xxxx, 'user': 'xxxxx'})
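
As for the "better ways" part of the question: you can also open the same WebHDFS connection directly through fsspec (the library Dask uses under the hood for webhdfs:// URLs) and hand the file object to pyarrow yourself. This is a minimal sketch, assuming fsspec's WebHDFS implementation is available and reusing the same redacted host/port/user placeholders as above (fill in your real values):

import fsspec
import pyarrow.parquet as pq

# Connect with the same placeholder details as in the examples above
fs = fsspec.filesystem('webhdfs',
                       host='10.xxx.xx.xxx',  # namenode host (placeholder)
                       port=xxxx,             # WebHDFS HTTP port (placeholder)
                       user='xxxxx')          # HDFS user name (placeholder)

# Sanity-check connectivity and the path before involving Dask
print(fs.ls('/home/user/dir'))

# Read the parquet file directly with pyarrow via the fsspec file object
with fs.open('/home/user/dir/sample.parquet', 'rb') as f:
    table = pq.read_table(f)
print(table.to_pandas().head())

If fs.ls() succeeds but dd.read_parquet still fails, the problem is in how the URL and storage_options are being combined, not in the connection to the cluster itself.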
