
Dask dataframe.read_csv does not work correctly with a CSV file on HDFS

I want to read CSV data from an HDFS server, but it throws an exception like the one below:

    hdfsSeek(desiredPos=64000000): FSDataInputStream#seek error:
    java.io.EOFException: Cannot seek after EOF
    at 
    org.apache.hadoop.hdfs.DFSInputStream.seek(DFSInputStream.java:1602)
    at 
    org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:65)

My Python code:

    from dask import dataframe as dd
    df = dd.read_csv('hdfs://SER/htmpa/a.csv').head(n=3)

CSV file:

    user_id,item_id,play_count
    0,0,500
    0,1,3
    0,3,1
    1,0,4
    1,3,1
    2,0,1
    2,1,1
    2,3,5
    3,0,1
    3,3,4
    4,1,1
    4,2,8
    4,3,4
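
For reference, the file's contents are well-formed CSV. A quick standard-library check on the same rows (no HDFS involved) parses them without trouble, which points at the HDFS seek rather than the data itself:

```python
import csv
import io

# The exact rows from the question, parsed with the stdlib csv module.
data = """user_id,item_id,play_count
0,0,500
0,1,3
0,3,1
1,0,4
1,3,1
2,0,1
2,1,1
2,3,5
3,0,1
3,3,4
4,1,1
4,2,8
4,3,4
"""

rows = list(csv.DictReader(io.StringIO(data)))
print(rows[:3])
# First three records, analogous to .head(n=3)
```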

Are you running within an IDE or a Jupyter notebook?
We are running on a Cloudera distribution and get a similar error. From what we understand, it is not connected to dask but rather to our Hadoop configuration.
In any case, we successfully use the pyarrow library when accessing HDFS. Be aware that if you need to access Parquet files, run with pyarrow version 0.12 and not 0.13; see the discussion on GitHub.
Update: pyarrow version 0.14 is out and should solve the problem.
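
As a sketch of the workaround, assuming a reachable HDFS namenode (the host "SER", path, and default port 8020 are taken from the question and are placeholders for your cluster): open the file with pyarrow's HDFS client and hand the stream to pandas, bypassing dask's own HDFS seek path. `pyarrow.hdfs.connect` is the legacy API that matches the 0.12–0.14 versions discussed above.

```python
def read_hdfs_csv(host, path, port=8020):
    """Read a CSV from HDFS via pyarrow and return a pandas DataFrame.

    Imports are kept inside the helper so the sketch is importable even
    where pyarrow is not installed; hoist them to module level in real code.
    """
    import pyarrow as pa
    import pandas as pd

    # Legacy HDFS client API, present in the pyarrow 0.12-0.14 range.
    fs = pa.hdfs.connect(host=host, port=port)
    with fs.open(path, 'rb') as f:
        return pd.read_csv(f)

# Usage against the cluster from the question (requires a live namenode):
# df = read_hdfs_csv('SER', '/htmpa/a.csv')
# print(df.head(n=3))
```

If you still want a dask dataframe afterwards, `dask.dataframe.from_pandas(df, npartitions=1)` will wrap the result.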

