
Dask dataframe.read_csv does not work correctly with a CSV file on HDFS

I want to read CSV data from an HDFS server, but it throws an exception like the one below:

    hdfsSeek(desiredPos=64000000): FSDataInputStream#seek error:
    java.io.EOFException: Cannot seek after EOF
    at 
    org.apache.hadoop.hdfs.DFSInputStream.seek(DFSInputStream.java:1602)
    at 
    org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:65)

My Python code:

    from dask import dataframe as dd
    df = dd.read_csv('hdfs://SER/htmpa/a.csv').head(n=3)

CSV file:

    user_id,item_id,play_count
    0,0,500
    0,1,3
    0,3,1
    1,0,4
    1,3,1
    2,0,1
    2,1,1
    2,3,5
    3,0,1
    3,3,4
    4,1,1
    4,2,8
    4,3,4
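
For reference, the file's contents are well-formed CSV. A quick standard-library check on the same rows (no HDFS involved) parses them without trouble, which points at the HDFS seek rather than the data itself:

```python
import csv
import io

# The exact rows from the question, parsed with the stdlib csv module.
data = """user_id,item_id,play_count
0,0,500
0,1,3
0,3,1
1,0,4
1,3,1
2,0,1
2,1,1
2,3,5
3,0,1
3,3,4
4,1,1
4,2,8
4,3,4
"""

rows = list(csv.DictReader(io.StringIO(data)))
print(rows[:3])
# First three records, analogous to .head(n=3)
```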

Are you running within an IDE or a Jupyter notebook?
We are running on a Cloudera distribution and get a similar error. From what we understand, it is not connected to dask but rather to our Hadoop configuration.
In any case, we successfully use the pyarrow library when accessing HDFS. Be aware that if you need to access Parquet files, run with pyarrow version 0.12 and not 0.13; see the discussion on GitHub.
Update: pyarrow version 0.14 is out and should solve the problem.
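
As a sketch of the workaround, assuming a reachable HDFS namenode (the host "SER", path, and default port 8020 are taken from the question and are placeholders for your cluster): open the file with pyarrow's HDFS client and hand the stream to pandas, bypassing dask's own HDFS seek path. `pyarrow.hdfs.connect` is the legacy API that matches the 0.12–0.14 versions discussed above.

```python
def read_hdfs_csv(host, path, port=8020):
    """Read a CSV from HDFS via pyarrow and return a pandas DataFrame.

    Imports are kept inside the helper so the sketch is importable even
    where pyarrow is not installed; hoist them to module level in real code.
    """
    import pyarrow as pa
    import pandas as pd

    # Legacy HDFS client API, present in the pyarrow 0.12-0.14 range.
    fs = pa.hdfs.connect(host=host, port=port)
    with fs.open(path, 'rb') as f:
        return pd.read_csv(f)

# Usage against the cluster from the question (requires a live namenode):
# df = read_hdfs_csv('SER', '/htmpa/a.csv')
# print(df.head(n=3))
```

If you still want a dask dataframe afterwards, `dask.dataframe.from_pandas(df, npartitions=1)` will wrap the result.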

