I want to read CSV data from an HDFS server, but it throws an exception like the one below:
hdfsSeek(desiredPos=64000000): FSDataInputStream#seek error:
java.io.EOFException: Cannot seek after EOF
    at org.apache.hadoop.hdfs.DFSInputStream.seek(DFSInputStream.java:1602)
    at org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:65)
My Python code:
from dask import dataframe as dd
df = dd.read_csv('hdfs://SER/htmpa/a.csv').head(n=3)
The CSV file:
user_id,item_id,play_count
0,0,500
0,1,3
0,3,1
1,0,4
1,3,1
2,0,1
2,1,1
2,3,5
3,0,1
3,3,4
4,1,1
4,2,8
4,3,4
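For reference, loading the same sample with plain pandas from an in-memory string (no HDFS involved) shows what `head(n=3)` should return on this data:

```python
import io
import pandas as pd

# The sample data from the question, inlined so this runs without HDFS.
data = """user_id,item_id,play_count
0,0,500
0,1,3
0,3,1
1,0,4
1,3,1
2,0,1
2,1,1
2,3,5
3,0,1
3,3,4
4,1,1
4,2,8
4,3,4
"""

head3 = pd.read_csv(io.StringIO(data)).head(n=3)
print(head3)
```

If this works but the `hdfs://` URL fails, the problem is in the HDFS access layer, not in the CSV parsing itself.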
Are you running within an IDE or a Jupyter notebook?
We are running on a Cloudera distribution and also get a similar error. From what we understand, it is not connected to dask but rather to our Hadoop configuration. In any case, we successfully use the pyarrow library when accessing HDFS. Be aware that if you need to access Parquet files, run with version 0.12 and not 0.13 (see the discussion on GitHub).

Update: pyarrow version 0.14 is out and should solve the problem.
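A minimal sketch of the pyarrow route, assuming a hypothetical helper `read_csv_head` and the host/port of your NameNode (`SER`/`8020` here are guesses based on the question's URL; adjust for your cluster):

```python
import pandas as pd

def read_csv_head(open_file, n=3):
    """Read a CSV from an already-opened binary file object and
    return the first n rows. (Helper introduced for illustration.)"""
    with open_file as f:
        return pd.read_csv(f).head(n=n)

# On a real cluster, the file object would come from pyarrow's
# legacy HDFS client (available in the 0.x releases discussed above):
#   import pyarrow as pa
#   fs = pa.hdfs.connect(host='SER', port=8020)
#   df = read_csv_head(fs.open('/htmpa/a.csv', 'rb'))
```

Keeping the parsing separate from the filesystem access makes it easy to test the same code path against a local file first, then switch to the HDFS connection.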