How to read a big file (bigger than 1 billion rows) in chunks
I have a dataset with more than 1 billion rows and I would like to read it 100k rows at a time. First I tried reading it with nrows, as below:
df = pd.read_csv("filename.csv", nrows=100000, sep='|')
But it throws a UnicodeDecodeError, as below:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc9 in position 3: unexpected end of data
Then I tried the chunksize parameter, as below:
a = pd.read_csv("filename.csv", chunksize=10000, sep='|')
pd.concat(a)
But it gives me the same error. I have also tried the Dask library as below, but it gives the same error.
import dask.dataframe as dd
df = dd.read_csv('filename.csv')
Can you please help with a solution? Thanks in advance.
The solution is as follows:
df = pd.read_csv("filename.csv", nrows=100000, sep="|", encoding="iso8859-9")
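The error comes from the file's encoding, not from its size: byte 0xc9 is not valid UTF-8, so every reading strategy fails until the correct encoding (here ISO-8859-9, Turkish) is passed. The same encoding argument also works together with chunksize, which is what you want for a billion-row file, since it returns an iterator of DataFrames instead of one huge frame. A minimal, self-contained sketch (the file, column names, and data below are made up for the demo):

```python
import os
import tempfile

import pandas as pd

# Write a small pipe-delimited demo file in ISO-8859-9 encoding.
# The non-ASCII characters produce bytes that break utf-8 decoding,
# reproducing the UnicodeDecodeError from the question.
path = os.path.join(tempfile.mkdtemp(), "demo.csv")
with open(path, "w", encoding="iso8859-9") as f:
    f.write("id|name\n")
    for i in range(10):
        f.write(f"{i}|çağrı\n")

# chunksize makes read_csv return an iterator of DataFrames, so memory
# use stays bounded; pass the encoding so each chunk decodes correctly.
parts = []
for chunk in pd.read_csv(path, sep="|", encoding="iso8859-9", chunksize=4):
    parts.append(len(chunk))  # process each chunk here instead

print(parts)  # → [4, 4, 2]  (10 rows split into chunks of 4)
```

In practice you would replace the append with whatever per-chunk processing you need (filtering, aggregating, writing to a database), keeping only the reduced result in memory. The same `encoding="iso8859-9"` argument can also be passed to `dd.read_csv`, since Dask forwards keyword arguments to pandas.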