
How to read a big file (bigger than 1 billion rows) in chunks

I have a dataset bigger than 1 billion rows and I would like to read it 100k rows at a time. First I tried to read it with nrows as below:

df = pd.read_csv("filename.csv", nrows=100000, sep='|')

But it throws a UnicodeDecodeError as below.

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc9 in position 3: unexpected end of data

Then I tried it with the chunksize parameter as below:

a = pd.read_csv("filename.csv", chunksize=10000, sep='|')
pd.concat(a)

But it gives the same error. I have also tried the Dask library as below, but it gives the same error.

import dask.dataframe as dd
df = dd.read_csv('filename.csv')

Can you please help with a solution?

Thanks in advance.

The solution is as follows:

df = pd.read_csv("filename.csv", nrows=100000, sep="|", encoding="iso8859-9")
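The error comes from the file's encoding, not from the file's size, so the same encoding argument can be combined with chunksize to process all 1 billion+ rows 100k at a time. Below is a minimal sketch along those lines; "filename.csv" and the '|' separator come from the question, and process_chunk is a hypothetical placeholder for whatever per-chunk work is needed.

import pandas as pd

# Read the file in 100k-row chunks, using the encoding that fixed the
# UnicodeDecodeError. read_csv with chunksize returns an iterable reader
# instead of loading the whole file into memory at once.
reader = pd.read_csv("filename.csv", sep="|", encoding="iso8859-9",
                     chunksize=100000)

for chunk in reader:
    process_chunk(chunk)  # hypothetical: replace with your own per-chunk logic

The same encoding argument can also be passed to dd.read_csv if you prefer to stay with Dask.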

