
How to read a big file (bigger than 1 billion rows) in chunks

I have a dataset bigger than 1 billion rows and I would like to read it 100k rows at a time. First I tried to read it with nrows as below:

df = pd.read_csv("filename.csv", nrows=100000, sep='|')

But it throws a UnicodeDecodeError as below.

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc9 in position 3: unexpected end of data

Then I tried it with the chunksize parameter as below:

a = pd.read_csv("filename.csv", chunksize=10000, sep='|')
pd.concat(a)

But it gives the same error. I have also tried the Dask library as below, but it gives the same error.

import dask.dataframe as dd
df = dd.read_csv('filename.csv')

Can you please help with a solution?

Thanks in advance.

The solution is as follows:

df = pd.read_csv("filename.csv", nrows=100000, sep="|", encoding="iso8859-9")
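The error comes from the file's encoding, not from the file's size, so the same encoding argument can be combined with chunksize to process all 1 billion+ rows 100k at a time. Below is a minimal sketch along those lines; "filename.csv" and the '|' separator come from the question, and process_chunk is a hypothetical placeholder for whatever per-chunk work is needed.

import pandas as pd

# Read the file in 100k-row chunks, using the encoding that fixed the
# UnicodeDecodeError. read_csv with chunksize returns an iterable reader
# instead of loading the whole file into memory at once.
reader = pd.read_csv("filename.csv", sep="|", encoding="iso8859-9",
                     chunksize=100000)

for chunk in reader:
    process_chunk(chunk)  # hypothetical: replace with your own per-chunk logic

The same encoding argument can also be passed to dd.read_csv if you prefer to stay with Dask.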

