
Read large csv files with Python

I used Dask to read a 2.5 GB csv file and Python gave me errors. This is the code I wrote:

import pandas as pd
import numpy as np
import time
from dask import dataframe as df1

# time how long Dask takes to load the CSV
s_time_dask = time.time()
dask_df = df1.read_csv('3SPACK_N150_7Ah_PressureDistributionStudy_Data_Matrix.csv')
e_time_dask = time.time()

The following is the error I got from Python:

dask_df = df1.read_csv('3SPACK_N150_7Ah_PressureDistributionStudy_Data_Matrix.csv')

File "C:\ProgramData\Anaconda3\lib\site-packages\dask\dataframe\io\csv.py", line 645, in read
    return read_pandas(

File "C:\ProgramData\Anaconda3\lib\site-packages\dask\dataframe\io\csv.py", line 525, in read_pandas
    head = reader(BytesIO(b_sample), **kwargs)

File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 686, in read_csv
    return _read(filepath_or_buffer, kwds)

File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 458, in _read
    data = parser.read(nrows)

File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1196, in read
    ret = self._engine.read(nrows)

File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 2155, in read
    data = self._reader.read(nrows)

File "pandas\_libs\parsers.pyx", line 847, in pandas._libs.parsers.TextReader.read

File "pandas\_libs\parsers.pyx", line 862, in pandas._libs.parsers.TextReader._read_low_memory

File "pandas\_libs\parsers.pyx", line 918, in pandas._libs.parsers.TextReader._read_rows

File "pandas\_libs\parsers.pyx", line 905, in pandas._libs.parsers.TextReader._tokenize_rows

File "pandas\_libs\parsers.pyx", line 2042, in pandas._libs.parsers.raise_parser_error

ParserError: Error tokenizing data. C error: Expected 1 fields in line 43, saw 9

Can you please help me with this problem?

Thanks

Your error has nothing to do with memory. Dask loads text files like CSVs chunk-wise, by choosing fixed byte offsets and then scanning from each offset to the nearest newline character. This is so that you can access the same file from multiple processes or even multiple machines, and only work on as many chunks as you have worker threads at a time.

Unfortunately, a newline character doesn't always mean the end of a row, since newlines can occur within quoted strings of some text field. This means that you essentially cannot read the file with dask's read_csv, unless you preemptively find a set of byte offsets that guarantees clean partitioning without breaking in the middle of a quoted string.
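If you still want to go through Dask, one possible workaround is to stop it from splitting the file at byte offsets at all. A minimal sketch, assuming the problem really is newlines inside quoted fields (blocksize=None is a standard dask.dataframe.read_csv option; whether it resolves the error for this particular file is an assumption):

from dask import dataframe as dd

# blocksize=None: one partition per input file, so Dask never picks a byte
# offset that could land inside a quoted field.  You give up parallel reading
# of a single file, but the partition-boundary problem described above goes away.
df = dd.read_csv('3SPACK_N150_7Ah_PressureDistributionStudy_Data_Matrix.csv',
                 blocksize=None)
print(df.head())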

In short: you're out of memory. You're trying to load more data into python than can fit in the memory of your machine (python's memory usage is higher than C/C++/etc., but you'd still hit a limit with those languages too).

To fix this, you probably need to read the file with the csv module's reader instead, where you can read it line by line. Then process each line to take only the columns you want, or start any aggregation you want to do on a line-by-line basis. If you can't do this, then you either need to use a smaller dataset if you really need all of the data in memory at once, or use a system with more memory.
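A minimal sketch of that line-by-line approach, assuming a one-line header and a numeric column of interest at index 2 (both details are illustrative, not from the question):

import csv

total = 0.0
n_rows = 0
with open('3SPACK_N150_7Ah_PressureDistributionStudy_Data_Matrix.csv', newline='') as f:
    reader = csv.reader(f)
    next(reader)                  # skip the (assumed) header row
    for row in reader:
        total += float(row[2])    # keep only the column you need
        n_rows += 1

print('rows:', n_rows, 'mean:', total / n_rows)

Only one row is ever held in memory at a time, so the file size stops being the limiting factor.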

If your file is 2.5 GB, it wouldn't surprise me if your system needed ~20 GB of memory or so. But the right way to estimate is to load a fixed number of rows, figure out how much memory your process is using, then read twice that number of rows and look at the memory usage again. Subtract the lower number from the higher, and that's (approximately) how much memory you need to hold that many rows. You can then calculate how much you need for all the data.
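A minimal sketch of that estimate using pandas. It measures the DataFrame's own memory footprint rather than the whole process's, which is a simplification, and the row counts are arbitrary placeholders:

import pandas as pd

path = '3SPACK_N150_7Ah_PressureDistributionStudy_Data_Matrix.csv'

small = pd.read_csv(path, nrows=100_000)
large = pd.read_csv(path, nrows=200_000)

# difference between the two loads ≈ memory needed for 100,000 extra rows
per_row = (large.memory_usage(deep=True).sum()
           - small.memory_usage(deep=True).sum()) / 100_000

total_rows = 15_000_000  # placeholder: substitute the real row count of your file
print(f'estimated memory for the full file: {per_row * total_rows / 1e9:.1f} GB')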

If you really need to open all your data, you can do it in chunks so it doesn't take all your memory: read_csv() has a parameter called chunksize. You can see how it works at kite.com.
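A minimal sketch of chunked reading; the chunk size and the 'pressure' column used for the running aggregate are assumptions for illustration:

import pandas as pd

path = '3SPACK_N150_7Ah_PressureDistributionStudy_Data_Matrix.csv'

running_sum = 0.0
running_count = 0
# each iteration yields a DataFrame of at most 100,000 rows
for chunk in pd.read_csv(path, chunksize=100_000):
    running_sum += chunk['pressure'].sum()
    running_count += len(chunk)

print('mean pressure:', running_sum / running_count)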

You can also check the pandas documentation.
