Python: Dask vs pandas, error in read_csv

Question

I get an error when reading a file with Dask, although the same file reads fine with pandas:

import dask.dataframe as dd
import pandas as pd
pdf = pd.read_csv("./tous_les_docs.csv")
pdf.shape
(20140796, 7)

whereas Dask raises an error:

df = dd.read_csv("./tous_les_docs.csv")
df.describe().compute()
ParserError: Error tokenizing data. C error: EOF inside string starting at line 192999

Answer: adding blocksize=None makes it work:

df = dd.read_csv("./tous_les_docs.csv", blocksize=None)

Answer 1

The documentation says that this can happen:

It should also be noted that this function may fail if a CSV file includes quoted strings that contain the line terminator. To get around this you can specify blocksize=None to not split files into multiple partitions, at the cost of reduced parallelism.

http://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.read_csv
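To decide whether a given file needs blocksize=None, one option is to check for quoted fields that contain a line terminator. The sketch below uses only the standard-library csv module; the helper name has_multiline_fields and the sample data are hypothetical, and on a 20-million-row file the scan will take a while.

```python
import csv
import io


def has_multiline_fields(text_stream):
    """Return True if any parsed CSV field contains a line terminator."""
    for row in csv.reader(text_stream):
        if any("\n" in field or "\r" in field for field in row):
            return True
    return False


# Small in-memory sample: the second record has a newline inside quotes.
sample = 'id,comment\n1,"first line\nsecond line"\n2,plain\n'
print(has_multiline_fields(io.StringIO(sample)))

# For a real file, open it with newline="" so the csv module sees the
# original line terminators:
# with open("./tous_les_docs.csv", newline="") as f:
#     print(has_multiline_fields(f))
```

If the check reports such fields, reading with blocksize=None (as in the question) avoids splitting the file in the middle of a record.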

It seems that Dask splits the file into chunks at line terminators, but without scanning the whole file from the start to check whether a line terminator falls inside a quoted string.
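The effect described above can be reproduced without Dask. The following is a simplified simulation, not Dask's actual implementation: the helper split_at_newline (a hypothetical name) cuts the text at the first line terminator after a byte offset, and the first chunk then ends inside an open quoted field, which is the same kind of failure as the "EOF inside string" error in the question.

```python
import csv
import io

# One record contains a newline inside a quoted field.
data = 'id,comment\n1,"first line\nsecond line"\n2,plain\n'


def split_at_newline(text, approx_offset):
    """Cut text just after the first line terminator at or after approx_offset."""
    cut = text.index("\n", approx_offset) + 1
    return text[:cut], text[cut:]


def parse_strict(chunk):
    # strict=True makes the csv module raise csv.Error on an
    # unterminated quoted field instead of silently accepting it.
    return list(csv.reader(io.StringIO(chunk), strict=True))


# Parsing the whole text works: the quoted newline stays inside one field.
whole = parse_strict(data)

# Splitting at an offset that lands inside the quoted field breaks the record.
first, second = split_at_newline(data, 15)
try:
    parse_strict(first)
    first_ok = True
except csv.Error as exc:
    first_ok = False
    print("first chunk failed:", exc)
```

Reading the file as a single partition avoids the cut entirely, which is why blocksize=None works at the cost of parallelism.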

Answer 1
1 2019-04-29 12:13:19
