Error tokenizing data when loading a CSV file into a Pandas DataFrame

Question

I have an 8GB CSV file that contains information about companies created in France. When I try to load it in Python using pandas.read_csv, I get various errors. I believe a combination of three factors causes the problem:

The size of the file (8GB)
The French characters in the cells (like “é”)
The fact that this CSV file is organized like an Excel file; the fields are separated into columns, just like an XLS file

When I tried to import the file using:

import pandas as pd
df = pd.read_csv(r'C:\..\data.csv')

I got the following error: OSError: Initializing from file failed

Then, to rule out the file size as the cause, I copied the file (data.csv) and kept only the first 25 rows (data2.csv). This is a much lighter file:

df = pd.read_csv(r'C:\..\data2.csv')

I got the same OSError: Initializing from file failed error.

After some research, I tried the following code with data2.csv:

df = pd.read_csv(r'C:\..\data2.csv', sep="\t", encoding="latin")

This time the import succeeds, but the result has a weird format, like this: https://imgur.com/a/y6WJHC5 . All fields end up in the same column.

So even with the size problem eliminated, pandas doesn't read the CSV file properly. And I still need to work with the main file, data.csv. So I tried the same code on the initial file (data.csv):

df = pd.read_csv(r'C:\..\data.csv', sep="\t", encoding="latin")

I get: ParserError: Error tokenizing data. C error: out of memory

What is the correct code to read data.csv?

Thank you,

Answer 1

From your image it looks like the file is separated by semicolons (;). Try using ";" as the sep in the read_csv function.
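To illustrate the separator point, here is a minimal sketch on a made-up sample. The column names and rows are hypothetical stand-ins for the real file, and the latin-1 encoding is carried over from the question rather than confirmed. It shows why sep="\t" puts everything into one column while sep=";" splits the fields:

```python
import io

import pandas as pd

# Hypothetical sample standing in for the first rows of data2.csv:
# semicolon-separated, latin-1 encoded, with French accented characters.
sample = (
    "siren;nom;ville\n"
    "123456789;Société Générale;Paris\n"
    "987654321;Crédit Agricole;Montrouge\n"
).encode("latin-1")

# With the wrong separator, each whole line becomes a single field.
df_bad = pd.read_csv(io.BytesIO(sample), sep="\t", encoding="latin-1")

# With sep=";" the fields land in separate columns.
df = pd.read_csv(io.BytesIO(sample), sep=";", encoding="latin-1")

print(df_bad.shape)  # one column only
print(df.shape)      # three columns
```

For the real file, the same call would take the path to data2.csv in place of the in-memory buffer.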

Pandas reads the CSV into RAM, and an 8GB file could easily exhaust it. Try reading the file in chunks. See this answer.
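As a rough sketch of chunked reading, assuming the file is semicolon-separated and latin-1 encoded as discussed above: the chunksize parameter of read_csv returns an iterator of DataFrames, so only one chunk is in memory at a time. The helper name filter_csv_in_chunks and the keep callback are made up for this example, and the approach only helps if the rows you keep fit in RAM:

```python
import pandas as pd


def filter_csv_in_chunks(path, keep, chunksize=100_000):
    """Read a large CSV piece by piece, keeping only the rows selected by `keep`.

    `keep` is a function that takes a DataFrame chunk and returns a boolean mask.
    The separator and encoding are assumptions based on the question.
    """
    kept = []
    reader = pd.read_csv(path, sep=";", encoding="latin-1", chunksize=chunksize)
    for chunk in reader:
        # Reduce each chunk before storing it, so the full file never sits in RAM.
        kept.append(chunk[keep(chunk)])
    return pd.concat(kept, ignore_index=True)
```

A call might look like filter_csv_in_chunks(path, lambda c: c["ville"] == "Paris"), where "ville" is a hypothetical column name. If you need every row, consider aggregating per chunk or passing usecols to read_csv to load only the columns you need.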

Answer 1
1 2019-01-04 16:02:36
