How do I import .txt data into a pandas DataFrame?

Question

I am trying to import the data from the file at https://drive.google.com/file/d/1leOUk4Z5xp9tTiFLpxgk_7KBv3xwn5eW/view into a pandas DataFrame. I have tried using

    data = pd.read_csv('data_engineering_assignment.txt',sep="|")

but I got the error "ParserError: Error tokenizing data. C error: Expected 9 fields in line 231, saw 10". I don't want to use 'error_bad_lines=False' and skip lines of data.

Kindly help.

Answer 1

The problem is in your dataset: the description_text column sometimes contains a | character. For example, the row with id 5d0c7c4c312ff75188d84954 has one in "of A|X design", so pandas treats the part after the | as a new column. That is why the error says it expected 9 fields but saw 10. I hope this helps you understand the problem.
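To see how many rows are affected before deciding on a fix, you could count the fields on each line yourself. The following is only a sketch: `find_bad_lines` is a hypothetical helper, the sample text is made up to stand in for the real file, and it assumes the file does not use quoting (a quoted | would be counted as a separator here, unlike in pandas).

```python
import io


def find_bad_lines(handle, sep="|", expected=9):
    """Return (line_number, field_count) for each line whose field count differs from `expected`."""
    bad = []
    for line_number, line in enumerate(handle, start=1):
        # N separators on a line means N + 1 fields
        count = line.rstrip("\n").count(sep) + 1
        if count != expected:
            bad.append((line_number, count))
    return bad


# Made-up three-column sample; the last row has a stray '|' in its description
sample = io.StringIO(
    "_id|name|description_text\n"
    "1|chair|plain\n"
    "2|lamp|of A|X design\n"
)
bad_lines = find_bad_lines(sample, expected=3)
print(bad_lines)
```

For the real file you would pass `open('data_engineering_assignment.txt', encoding='utf-8')` and keep `expected=9`; the encoding is a guess.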

Answer 2

You can specify the column names explicitly, stating that there are 10:

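    import pandas as pd

    # 'other' is an extra tenth column that catches the overflow when description_text contains a '|'
    cols = ['_id','name','price','website_id','sku','url','brand','media','description_text','other']
    dataframe = pd.read_csv('./data_engineering_assignment.txt', names=cols, sep='|')
    # Rejoin the overflow and restore the '|' that was consumed as a separator.
    # Rows without overflow have NaN in 'other', so fill with '' to keep their description intact.
    dataframe['description_text'] = dataframe['description_text'].map(str) + ('|' + dataframe['other']).fillna('')
    dataframe.to_csv('./data_engineering_assignment_v2.txt', index=False, sep=',')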

You may get a DtypeWarning about mixed types, because pandas has to guess each column's data type while reading the file in chunks. It is harmless here; passing dtype=str or low_memory=False to read_csv avoids it.
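Another option, if your pandas is 1.4 or newer, is to let read_csv repair the over-long rows as it reads them: with engine='python', on_bad_lines accepts a callable that receives the split fields of each bad line and returns a corrected list. This is a sketch, not a tested solution for the actual file: `merge_extra_fields` is a hypothetical helper, the sample data is made up, and it assumes the stray | always falls in the last column (description_text is the ninth of nine columns above).

```python
import io

import pandas as pd


def merge_extra_fields(n_cols, sep="|"):
    """Build an on_bad_lines handler that folds surplus fields back into the last column."""
    def handler(bad_line):
        # bad_line is the list of strings pandas got by splitting the line on sep
        return bad_line[: n_cols - 1] + [sep.join(bad_line[n_cols - 1:])]
    return handler


# Made-up three-column sample; the last row has a stray '|' in its description
sample = io.StringIO(
    "_id|name|description_text\n"
    "1|chair|plain\n"
    "2|lamp|of A|X design\n"
)
df = pd.read_csv(
    sample,
    sep="|",
    engine="python",  # callables for on_bad_lines are only supported by the python engine
    on_bad_lines=merge_extra_fields(3),
)
print(df)
```

For the real file you would pass the file path instead of `sample` and use `merge_extra_fields(9)`. No rows are skipped, and the | stays inside description_text.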
