Loading a text file with uneven commas into a pandas DataFrame

Question

15/09/2017, 10:20 - Jatin: Robin is the meeting on???
15/09/2017, 10:23 - Robin: No
15/09/2017, 10:23 - Robin: Thanks for the update
15/09/2017, 10:23 - Robin: can we expect it soon
15/09/2017, 10:24 - Jatin: it will be this weekend, most likely
15/09/2017, 10:24 - Jatin: kindly be prepared
15/09/2017, 10:24 - Robin: Sure no issues
15/09/2017, 10:26 - Jatin: good luck

I have a data file that looks like this, and I want to load it into a pandas DataFrame. The issue is that if I do

pd.read_csv("file.txt")

it throws an error:

Error tokenizing data. C error: Expected 2 fields in line 695, saw 3

Can someone please suggest the easiest way to do this with pandas?

Answer 1

It appears that you are trying to load a WhatsApp chat file that was exported by email. I worked on something similar, and here is code that worked for me.

import pandas as pd

# Load every line of the chat into a single column with a convenient name.
# header=None keeps the first message as data instead of using it as the header.
attempt_load = pd.read_table("WhatsApp Chat with Panda.txt", header=None, names=["namesake"])
name = []
message = []
for line in attempt_load["namesake"]:
    # Each line has 20 characters ("15/09/2017, 10:20 - ") before the name appears,
    # so fixed slices can separate the name from the message.
    name.append(line[20:25])  # both names are 5 characters long, so 20:25 is the name
    message.append(line[26:].strip())  # everything after the colon, minus the leading space

You can do a similar thing if you want the timestamps as well.

Limitations: this will not work if the names have different lengths. I worked around that by renaming the contacts in the chat so their names were the same length before exporting the chat file by email.

I am sure someone will have a more dynamic and cleaner fix.
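
One possible way around the fixed-length limitation is to match each line against a regular expression instead of slicing at fixed positions. The following is only a sketch, assuming every line follows the "dd/mm/yyyy, hh:mm - name: message" layout shown in the question. The sample data, the name "Robindra", and the column names are made up for illustration.

```python
import io
import pandas as pd

# Sample lines in the same layout as the question's file. The second name is
# deliberately longer to show that fixed-width slicing is not needed.
sample = io.StringIO(
    "15/09/2017, 10:20 - Jatin: Robin is the meeting on???\n"
    "15/09/2017, 10:23 - Robindra: No, not yet\n"
)

# One raw line per row; reading with plain Python avoids delimiter and quote handling.
raw = pd.Series([line.rstrip("\n") for line in sample], name="raw")

# Named groups become the column names of the frame returned by str.extract.
pattern = (
    r"^(?P<timestamp>\d{2}/\d{2}/\d{4}, \d{2}:\d{2})"
    r" - (?P<name>[^:]+): (?P<message>.*)$"
)
chat = raw.str.extract(pattern)

# Convert the timestamp text into real datetimes (day comes first in this file).
chat["timestamp"] = pd.to_datetime(chat["timestamp"], format="%d/%m/%Y, %H:%M")
print(chat)
```

With a real file, replace the io.StringIO object with open("file.txt", encoding="utf-8"). Lines that do not match the pattern, such as the continuation of a multi-line message, come back as missing values and would need separate handling.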

Answer 2

Alternatively, specify the separator more explicitly: 或者，更明确地指定分隔符：

pd.read_csv('test.txt', names=['timestamp', 'text'], sep=' - ')

This will raise a warning about falling back to the Python engine. It is only a warning that performance may be reduced for very large files.
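
As a sketch of how this could be taken one step further, passing engine="python" explicitly avoids the fallback warning, and the text column can then be split into a name and a message at the first ": ". The sample data and the column names "name" and "message" are assumptions made for illustration.

```python
import io
import pandas as pd

# Two sample lines in the same layout as the question's file.
sample = io.StringIO(
    "15/09/2017, 10:20 - Jatin: Robin is the meeting on???\n"
    "15/09/2017, 10:23 - Robin: No\n"
)

# A multi-character separator needs the Python engine; naming it up front
# means pandas does not have to warn about falling back to it.
df = pd.read_csv(sample, names=["timestamp", "text"], sep=" - ", engine="python")

# Split only at the first ": " so colons inside the message are preserved.
df[["name", "message"]] = df["text"].str.split(": ", n=1, expand=True)
print(df[["timestamp", "name", "message"]])
```

Note that a message which itself contains " - " would be split into extra fields, which this sketch does not handle.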
