Load a text file that has uneven commas in a pandas DataFrame

15/09/2017, 10:20 - Jatin: Robin is the meeting on???
15/09/2017, 10:23 - Robin: No
15/09/2017, 10:23 - Robin: Thanks for the update
15/09/2017, 10:23 - Robin: can we expect it soon
15/09/2017, 10:24 - Jatin: it will be this weekend, most likely
15/09/2017, 10:24 - Jatin: kindly be prepared
15/09/2017, 10:24 - Robin: Sure no issues
15/09/2017, 10:26 - Jatin: good luck

I have a data file that looks like this, and I want to load it into a pandas DataFrame. The problem is that if I do

pd.read_csv("file.txt") 

it throws an error:

Error tokenizing data. C error: Expected 2 fields in line 695, saw 3

Can someone please suggest the easiest way to do this with pandas?

This looks like a WhatsApp chat file exported by email. I worked on something similar, and here is code that worked for me.

import pandas as pd

# Load every line of the chat into a single column with a convenient name.
# header=None keeps the first message as data instead of using it as the header.
attempt_load = pd.read_table("WhatsApp Chat with Panda.txt", header=None, names=["namesake"])
name = []
message = []
for line in attempt_load["namesake"]:
    # The prefix "15/09/2017, 10:20 - " is 20 characters long,
    # so the name starts at index 20.
    name.append(line[20:25])  # both names are 5 characters long
    message.append(line[27:])  # skip the ": " that follows the name

You can do a similar thing if you want the timestamps as well.

Limitation: this will not work if the names have different lengths. I worked around that by renaming the contacts in the chat before exporting it by email.

I am sure someone will have a more dynamic and cleaner fix.
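One possible way around the fixed-length limitation is a regular expression with `Series.str.extract`, which does not depend on the length of the names. This is only a sketch: it assumes every message line follows the `dd/mm/yyyy, hh:mm - name: message` layout seen in the excerpt, and the column names `timestamp`, `name` and `message` are chosen here for illustration. Lines that do not match the pattern (for example continuation lines of a multi-line message) would come out as missing values.

```python
import io

import pandas as pd

# Sample lines from the excerpt; for a real file, read the lines with open(...) instead.
sample = """15/09/2017, 10:20 - Jatin: Robin is the meeting on???
15/09/2017, 10:23 - Robin: No
15/09/2017, 10:24 - Jatin: it will be this weekend, most likely
"""

lines = pd.Series(io.StringIO(sample).read().splitlines())

# Named groups become the column names of the resulting DataFrame.
pattern = (
    r"^(?P<timestamp>\d{2}/\d{2}/\d{4}, \d{2}:\d{2})"
    r" - (?P<name>[^:]+): (?P<message>.*)$"
)
chat = lines.str.extract(pattern)

# Parse the day-first timestamp into a proper datetime column.
chat["timestamp"] = pd.to_datetime(chat["timestamp"], format="%d/%m/%Y, %H:%M")

print(chat)
```

Because the name is captured as "everything up to the first colon", commas inside the message no longer matter, and the contacts do not need to be renamed.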

Alternatively, specify the separator more explicitly:

pd.read_csv('test.txt', names=['timestamp', 'text'], sep=' - ') 

This will raise a warning about falling back to the Python engine, because the C engine does not support separators longer than one character. It is only a warning that performance may be reduced for very large files.
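As a rough sketch of how this could be taken further: passing `engine="python"` explicitly should avoid the fallback warning, and the `text` column can then be split once on `": "` into a name and a message. The column names `name` and `message` are assumptions made for this example, and a message that itself contains `" - "` would still be split into extra fields.

```python
import io

import pandas as pd

# Sample lines from the excerpt; replace io.StringIO(sample) with the file path.
sample = """15/09/2017, 10:20 - Jatin: Robin is the meeting on???
15/09/2017, 10:23 - Robin: No
15/09/2017, 10:24 - Jatin: it will be this weekend, most likely
"""

df = pd.read_csv(
    io.StringIO(sample),
    names=["timestamp", "text"],
    sep=" - ",
    engine="python",  # request the Python engine up front
)

# Split only on the first ": " so colons inside the message are preserved.
parts = df["text"].str.split(": ", n=1, expand=True)
df["name"] = parts[0]
df["message"] = parts[1]

# Convert the day-first timestamp strings to datetimes.
df["timestamp"] = pd.to_datetime(df["timestamp"], format="%d/%m/%Y, %H:%M")

print(df[["timestamp", "name", "message"]])
```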
