Load a text file that has uneven commas in a pandas DataFrame
15/09/2017, 10:20 - Jatin: Robin is the meeting on???
15/09/2017, 10:23 - Robin: No
15/09/2017, 10:23 - Robin: Thanks for the update
15/09/2017, 10:23 - Robin: can we expect it soon
15/09/2017, 10:24 - Jatin: it will be this weekend, most likely
15/09/2017, 10:24 - Jatin: kindly be prepared
15/09/2017, 10:24 - Robin: Sure no issues
15/09/2017, 10:26 - Jatin: good luck
I have a data file that looks like this. I intend to load it into a pandas DataFrame. The issue is that if I do
pd.read_csv("file.txt")
It throws an error:
Error tokenizing data. C error: Expected 2 fields in line 695, saw 3
Can someone please suggest the easiest possible way to do this with pandas?
It appears that you are trying to load a WhatsApp chat file exported by email. I worked on something similar, and here is some code that worked for me.
import pandas as pd

# Load every line of the export into a single column; header=None keeps the first message as data
atempt_load = pd.read_table("WhatsApp Chat with Panda.txt", header=None)
atempt_load.columns = ["namesake"]  # just a convenient name for the single column, used below
name = []
message = []
for i in range(len(atempt_load)):
    # There are 20 characters in front of each line before a name appears ("15/09/2017, 10:20 - "),
    # so fixed positions can be used to separate the name from the message.
    line = atempt_load["namesake"][i]
    name.append(line[20:25])  # both names are 5 characters long, so this takes characters 20 to 24
    message.append(line[26:])  # the rest of the line after the name and the colon
You can do a similar thing if you want timestamps as well.
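As a rough sketch of the timestamp idea, assuming every line starts with a 17-character "dd/mm/yyyy, HH:MM" stamp as in the sample above, the same fixed-position slicing could look like this. The variable names are only illustrative.

```python
import pandas as pd

# Two lines in the same layout as the chat export shown in the question.
lines = pd.Series([
    "15/09/2017, 10:20 - Jatin: Robin is the meeting on???",
    "15/09/2017, 10:23 - Robin: No",
])

# The first 17 characters hold "dd/mm/yyyy, HH:MM".
raw_timestamp = lines.str[:17]

# Convert the text to real datetimes; the format string is an assumption
# based on the sample (day first, 24-hour clock).
timestamp = pd.to_datetime(raw_timestamp, format="%d/%m/%Y, %H:%M")
```

If the export uses a different date layout (for example a 12-hour clock), the slice width and the format string would need adjusting.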
Limitations: it will not work if the names are of different lengths. I found a way around this by changing the names of the contacts in the chat before exporting the file by email.
I am sure someone will have a more dynamic and cleaner fix.
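One possible way to avoid the fixed-length limitation is to match each line with a regular expression instead of slicing at fixed positions. This is only a sketch: the pattern assumes the "dd/mm/yyyy, HH:MM - Name: message" layout from the sample, the function name parse_chat is made up for illustration, and treating non-matching lines as continuations of the previous message is an assumption about how multi-line messages appear in the export.

```python
import re
import pandas as pd

# Assumed layout: "dd/mm/yyyy, HH:MM - Name: message"
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{4}, \d{2}:\d{2}) - ([^:]+): (.*)$")

def parse_chat(lines):
    """Build a DataFrame with timestamp, name and message columns."""
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        match = LINE_RE.match(line)
        if match:
            rows.append(match.groups())
        elif rows:
            # Assumption: a line without a timestamp continues the previous message.
            ts, name, msg = rows[-1]
            rows[-1] = (ts, name, msg + "\n" + line)
    df = pd.DataFrame(rows, columns=["timestamp", "name", "message"])
    df["timestamp"] = pd.to_datetime(df["timestamp"], format="%d/%m/%Y, %H:%M")
    return df

sample = [
    "15/09/2017, 10:20 - Jatin: Robin is the meeting on???",
    "15/09/2017, 10:23 - Robin: No",
    "15/09/2017, 10:24 - Jatin: it will be this weekend, most likely",
    # Hypothetical line with a longer name, not from the original file.
    "15/09/2017, 10:26 - Alexander: good luck",
]
chat = parse_chat(sample)
```

For a real file, the lines could come from open("WhatsApp Chat with Panda.txt", encoding="utf-8"). Because the name is captured up to the first colon, names of any length are handled, and commas inside messages no longer matter.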
Alternatively, specify the separator more explicitly:
pd.read_csv('test.txt', names=['timestamp', 'text'], sep=' - ')
This will raise a warning about falling back to the Python engine, because the default C engine does not support multi-character separators. It is only a warning that performance may be reduced for very large files; passing engine='python' explicitly avoids it.
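Building on the explicit separator, the 'text' column can then be split into a name and a message. The following is a minimal sketch under the assumption that every line has the "timestamp - Name: message" layout of the sample; the column names name and message are illustrative, and io.StringIO stands in for the real file.

```python
import io
import pandas as pd

# In-memory stand-in for the chat file, using lines from the question.
sample = io.StringIO(
    "15/09/2017, 10:20 - Jatin: Robin is the meeting on???\n"
    "15/09/2017, 10:23 - Robin: No\n"
    "15/09/2017, 10:24 - Jatin: it will be this weekend, most likely\n"
)

# engine="python" is requested explicitly because " - " is a multi-character separator.
df = pd.read_csv(sample, names=["timestamp", "text"], sep=" - ", engine="python")

# Split only on the first ": " so colons inside the message are preserved.
df[["name", "message"]] = df["text"].str.split(": ", n=1, expand=True)

# Parse the timestamp text; the format is assumed from the sample lines.
df["timestamp"] = pd.to_datetime(df["timestamp"], format="%d/%m/%Y, %H:%M")
```

Note that a message which itself contains " - " would produce an extra field, so such lines may not parse as intended with this approach.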