
Match messages from WhatsApp log in Python

I'd like to extract every message from a WhatsApp log. The messages have the following form:

One-line message:

[19.09.17, 19:54:48] Marc: If the mean is not in the thousands, there's the problem

Multi-line message:

[19.09.17, 19:54:59] Joe: > mean(aging$Population)
[1] 1593.577
Is what I get as solution

I was able to split it into date, time, sender and message, but only for one-liners, by first reading the text file line by line and then splitting those lines on the different separators. However, that does not work for messages with multiple lines. Now I'm trying to use regular expressions, with which I was able to get the dates and times, but I am still struggling to extend the pattern for messages to multiple lines.

import re

# Regular expressions to match different parts of the log
date = r'\[\d+\.\d+\.\d+,'
time = r'\d+:\d+:\d+\]'
message = r':\s+.+\['
message = re.compile(message, re.DOTALL)

Please note that my log is from the German WhatsApp, which is why the dates look a bit different. I also ended the patterns on the , and the ] to make sure I don't accidentally get matches from within messages.

I would like to do the same with the message pattern by ending on a [, which is usually the start of the next line (but that might not be very robust, since a [ can also appear at the start of a new line inside a message).

There is probably a much easier solution, but I am (as you can see) really bad with regex.

Here is a general regex and a solution using re.findall:

import re

# Sample log from the question: one single-line and one multi-line message
msg = (
    "[19.09.17, 19:54:48] Marc: If the mean is not in the thousands, there's the problem\n"
    "[19.09.17, 19:54:59] Joe: > mean(aging$Population)\n"
    "[1] 1593.577\nIs what I get as solution"
)

results = re.findall(r"\[(\d{2}\.\d{2}\.\d{2}), (\d{2}:\d{2}:\d{2})\] ([^:]+): (.*?)(?=\[\d{2}\.\d{2}\.\d{2}, \d{2}:\d{2}:\d{2}\]|$)", msg, re.MULTILINE|re.DOTALL)

# Each result is a tuple: (date, time, sender, message)
for item in results:
    print("date: " + item[0])
    print("time: " + item[1])
    print("sender: " + item[2])
    print("message: " + item[3])

date: 19.09.17
time: 19:54:48
sender: Marc
message: If the mean is not in the thousands, there's the problem
date: 19.09.17
time: 19:54:59
sender: Joe
message: > mean(aging$Population)

The pattern, which appears long and bloated, just matches the structure of your expected WhatsApp message. Of note, the pattern uses both multiline and DOTALL mode. DOTALL is what lets . match across line breaks, which messages spanning multiple lines need. The pattern stops consuming a given message when it sees either the start of the next message (in particular, the timestamp) or the end of the input. Be aware that in multiline mode $ also matches at the end of every line, so the lazy group stops at the first line break; this is why the output above shows only the first line of Joe's message.
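One possible way to capture the full multi-line message, offered as a sketch rather than as the answer's own method: drop re.MULTILINE and keep only re.DOTALL, so that $ matches only at the end of the input and the message group runs until the next timestamp. The names HEADER, NEXT_STAMP and log are made up for this illustration, and the pattern assumes the two-digit German date format shown in the question.

```python
import re

# Sample log from the question; Joe's message spans three lines.
log = (
    "[19.09.17, 19:54:48] Marc: If the mean is not in the thousands, there's the problem\n"
    "[19.09.17, 19:54:59] Joe: > mean(aging$Population)\n"
    "[1] 1593.577\n"
    "Is what I get as solution"
)

# Hypothetical helper names, introduced only for readability.
HEADER = r"\[(\d{2}\.\d{2}\.\d{2}), (\d{2}:\d{2}:\d{2})\] ([^:]+): "
NEXT_STAMP = r"\[\d{2}\.\d{2}\.\d{2}, \d{2}:\d{2}:\d{2}\]"

# re.DOTALL only: "." crosses newlines, while "$" matches at the end of the
# input (not at every line end). The "\s*" in the lookahead keeps the
# newline before the next timestamp out of the captured message.
pattern = re.compile(
    HEADER + r"(.*?)(?=\s*(?:" + NEXT_STAMP + r"|$))",
    re.DOTALL,
)

messages = pattern.findall(log)

for date, time, sender, text in messages:
    print("date: " + date)
    print("time: " + time)
    print("sender: " + sender)
    print("message: " + text)
```

A message that itself contains a full timestamp in square brackets would still be cut at that point, so this is only as robust as that assumption.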

Piggybacking on the answer above, in case it helps: I just trimmed the regex from Tim Biegeleisen.

results = re.findall(r"\[(\d{2}\.\d{2}\.\d{2}), (\d{2}:\d{2}:\d{2})\] ([^:]+): (.*?)(?=\[\d{2}\.\d{2}\.\d{2}, \d{2}:\d{2}:\d{2}\])", msg, re.MULTILINE|re.DOTALL)
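One thing to watch with the trimmed pattern: it ends a message only at the next timestamp, so the last message in a log has nothing to stop at and is not matched. As an alternative sketch (not taken from either answer), the log can be read line by line: a line that starts with a timestamp header opens a new record, and any other line is appended to the previous message. The function name parse_log and the dict keys are hypothetical, and the header format is assumed to be the one shown in the question.

```python
import re

# Header of a new message; match() anchors it at the start of each line.
HEADER = re.compile(
    r"\[(\d{2}\.\d{2}\.\d{2}), (\d{2}:\d{2}:\d{2})\] ([^:]+): (.*)"
)


def parse_log(text):
    """Return a list of dicts with date, time, sender and message."""
    records = []
    for line in text.splitlines():
        match = HEADER.match(line)
        if match:
            date, time, sender, message = match.groups()
            records.append(
                {"date": date, "time": time, "sender": sender, "message": message}
            )
        elif records:
            # No header at the start of the line: treat it as a
            # continuation of the previous message.
            records[-1]["message"] += "\n" + line
    return records


log = (
    "[19.09.17, 19:54:48] Marc: If the mean is not in the thousands, there's the problem\n"
    "[19.09.17, 19:54:59] Joe: > mean(aging$Population)\n"
    "[1] 1593.577\n"
    "Is what I get as solution"
)

for record in parse_log(log):
    print(record)
```

Because the header is only recognised at the start of a line, a line such as "[1] 1593.577" stays part of Joe's message, and the final message is kept even though no timestamp follows it.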
