
Match messages from a WhatsApp log in Python

I'd like to extract all messages from a WhatsApp log. The messages have the following form:

One line message:

[19.09.17, 19:54:48] Marc: If the mean is not in the thousands, there's the problem

Multiple line long message:

[19.09.17, 19:54:59] Joe: > mean(aging$Population)
[1] 1593.577
Is what I get as solution

I was able to split the log into date, time, sender and message, but only for one-line messages, by first reading the text file line by line and then splitting those lines on the different separators. However, that does not work for messages that span multiple lines. Now I'm trying to use regular expressions. I was able to get the dates and times that way, but I am still struggling to extend the message pattern to multiple lines.

import re

## regular expressions to match different things in the log
date = r'\[\d+\.\d+\.\d+,'
time = r'\d+\:\d+\:\d+]'
message = r':\s+.+\['
message = re.compile(message, re.DOTALL)

Please note that my log is from the German version of WhatsApp, which is why the dates look a bit different. Also, I ended the date and time patterns on the , and ] to make sure I don't accidentally get matches from within messages.

I would like to do the same with the message pattern by ending on a [ which is usually the start of the next line (but might not be really robust if that can be found in a message on a new line).
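
The robustness concern can be shown on the two sample messages above. This is only a minimal sketch: the names log, greedy and lazy are made up for illustration. With re.DOTALL the greedy .+ runs to the last [ in the whole log, and even a lazy .+? stops at the [1] inside Joe's message.

```python
import re

# The two sample messages from the question, joined into one log string.
log = (
    "[19.09.17, 19:54:48] Marc: If the mean is not in the thousands, there's the problem\n"
    "[19.09.17, 19:54:59] Joe: > mean(aging$Population)\n"
    "[1] 1593.577\n"
    "Is what I get as solution"
)

# Greedy: '.+' with DOTALL swallows everything up to the LAST '[' in the log.
greedy = re.compile(r':\s+.+\[', re.DOTALL)
# Lazy: '.+?' stops at the FIRST '[' it meets, even one inside a message.
lazy = re.compile(r':\s+.+?\[', re.DOTALL)

greedy_matches = greedy.findall(log)
lazy_matches = lazy.findall(log)

print(len(greedy_matches))    # 1: a single match spanning both messages
print(repr(lazy_matches[1]))  # Joe's message, cut off at the '[' of '[1]'
```

Either way, a bare [ is not a safe message delimiter; the end marker needs to be something that only a message header contains, such as the full timestamp.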

There is probably a way easier solution but I am (as you can see) really bad with regex.

Here is a general regex solution using re.findall:

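import re

# Adjacent string literals are concatenated; each "\n" marks a line break in the log.
msg = ("[19.09.17, 19:54:48] Marc: If the mean is not in the thousands, there's the problem\n"
       "[19.09.17, 19:54:59] Joe: > mean(aging$Population)\n"
       "[1] 1593.577\n"
       "Is what I get as solution")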

results = re.findall(r"\[(\d{2}\.\d{2}\.\d{2}), (\d{2}:\d{2}:\d{2})\] ([^:]+): (.*?)(?=\[\d{2}\.\d{2}\.\d{2}, \d{2}:\d{2}:\d{2}\]|$)", msg, re.DOTALL)

for item in results:
    print("date: " + item[0])
    print("time: " + item[1])
    print("sender: " + item[2])
    # rstrip() drops the line break captured just before the next timestamp
    print("message: " + item[3].rstrip())

date: 19.09.17
time: 19:54:48
sender: Marc
message: If the mean is not in the thousands, there's the problem
date: 19.09.17
time: 19:54:59
sender: Joe
message: > mean(aging$Population)
[1] 1593.577
Is what I get as solution

The pattern looks long, but it simply mirrors the structure of a WhatsApp message: a timestamp in brackets, the sender, a colon, and the text. It uses DOTALL mode so that . also matches line breaks, which is what lets a message span multiple lines. It deliberately does not use MULTILINE mode: with that flag, $ would also match at the end of every line, and the lazy (.*?) would stop after the first line of each message. The pattern stops consuming a given message when it sees either the start of the next message (specifically, its timestamp) or the end of the input.
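
As a follow-up sketch (not part of the original answer), the same idea can be written with named groups, and the date and time can be converted into a datetime object. The names HEADER, MESSAGE_RE and parse_log are hypothetical. The format string assumes the German day.month.year layout with a two-digit year shown in the question. This version uses re.DOTALL only and \Z (end of input) as the final alternative in the lookahead.

```python
import re
from datetime import datetime

# Hypothetical names; the pattern follows the message layout from the question.
HEADER = r"\[\d{2}\.\d{2}\.\d{2}, \d{2}:\d{2}:\d{2}\]"
MESSAGE_RE = re.compile(
    r"\[(?P<date>\d{2}\.\d{2}\.\d{2}), (?P<time>\d{2}:\d{2}:\d{2})\] "
    r"(?P<sender>[^:]+): "
    r"(?P<text>.*?)"                 # lazy, so it stops at the first lookahead hit
    r"(?=" + HEADER + r"|\Z)",       # next message header, or end of input
    re.DOTALL,                       # let '.' match line breaks inside a message
)


def parse_log(log):
    """Return one dict per message, with the timestamp parsed into a datetime."""
    messages = []
    for m in MESSAGE_RE.finditer(log):
        # Assumption: dates are day.month.year with a two-digit year.
        stamp = datetime.strptime(
            m.group("date") + " " + m.group("time"), "%d.%m.%y %H:%M:%S"
        )
        messages.append(
            {"when": stamp, "sender": m.group("sender"), "text": m.group("text").rstrip()}
        )
    return messages


log = (
    "[19.09.17, 19:54:48] Marc: If the mean is not in the thousands, there's the problem\n"
    "[19.09.17, 19:54:59] Joe: > mean(aging$Population)\n"
    "[1] 1593.577\n"
    "Is what I get as solution"
)
messages = parse_log(log)
for entry in messages:
    print(entry["when"].isoformat(), entry["sender"], repr(entry["text"]))
```

Named groups keep the calling code readable when the pattern changes, and a real datetime makes it possible to sort or filter messages by time.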

Piggybacking on the answer above, just in case: I only trimmed the regex from Tim Biegeleisen's answer.

results = re.findall(r"\[(\d{2}\.\d{2}\.\d{2}), (\d{2}:\d{2}:\d{2})\] ([^:]+): (.*?)(?=\[\d{2}\.\d{2}\.\d{2}, \d{2}:\d{2}:\d{2}\])", msg, re.MULTILINE|re.DOTALL)
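
One thing worth checking with the trimmed version: its lookahead requires a following timestamp, so the last message in a log, which has none after it, is not returned. A minimal sketch on the sample data from the question (the names body, next_header, trimmed and with_end are only for illustration):

```python
import re

log = (
    "[19.09.17, 19:54:48] Marc: If the mean is not in the thousands, there's the problem\n"
    "[19.09.17, 19:54:59] Joe: > mean(aging$Population)\n"
    "[1] 1593.577\n"
    "Is what I get as solution"
)

# Shared part of both patterns: date, time, sender and lazy message text.
body = r"\[(\d{2}\.\d{2}\.\d{2}), (\d{2}:\d{2}:\d{2})\] ([^:]+): (.*?)"
next_header = r"\[\d{2}\.\d{2}\.\d{2}, \d{2}:\d{2}:\d{2}\]"

# Trimmed pattern: a message only matches if another timestamp follows it.
trimmed = re.findall(body + r"(?=" + next_header + r")", log, re.MULTILINE | re.DOTALL)
# Same pattern with an end-of-input alternative (\Z), so the final message is kept.
with_end = re.findall(body + r"(?=" + next_header + r"|\Z)", log, re.DOTALL)

print([m[2] for m in trimmed])   # ['Marc']
print([m[2] for m in with_end])  # ['Marc', 'Joe']
```

If the final message matters, keeping an end-of-input alternative in the lookahead avoids silently dropping it.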
