
Python: deleting specific types of words from text

I would like to delete all the dates, usernames and emoticons from a WhatsApp chat.txt file. The file looks like this:

10/4/19, 7:18 PM - user1: example chat
10/4/19, 7:18 PM - user2: 😂
10/4/19, 7:18 PM - user3: example chat
10/4/19, 7:18 PM - user1: example chat
10/4/19, 7:18 PM - user2: 😂
10/4/19, 7:18 PM - user3: example chat

Is it possible to write a Python script that recognizes the usernames and dates and deletes them, leaving only the chat text? I imagine I should use a regular expression and maybe convert all the text to a string.

Please help

There is a similar question about regex and WhatsApp logs in Python: "Regex to match whatsapp chat log".

This is the pattern from its first answer:


^
(?P<datetime>\d{2}/\d{2}/\d{4}[^-]+)\s+-\s+
(?P<name>[^:]+):\s+
(?P<message>[\s\S]+?)
(?=^\d{2}|\Z)
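As quoted, that pattern expects a two-digit day and month and a four-digit year, so it would not match dates like 10/4/19 from this question. Below is a minimal sketch of an adapted version. The loosened \d{1,2} and \d{2,4} quantifiers, the LINE_RE name and the sample text are assumptions for illustration, not part of the linked answer.

```python
import re

# Adapted from the quoted pattern; the loosened quantifiers are an assumption
# based on the sample lines in the question (e.g. "10/4/19, 7:18 PM").
LINE_RE = re.compile(
    r"""
    ^
    (?P<datetime>\d{1,2}/\d{1,2}/\d{2,4}[^-]+)\s+-\s+
    (?P<name>[^:]+):\s+
    (?P<message>[\s\S]+?)
    (?=^\d{1,2}/\d{1,2}/\d{2,4},|\Z)
    """,
    re.VERBOSE | re.MULTILINE,
)

sample = (
    "10/4/19, 7:18 PM - user1: example chat\n"
    "10/4/19, 7:18 PM - user2: 😂\n"
    "10/4/19, 7:18 PM - user3: a message\nspanning two lines\n"
)

# Keep only the message group; strip the trailing newline each match carries.
messages = [m.group("message").strip() for m in LINE_RE.finditer(sample)]
print(messages)
```

Because the lookahead only stops at a line that starts with a date, a message that spans several lines stays together as one entry.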

A super simple way here would be to iterate line by line and split on ":". If we can assume that every line follows the format "date, time - username: message", we can grab everything after the second ":" (the first one is inside the time).

text = '''10/4/19, 7:18 PM - user1: example chat
10/4/19, 7:18 PM - user2: 😂
10/4/19, 7:18 PM - user3: example chat
10/4/19, 7:18 PM - user1: example chat
10/4/19, 7:18 PM - user2: 😂
10/4/19, 7:18 PM - user3: example chat'''

for message in text.split('\n'):
    # maxsplit=2 keeps any further colons inside the message itself
    print(message.split(':', 2)[2])

Outputs

 example chat
 😂
 example chat
 example chat
 😂
 example chat
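Since the question is about a chat.txt file rather than a string, the same idea can be applied while reading the file. This is only a sketch: the function names and file names are made up for illustration, and it assumes every new message starts with a "date, time - username: " prefix. Lines without " - " are treated as continuations of the previous message, which will misfire if a continuation line itself contains " - ".

```python
def extract_messages(lines):
    """Return the message part of each 'date, time - user: message' line."""
    messages = []
    for line in lines:
        line = line.rstrip("\n")
        # Drop everything up to and including the first " - " (date and time).
        _, sep, rest = line.partition(" - ")
        if not sep:
            # No date prefix: assume this continues the previous message.
            if messages:
                messages[-1] += "\n" + line
            continue
        # Drop the username. Only the first ": " is used as a separator,
        # so colons inside the message text survive.
        _, sep, message = rest.partition(": ")
        if sep:
            messages.append(message)
    return messages


def strip_chat_file(src, dst):
    """Read a chat export from src and write only the message text to dst."""
    with open(src, encoding="utf-8") as f:
        messages = extract_messages(f)
    with open(dst, "w", encoding="utf-8") as f:
        f.write("\n".join(messages) + "\n")
```

Usage would look like strip_chat_file("chat.txt", "messages.txt"), where both file names are placeholders.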

Another way is to build a regular expression for it. The emoji pattern below is borrowed from another source.

import re

str_in = """10/4/19, 7:18 PM - user1: example chat
10/4/19, 7:18 PM - user2: 😂
10/4/19, 7:18 PM - user3: example chat
10/4/19, 7:18 PM - user1: example chat
10/4/19, 7:18 PM - user2: 😂
10/4/19, 7:18 PM - user3: example chat"""

dates_filtered = re.sub(r'(\d+\/\d+\/\d+, \d+:\d+ [AP]M - [ \d\w]+: )', '', str_in)

emoji_pattern = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # flags (iOS)
    "]+",
    flags=re.UNICODE,
)
emoji_filtered = emoji_pattern.sub('', dates_filtered)


blank_lines_filtered = re.sub(r'(\n\s*\n)', '\n', emoji_filtered)

print(str_in)
print('---------')
print(dates_filtered)
print('---------')
print(emoji_filtered)
print('---------')
print(blank_lines_filtered)

prints

10/4/19, 7:18 PM - user1: example chat 
10/4/19, 7:18 PM - user2: 😂
10/4/19, 7:18 PM - user3: example chat 
10/4/19, 7:18 PM - user1: example chat 
10/4/19, 7:18 PM - user2: 😂 
10/4/19, 7:18 PM - user3: example chat
---------
example chat 
😂
example chat 
example chat 
😂 
example chat
---------
example chat
              
example chat
example chat 

example chat
--------- 
example chat
example chat
example chat 
example chat
--------- 
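The three steps can also be wrapped in one function. The sketch below makes two changes that are my own assumptions rather than part of the answer: the prefix pattern is anchored to the start of each line, so a date that appears inside a message is left alone, and empty lines are dropped with a plain line filter instead of a third regex. The names PREFIX_RE, EMOJI_RE and clean_chat are hypothetical.

```python
import re

# Date/time/username prefix, anchored to the start of each line.
# The format is assumed from the sample lines in the question.
PREFIX_RE = re.compile(
    r"^[ \t]*\d+/\d+/\d+, \d+:\d+ [AP]M - [^:]+: ",
    re.MULTILINE,
)

# Same emoji ranges as in the answer; they do not cover every emoji block.
EMOJI_RE = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # flags
    "]+"
)


def clean_chat(text):
    """Strip prefixes and emoji, then drop lines that end up empty."""
    text = PREFIX_RE.sub("", text)
    text = EMOJI_RE.sub("", text)
    lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)
```

Calling clean_chat(str_in) on the sample string should leave only the four "example chat" lines.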

You can also use list comprehension:

print([message.split(':', 2)[2] for message in text.split('\n')])

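Here is another approach using re.findall:

import re

sentence = '10/4/19, 7:18 PM - user1: example chat 10/4/19, 7:18 PM - user2: 😂 10/4/19, 7:18 PM - user3: example chat 10/4/19, 7:18 PM - user1: example chat 10/4/19, 7:18 PM - user2: 😂 10/4/19, 7:18 PM - user3: example chat'

# Capture the text after each "- userN: " up to the next date or the end of the string.
# Note: user\d only matches names like user1; adjust it for real usernames.
chat = re.findall(r'-\suser\d:\s(.*?)(?= \d+/\d+/\d+,|$)', sentence)

print(chat)

output:

['example chat', '😂', 'example chat', 'example chat', '😂', 'example chat']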
