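Split a string by regex and keep the separator as part of the items in Python
I want to split a WhatsApp chat backup text by date and keep the date as part of each message. I tried, but could not get the exact result I want. It would be a great help if someone could suggest a way to achieve this. (I don't know much about regular expressions.)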
import re
chat = '''27/01/2019, 08:58 - Member 01 created group "Python Lovers ❤️"
27/01/2019, 08:58 - You were added
19/03/2019, 19:29 - Member 02: Hello guys,,,
19/03/2019, 19:29 - Member 03: Hi there..'''
regex = r"(\b\d+/\d+/\d+.*?(?=\b\d+/\d+/\d+|$)*)"
results = re.split(regex, chat)
print(results)
The code above works and keeps the separator as a separate item, but I want it to be part of its corresponding message (item):
Current result:
['27/01/2019',
'08:58 - You were added',
'19/03/2019',
'19:29 - Member 02: Hello guys,,',
'19/03/2019',
'19:29 - Member 03: Hi there..']
What I want:
['27/01/2019, 08:58 - You were added',
'19/03/2019, 19:29 - Member 02: Hello guys',
'19/03/2019, 19:29 - Member 03: Hi there..']
You could try a solution based on the regex module from PyPI:
import regex as re
chat = '''27/01/2019, 08:58 - Member 01 created group "Python Lovers ❤️"
27/01/2019, 08:58 - You were added
19/03/2019, 19:29 - Member 02: Hello guys,,,
19/03/2019, 19:29 - Member 03: Hi there..'''
pat = r'(?V1)\n*(?=\d{2}/\d{2}/\d{4})'
results = re.split(pat, chat)
print(results[1:])
Output:
['27/01/2019, 08:58 - Member 01 created group "Python Lovers \xe2\x9d\xa4\xef\xb8\x8f"', '27/01/2019, 08:58 - You were\nadded', '19/03/2019, 19:29 - Member 02: Hello guys,,,', '19/03/2019, 19:29 - Member 03: Hi there..']
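The (?V1) flag makes zero-width matches work properly. \n*(?=\d{2}/\d{2}/\d{4}) matches any newlines that precede a date; because the date itself sits inside a lookahead, it is not consumed and stays in the result. results[1:] drops the empty item at the start of the list.
The separators end up as separate items because you use re.split with a capturing group, which keeps the captured chunks as separate items in the resulting list.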
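To illustrate the point about re.split and captured groups, here is a minimal sketch that uses only the standard re module. The sample string and the simplified patterns are assumptions chosen for illustration, not the original poster's exact regex. A capturing group around the date turns it into a separate list item, while a lookahead leaves the date attached to its message.

```python
import re

# Illustrative sample (assumed), one message per line.
chat = (
    "27/01/2019, 08:58 - You were added\n"
    "19/03/2019, 19:29 - Member 02: Hello guys,,,\n"
    "19/03/2019, 19:29 - Member 03: Hi there.."
)

# A capturing group in the split pattern: each captured date is returned
# as its own list item, detached from the message that follows it.
with_group = re.split(r"(\d{2}/\d{2}/\d{4})", chat)

# A lookahead consumes nothing: only the newline is removed, so each date
# stays at the start of its message.
with_lookahead = re.split(r"\n(?=\d{2}/\d{2}/\d{4})", chat)

print(with_group)
print(with_lookahead)
```

Because the second pattern always consumes a newline, it never produces an empty match, so it does not rely on how a given regex engine treats zero-width splits.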
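Your regex only makes sense if your matches can span multiple lines; otherwise, it would be enough to extract every line that starts with a date-like pattern.
That is why I suggest: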
regex = r"\b\d+/\d+/\d.*?(?=\s*\b\d+/\d+/\d+|$)"
results = re.findall(regex, chat, re.S)
See the Python demo:
import re
chat = '''27/01/2019, 08:58 - Member 01 created group "Python Lovers ❤️"
27/01/2019, 08:58 - You were added
19/03/2019, 19:29 - Member 02: Hello guys,,,
19/03/2019, 19:29 - Member 03: Hi there..'''
regex = r"\b\d+/\d+/\d.*?(?=\s*\b\d+/\d+/\d+|$)"
results = re.findall(regex, chat, re.S)
for r in results:
    print(r)
Output:
27/01/2019, 08:58 - Member 01 created group "Python Lovers ❤️"
27/01/2019, 08:58 - You were added
19/03/2019, 19:29 - Member 02: Hello guys,,,
19/03/2019, 19:29 - Member 03: Hi there..
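Note that there is no redundant capturing group and no * after the positive lookahead, which would make it optional. The \s* pattern in the lookahead strips the trailing whitespace from each match. The re.S flag lets . match any character, including line break characters.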
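The effect of re.S is easiest to see on a message that spans more than one line. The sketch below reuses the regex from this answer on an assumed sample in which the second message continues on a second line; the sample text is illustrative only.

```python
import re

# Illustrative sample (assumed): the second message spans two lines.
chat = (
    "27/01/2019, 08:58 - You were added\n"
    "19/03/2019, 19:29 - Member 02: Hello guys,,,\n"
    "second line\n"
    "19/03/2019, 19:29 - Member 03: Hi there.."
)

regex = r"\b\d+/\d+/\d.*?(?=\s*\b\d+/\d+/\d+|$)"

# With re.S, "." also matches "\n", so a message may continue across lines
# until the next date (or the end of the string).
with_dotall = re.findall(regex, chat, re.S)

# Without re.S, "." stops at the line break, so the two-line message
# cannot reach the next date and is not matched at all.
without_dotall = re.findall(regex, chat)

print(len(with_dotall), len(without_dotall))
for message in with_dotall:
    print(repr(message))
```

In this sample the first call returns all three messages, with the two-line message kept whole, while the second call silently drops it. That is the reason to keep the flag when messages may contain line breaks.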