
Split a string by regex and keep the separator as part of the items in Python

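I want to split a WhatsApp chat backup text by date and keep each date as part of its message. I have tried, but I cannot get the exact result I want. It would be a great help if someone could suggest a way to achieve this. (I don't know much about regular expressions.)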

import re

chat = '''27/01/2019, 08:58 - Member 01 created group "Python Lovers ❤️"
27/01/2019, 08:58 - You were added
19/03/2019, 19:29 - Member 02: Hello guys,,,
19/03/2019, 19:29 - Member 03: Hi there..'''

regex = r"(\b\d+/\d+/\d+.*?(?=\b\d+/\d+/\d+|$)*)"
results = re.split(regex, chat)
print(results)

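The code above does the job but keeps the separator as a separate item. I want the separator to be part of its corresponding message (item) instead:

Current result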

['27/01/2019', 
'08:58 - You were added',
'19/03/2019', 
'19:29 - Member 02: Hello guys,,', 
'19/03/2019', 
'19:29 - Member 03: Hi there..']

What I want

['27/01/2019, 08:58 - You were added',
'19/03/2019, 19:29 - Member 02: Hello guys', 
'19/03/2019, 19:29 - Member 03: Hi there..']

Would you please try a solution based on the PyPI regex module:

import regex as re

chat = '''27/01/2019, 08:58 - Member 01 created group "Python Lovers ❤️"
27/01/2019, 08:58 - You were added
19/03/2019, 19:29 - Member 02: Hello guys,,,
19/03/2019, 19:29 - Member 03: Hi there..'''

pat = r'(?V1)\n*(?=\d{2}/\d{2}/\d{4})'
results = re.split(pat, chat)
print(results[1:])

Output:

['27/01/2019, 08:58 - Member 01 created group "Python Lovers \xe2\x9d\xa4\xef\xb8\x8f"', '27/01/2019, 08:58 - You were\nadded', '19/03/2019, 19:29 - Member 02: Hello guys,,,', '19/03/2019, 19:29 - Member 03: Hi there..']
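  • The (?V1) flag makes zero-width matches work properly when splitting.
  • The delimiter pattern \n*(?=\d{2}/\d{2}/\d{4}) matches any newlines that precede a date. Because the date sits inside a lookahead, it is not consumed and stays in the resulting item.
  • results[1:] removes the empty item at the start of the list.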
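
The same lookahead idea can also be tried with the standard-library re module if the pattern is required to consume at least one newline, so that no zero-width split is needed. The following is a minimal sketch rather than part of the answer above; the shortened chat sample (with one message deliberately wrapped onto a second line) and the \n+ variant of the pattern are assumptions for illustration.

```python
import re

# Hypothetical sample: the first message wraps onto a second line.
chat = '''27/01/2019, 08:58 - You were
added
19/03/2019, 19:29 - Member 02: Hello guys,,,
19/03/2019, 19:29 - Member 03: Hi there..'''

# Split on newline(s) that are immediately followed by a dd/mm/yyyy date.
# The lookahead consumes nothing, so each date stays at the start of its item.
results = re.split(r"\n+(?=\d{2}/\d{2}/\d{4})", chat)

for item in results:
    print(repr(item))
```

Since the pattern never matches at the very start of the string, no empty leading item is produced. A message body line that itself begins with a date-like string would still be split off as a new item.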

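This happens because re.split, when the pattern contains a capturing group, saves the captured chunks as separate items in the resulting list.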
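
A small illustration of this behavior, using a toy string that is not taken from the question:

```python
import re

text = "a1b2c"

# Without a capturing group, the separators are dropped.
no_group = re.split(r"\d", text)

# With a capturing group, each captured separator becomes its own list item.
with_group = re.split(r"(\d)", text)

print(no_group)    # ['a', 'b', 'c']
print(with_group)  # ['a', '1', 'b', '2', 'c']
```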

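Your regex only makes sense if your matches can span multiple lines. Otherwise, extracting every line that starts with a date-like pattern would be enough.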

That is why I suggest

regex = r"\b\d+/\d+/\d.*?(?=\s*\b\d+/\d+/\d+|$)"
results = re.findall(regex, chat, re.S)

See the Python demo:

import re

chat = '''27/01/2019, 08:58 - Member 01 created group "Python Lovers ❤️"
27/01/2019, 08:58 - You were added
19/03/2019, 19:29 - Member 02: Hello guys,,,
19/03/2019, 19:29 - Member 03: Hi there..'''

regex = r"\b\d+/\d+/\d.*?(?=\s*\b\d+/\d+/\d+|$)"
results = re.findall(regex, chat, re.S)
for r in results:
    print(r)

Output:

27/01/2019, 08:58 - Member 01 created group "Python Lovers ❤️"
27/01/2019, 08:58 - You were added
19/03/2019, 19:29 - Member 02: Hello guys,,,
19/03/2019, 19:29 - Member 03: Hi there..

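Note that there is no redundant capturing group and no * after the positive lookahead (the * made the lookahead optional). The \s* pattern inside the lookahead strips the trailing whitespace from the end of each match.

The re.S flag allows . to match any character, including line break characters.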
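
For the simpler case mentioned earlier, where no message spans more than one line, a line-based extraction could look like the following. This is a minimal sketch, not part of the original answer; the shortened chat sample and the ^\d+/\d+/\d+.*$ pattern are assumptions for illustration.

```python
import re

# Hypothetical sample in which every message fits on a single line.
chat = '''27/01/2019, 08:58 - You were added
19/03/2019, 19:29 - Member 02: Hello guys,,,
19/03/2019, 19:29 - Member 03: Hi there..'''

# re.M lets ^ and $ match at each line boundary.
# re.S is not used here, so "." stops at the end of the line.
lines = re.findall(r"^\d+/\d+/\d+.*$", chat, re.M)

for line in lines:
    print(line)
```

This variant would cut a message off at its first line break, which is why the re.S version with the lookahead is the safer choice when messages may wrap.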

