How to remove multiple tags from a text file using Python regular expressions

Question

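Newbie here. I'm using Python 3.8.3 and trying to remove tags from the attached text file, listfile.txt.

I want to extract three lists (article titles, publication dates, and body text) with the tags removed. In the code below, I've been able to strip the tags from the titles and publication dates, but I can't correctly remove all the tags from the body text. In the file, the body text starts with the tag <div class="story-element story-element-text"> and ends before the next <h1 class tag.

Any help extracting this part of the text would be greatly appreciated. The article text is in a non-English script, but all the HTML tags are in English.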

import re

# Open the text file, which contains newspaper article information scraped from a website with BeautifulSoup
with open('listfile.txt', 'r', encoding='utf8') as my_file:
    text = my_file.read()
    print(text)

# Remove tags and build a list of article titles
titles = re.findall('<h1.*?>(.*?)</h1>', text)
print(titles)

# Remove tags and build a list of article publication dates
dates = re.findall('<div class="storyPageMetaData-m__publish-time__19bdV"><span>(.*?)</span>', text)
print(dates)

# Remove tags and build a list of article body texts. This is where the code is incorrect:
# the match stops at the first </div>, so nested tags and later paragraphs are not handled
bodytext = re.findall('<div class="story-element story-element-text">(.*?)</div>', text)
print(bodytext)

Answer 1

I think you're using the wrong tool. I'd suggest using bs4 instead; you'll like it, I promise.

from bs4 import BeautifulSoup
raw_html = "YOUR RAW HTML"
soup = BeautifulSoup(raw_html, "html.parser")
titles = [h1_tag.text for h1_tag in soup.select('h1')]
dates = [span_tag.text for span_tag in soup.select('div.storyPageMetaData-m__publish-time__19bdV > span')]
bodytext = [div_tag.text for div_tag in soup.select('div.story-element.story-element-text')]

Enjoy!
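One thing to watch for with the selectors above: if an article has several story-element-text blocks, bodytext will contain more entries than titles, so the three lists no longer line up. The sketch below is one possible way to group everything per article by walking forward from each h1 until the next one. The sample markup, the extract_articles helper, and the DATE_CLASS name are assumptions made for illustration; the real listfile.txt may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical sample standing in for listfile.txt; the real markup may differ.
sample_html = """
<h1 class="headline">Title A</h1>
<div class="storyPageMetaData-m__publish-time__19bdV"><span>10 Mar 2021</span></div>
<div class="story-element story-element-text"><div><p>First paragraph.</p></div></div>
<div class="story-element story-element-text"><div><p>Second paragraph.</p></div></div>
<h1 class="headline">Title B</h1>
<div class="storyPageMetaData-m__publish-time__19bdV"><span>11 Mar 2021</span></div>
<div class="story-element story-element-text"><div><p>Only paragraph.</p></div></div>
"""

DATE_CLASS = "storyPageMetaData-m__publish-time__19bdV"


def extract_articles(raw_html):
    """Return one (title, date, body) tuple per <h1> in the document."""
    soup = BeautifulSoup(raw_html, "html.parser")
    articles = []
    for h1_tag in soup.find_all("h1"):
        date, paragraphs = None, []
        # Walk the tags that follow this <h1> and stop at the next <h1>.
        for tag in h1_tag.find_all_next():
            if tag.name == "h1":
                break
            classes = tag.get("class", [])
            if tag.name == "div" and DATE_CLASS in classes:
                date = tag.get_text(strip=True)
            elif tag.name == "div" and "story-element-text" in classes:
                paragraphs.append(tag.get_text(" ", strip=True))
        articles.append((h1_tag.get_text(strip=True), date, " ".join(paragraphs)))
    return articles


articles = extract_articles(sample_html)
for title, date, body in articles:
    print(title, "|", date, "|", body)
```

Each tuple keeps a title together with its own date and body, which is usually easier to work with than three independent lists.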

Answer 2

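I'm not familiar with how to set up regular expressions in Python, but this works in JavaScript.

If you still want to use RegEx, use this to capture the h1 tags in the text file: <h1(.*?)</h1>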
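For anyone who wants to stay with regular expressions in Python, here is a minimal sketch of how the idea might be carried over. Note that <h1(.*?)</h1> as written also captures the tag's attributes, so the title pattern below skips them with [^>]*. For the body text, re.DOTALL lets "." match newlines, and a lookahead stops each match just before the next <h1 (or at the end of the text), as described in the question. The sample string and the strip_tags helper are assumptions for illustration, not the actual contents of listfile.txt.

```python
import re

# Hypothetical sample standing in for listfile.txt; the real markup may differ.
sample = """<h1 class="headline">Title A</h1>
<div class="storyPageMetaData-m__publish-time__19bdV"><span>10 Mar 2021</span></div>
<div class="story-element story-element-text"><div><p>First
paragraph.</p></div></div>
<div class="story-element story-element-text"><div><p>Second paragraph.</p></div></div>
<h1 class="headline">Title B</h1>
<div class="storyPageMetaData-m__publish-time__19bdV"><span>11 Mar 2021</span></div>
<div class="story-element story-element-text"><div><p>Only paragraph.</p></div></div>
"""

# Capture from the first body <div> of an article up to (not including)
# the next <h1, or to the end of the text for the last article.
body_pattern = re.compile(
    r'<div class="story-element story-element-text">(.*?)(?=<h1\b|\Z)',
    re.DOTALL,
)
tag_pattern = re.compile(r'<[^>]+>')


def strip_tags(fragment):
    """Replace every tag with a space, then collapse runs of whitespace."""
    without_tags = tag_pattern.sub(' ', fragment)
    return ' '.join(without_tags.split())


titles = [strip_tags(t) for t in re.findall(r'<h1\b[^>]*>(.*?)</h1>', sample, re.DOTALL)]
bodies = [strip_tags(b) for b in body_pattern.findall(sample)]

print(titles)
print(bodies)
```

This kind of pattern is fragile: it depends on the exact attribute order and quoting in the class attribute, so an HTML parser is generally the more robust choice when the markup varies.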

Answer 1
0 2021-03-10 01:22:59

Answer 2
0 2021-03-10 02:29:31
