Removing HTML tags from output

Question

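I'm new to Python and can't get the HTML tags out of my output. I want to remove the tags and the content inside them. I also want to remove the p tags. Any suggestions?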

import urllib2
from bs4 import BeautifulSoup

# Ask user to enter URL
url = raw_input("Please enter a valid URL: ")

# Make sure file is clear for new content
open('ctp_output.txt', 'w').close()

# Open txt document for output
txt = open('ctp_output.txt', 'w')

# Parse HTML of article, aka making soup
soup = BeautifulSoup(urllib2.urlopen(url).read())

# Retrieve all of the paragraph tags and write each one to the file
tags = soup('p')
for tag in tags:
    txt.write(str(tag) + '\n' + '\n')

# Close txt file with new content added
txt.close()

Answer 1

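Retrieve the text portion of each tag with the get_text() function instead of using the tag's string representation ( str(tag) ), which still includes the markup.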

In the code above, replace the following line:

txt.write(str(tag) + '\n' + '\n')

with:

txt.write(tag.get_text() + '\n' + '\n')
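
For anyone who wants to see the difference without fetching a page, here is a minimal sketch. The sample markup is made up for illustration, and it assumes Python 3 with the built-in "html.parser" backend, whereas the question's code is Python 2. The last part is only a guess at what "remove the tags and the content inside them" means: the question doesn't say which tags, so a is used as a hypothetical example, removed with decompose().

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the downloaded article.
html = "<p>Hello <b>world</b>.</p><p>See <a href='#'>this link</a> now.</p>"
soup = BeautifulSoup(html, "html.parser")

paragraphs = soup.find_all("p")

# str(tag) keeps the markup, which is what ended up in the output file.
as_markup = [str(tag) for tag in paragraphs]

# get_text() returns only the text inside each tag, nested tags included.
as_text = [tag.get_text() for tag in paragraphs]

# Assumption: the unwanted tags are links. decompose() removes a tag
# together with everything inside it, modifying the soup in place.
for link in soup.find_all("a"):
    link.decompose()
without_links = [tag.get_text() for tag in soup.find_all("p")]

print(as_markup)
print(as_text)
print(without_links)
```

Note that decompose() changes the soup itself, so call get_text() first if the original text is also needed.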

Answer 1: accepted, 2014-02-25 23:36:45 (score shown: 0)
