Extract text from an .html file, strip the HTML, and write it to a text file using Python and Beautiful Soup

Question

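I am using Beautiful Soup 4 to extract text from an HTML file, and get_text() makes it easy to pull out just the text. Now I am trying to write that text to a plain text file, and when I do, I get the message "416". Here is the code I am using: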

from bs4 import BeautifulSoup
markup = open("example1.html")
soup = BeautifulSoup(markup)
f = open("example.txt", "w")
f.write(soup.get_text())

The console prints 416, but nothing is written to the text file. Where am I going wrong?

Answer 1

You need to pass the text to the BeautifulSoup class. Perhaps try markup.read():

from bs4 import BeautifulSoup

markup = open("example1.html")
soup = BeautifulSoup(markup.read(), "html.parser")  # name the parser explicitly
markup.close()

f = open("example.txt", "w")
f.write(soup.get_text())
f.close()  # closing flushes the buffered text to disk
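
A note on the 416 itself: in Python 3, write() returns the number of characters written, and the interactive console echoes that return value, so the 416 is most likely not an error. The text stays in a buffer until the file is flushed or closed, which is probably why the original script left an empty file and why the f.close() call above matters. A minimal standard-library sketch of that behaviour follows; the scratch file name demo.txt is an assumption made only for this illustration.

```python
import os
import tempfile

# Hypothetical scratch file, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

f = open(path, "w", encoding="utf-8")
written = f.write("hello world")  # write() returns the character count
print(written)

# The text may still be sitting in Python's buffer at this point.
f.close()  # closing the file flushes the buffer to disk

# Read the file back to confirm the text arrived.
with open(path, encoding="utf-8") as check:
    content = check.read()
print(content)
```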

The same script in a more Pythonic style, using with blocks:

from bs4 import BeautifulSoup

with open("example1.html") as markup:
    soup = BeautifulSoup(markup.read(), "html.parser")

with open("example.txt", "w") as f:
    f.write(soup.get_text())  # the file is closed automatically on exit

as @bernie suggested.
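
If the conversion is needed in more than one place, the two with blocks can be wrapped in a small helper. This is only a sketch under stated assumptions: the function name html_to_text, the explicit "html.parser" argument, the UTF-8 encoding, and the separator/strip arguments passed to get_text() are choices made here for illustration and are not part of the original answer.

```python
import os
import tempfile

from bs4 import BeautifulSoup


def html_to_text(src_path, dst_path, encoding="utf-8"):
    """Read HTML from src_path and write its text content to dst_path.

    Hypothetical helper; returns the number of characters written.
    """
    with open(src_path, encoding=encoding) as markup:
        # Naming the parser avoids bs4's "no parser specified" warning.
        soup = BeautifulSoup(markup.read(), "html.parser")

    # Assumed layout: one text fragment per line, whitespace stripped.
    # Note that inline tags such as <b> also start a new fragment.
    text = soup.get_text(separator="\n", strip=True)

    with open(dst_path, "w", encoding=encoding) as out:
        # Leaving the with block flushes and closes the file.
        return out.write(text)


# Usage with throwaway files in a temporary directory.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "example1.html")
dst = os.path.join(workdir, "example.txt")

with open(src, "w", encoding="utf-8") as f:
    f.write("<html><body><h1>Title</h1><p>Hello world</p></body></html>")

count = html_to_text(src, dst)
with open(dst, encoding="utf-8") as f:
    result = f.read()
print(count)
print(result)
```

Returning the character count mirrors what write() itself returns, so a caller can quickly check that something was actually written.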

Solution 1
5 Accepted 2013-04-26 16:52:12
