Getting information from Beautiful Soup and putting it into a text file?

Question

I have started to learn how to scrape information from websites using urllib and BeautifulSoup. I want to grab all the text from this page (the URL is in the code) and put it into a text file.

import urllib
from bs4 import BeautifulSoup as Soup
base_url = "http://www.galactanet.com/oneoff/theegg_mod.html"



url = (base_url)
soup = Soup(urllib.urlopen(url))

print(soup.get_text())

When I run this, it grabs the text, but it prints spaces between all the letters and still shows me some HTML. I'm not sure why.

i   n   '   >      Y   u   p   .       B   u   t       d   o   n      t       f   e   e

Like that. Any ideas?

Also, what would I need to do to put this text into a text file?

(Using beautifulsoup4, running Ubuntu 12.04 and Python 2.7)

Thank you :)

Answer 1

You could try using html2text:

import html2text as htmlconverter

# html2text() takes an HTML string and returns plain (Markdown-style) text
print(htmlconverter.html2text('<HTML><BODY>HI</BODY></HTML>'))
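
If html2text is not installed, a rough standard-library stand-in can pull out the text nodes instead. This is a sketch of a different technique, not html2text itself (which produces Markdown-formatted text). TextExtractor and html_to_text are hypothetical names, and the only markup it treats specially is the content of script and style tags, which it skips:

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside script/style blocks
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the text nodes of an HTML string, one per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


print(html_to_text("<HTML><BODY>HI</BODY></HTML>"))
```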

Answer 2

I had some trouble with the encoding, so I changed your code slightly, then added a piece that writes the results to a file:

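import urllib
from bs4 import BeautifulSoup as Soup

base_url = "http://www.galactanet.com/oneoff/theegg_mod.html"

# Download the page (urllib.urlopen is the Python 2.7 call)
content = urllib.urlopen(base_url)
soup = Soup(content)
# print(soup.original_encoding)  # uncomment to see the encoding bs4 detected

# get_text() returns unicode, so encode it to bytes before writing
theegg_text = soup.get_text().encode("windows-1252")

# The with block closes the file even if the write fails
with open("somefile.txt", "w") as f:
    f.write(theegg_text)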
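
The spaced-out letters in the question are consistent with UTF-16 bytes being decoded as a single-byte encoding, which leaves a NUL character after every letter. That is an assumption: the thread only says there was "some trouble with the encoding". Below is a minimal sketch of that effect, followed by writing the correctly decoded text to a file with an explicit encoding. It uses only the standard library and should run on Python 2.7 and 3. The sample bytes, decode_page, save_text, and the output path are all hypothetical:

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Hypothetical stand-in for the downloaded bytes: a fragment encoded as
# UTF-16 with a byte order mark (BOM).
raw = u"<p>Yup. But don't feel bad.</p>".encode("utf-16")

# A single-byte codec keeps every NUL byte, so the letters look spaced out.
wrong = raw.decode("latin-1")


def decode_page(data):
    """Decode as UTF-16 when a BOM is present, otherwise as UTF-8."""
    if data[:2] in (b"\xff\xfe", b"\xfe\xff"):
        return data.decode("utf-16")
    return data.decode("utf-8")


def save_text(text, path, encoding="utf-8"):
    """Write unicode text to path; io.open works on Python 2.7 and 3."""
    with io.open(path, "w", encoding=encoding) as f:
        f.write(text)


right = decode_page(raw)
out_path = os.path.join(tempfile.gettempdir(), "somefile.txt")
save_text(right, out_path)

print(repr(wrong[:12]))  # shows the NUL characters between letters
print(right)
```

The sketch writes UTF-8 because it can represent any character. The answer above encodes to windows-1252 instead, which raises UnicodeEncodeError if the text contains a character outside that code page.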

Answer 1
0 2012-10-17 23:42:34

Answer 2
0 Accepted 2012-10-17 23:56:58
