Reading a txt file from a URL into BeautifulSoup

Question

I have a URL that points to a txt file which contains HTML code. This is a sample link:

http://www.sec.gov/Archives/edgar/data/70858/000119312507058027/0001193125-07-058027.txt

I want to read this HTML with BeautifulSoup, using code like this:

from bs4 import BeautifulSoup
import urllib2

url = "http://www.sec.gov/Archives/edgar/data/70858/000119312507058027/0001193125-07-058027.txt"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
print (soup.prettify())

However, I got a lot of errors like:

File "C:/Users/.../aa.py", line 7, in <module>
    print (soup.prettify())
File "build\bdist.win32\egg\bs4\element.py", line 1097, in prettify
    return self.decode(True, formatter=formatter)

I suspect this happens because the URL points to a txt file rather than an HTML file. Am I right? If so, what is the solution?

Answer 1

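You could try feeding only the HTML part of the text file (from the tag onward) into Beautiful Soup. I suspect it breaks because the beginning of the text file does not contain any HTML.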
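As a rough illustration of that suggestion, here is a minimal sketch that cuts the HTML portion out of the downloaded text before parsing it. It assumes the HTML starts at an <html> tag (the answer does not say which tag it means), that the text has already been fetched and decoded to a string, and that only the first HTML block is wanted. The names extract_html and parse_filing and the sample string are made up for this example.

```python
import re


def extract_html(text):
    """Return the first <html>...</html> block in text, or None if there is no <html> tag.

    Assumption: the HTML part begins at an <html> tag (matched case-insensitively).
    If no closing </html> follows, everything from the opening tag to the end is returned.
    """
    start = re.search(r"<html\b", text, re.IGNORECASE)
    if start is None:
        return None
    # Look for the closing tag only after the opening tag.
    closing = re.compile(r"</html\s*>", re.IGNORECASE)
    end = closing.search(text, start.end())
    if end is None:
        return text[start.start():]
    return text[start.start():end.end()]


def parse_filing(text):
    """Parse only the HTML portion with Beautiful Soup (falls back to the whole text)."""
    from bs4 import BeautifulSoup  # third-party package, imported only when needed

    html = extract_html(text)
    return BeautifulSoup(html if html is not None else text, "html.parser")


# Made-up sample: some non-HTML header lines followed by an HTML block.
sample = (
    "-----BEGIN PRIVACY-ENHANCED MESSAGE-----\n"
    "header lines that are not HTML\n"
    "<HTML><BODY><P>Hello</P></BODY></HTML>\n"
    "trailing text\n"
)

print(extract_html(sample))  # prints <HTML><BODY><P>Hello</P></BODY></HTML>
```

With the real file, you would pass the decoded contents of page.read() to parse_filing instead of the sample string. A file holding several HTML blocks would need a loop over all matches rather than just the first one.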

solution1
1 ACCEPTED 2015-02-04 21:08:23
