Scrape data from multiple webpages using a .txt file that contains the URLs with Python and beautiful soup
I have a .txt file that contains the full URLs to a number of pages, each of which contains a table I want to scrape data from. My code works for a single URL, but when I add a loop and read the URLs in from the .txt file, I get the following error:

    raise ValueError, "unknown url type: %s" % self.__original
    ValueError: unknown url type: ?
Here is my code:
    from urllib2 import urlopen
    from bs4 import BeautifulSoup as soup

    with open('urls.txt', 'r') as f:
        urls = f.read()
        for url in urls:
            uClient = urlopen(url)
            page_html = uClient.read()
            uClient.close()
            page_soup = soup(page_html, "html.parser")
            containers = page_soup.findAll("tr", {"class":"data"})
            for container in containers:
                unform_name = container.findAll("th", {"width":"30%"})
                name = unform_name[0].text.strip()
                unform_delegate = container.findAll("td", {"id":"y000"})
                delegate = unform_delegate[0].text.strip()
                print(name)
                print(delegate)
    f.close()
I have checked the .txt file and all the entries look fine. They start with http: and end with .html. There are no apostrophes or quotation marks around them. Is my for loop coded wrong?
Using

    with open('urls.txt', 'r') as f:
        for url in f:
            print(url)
I get the following:

    ??http://www.thegreenpapers.com/PCC/AL-D.html
    http://www.thegreenpapers.com/PCC/AL-R.html
    http://www.thegreenpapers.com/PCC/AK-D.html
And so on for 100 lines. Only the first line has the question marks. My .txt file contains those URLs, with only the state and party abbreviations changing.
Your attempt can be fixed by tweaking two different lines in your code.
Try this:

    with open('urls.txt', 'r') as f:
        urls = f.readlines()  # make sure this line is properly indented
        for url in urls:
            uClient = urlopen(url.strip())
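As for the stray ?? on the first printed line: that pattern is typical of a UTF-8 byte-order mark at the start of the file, which `.strip()` does not remove. A minimal sketch (in Python 3 syntax; the file name is hypothetical) showing how the `utf-8-sig` codec strips it automatically:

```python
import io
import os

# Write a demo file that starts with a BOM, mimicking the symptom above.
with io.open("bom_demo.txt", "w", encoding="utf-8") as f:
    f.write(u"\ufeffhttp://www.thegreenpapers.com/PCC/AL-D.html\n")

# Reading with plain utf-8 keeps the BOM as \ufeff on the first line.
with io.open("bom_demo.txt", encoding="utf-8") as f:
    raw_first = f.readline()

# Reading with utf-8-sig strips the BOM, leaving a clean URL.
with io.open("bom_demo.txt", encoding="utf-8-sig") as f:
    clean_first = f.readline().strip()

print(raw_first.startswith(u"\ufeff"))  # True
print(clean_first)
os.remove("bom_demo.txt")
```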
You can't read the whole file into a string with 'f.read()' and then iterate over that string. To fix the problem, see the changes below. I also removed your last line: when you use a 'with' statement, the file is closed automatically when the block completes.
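To see why the original loop fails, note that iterating over the string returned by f.read() yields one character at a time, so urlopen receives a one-character "URL" (which matches the single-character ? in the error message). A sketch in Python 3 syntax, with the file contents simulated via io.StringIO (the sample URLs are from the question):

```python
import io

# Stand-in for the contents of urls.txt.
data = ("http://www.thegreenpapers.com/PCC/AL-D.html\n"
        "http://www.thegreenpapers.com/PCC/AL-R.html\n")

# f.read() returns one big string; iterating it yields characters.
first_items = [c for c in io.StringIO(data).read()][:4]
print(first_items)  # ['h', 't', 't', 'p']

# f.readlines() yields whole lines; strip() removes the trailing
# newline so each URL can be passed to urlopen cleanly.
urls = [line.strip() for line in io.StringIO(data).readlines()]
print(urls)
```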
The code below includes a (Python 2) helper from Greg Hewgill that shows whether the url string is of type 'str' or 'unicode'.
    from urllib2 import urlopen
    from bs4 import BeautifulSoup as soup

    # Code from Greg Hewgill
    def whatisthis(s):
        if isinstance(s, str):
            print "ordinary string"
        elif isinstance(s, unicode):
            print "unicode string"
        else:
            print "not a string"

    with open('urls.txt', 'r') as f:
        for url in f:
            print(url)
            whatisthis(url)
            uClient = urlopen(url)
            page_html = uClient.read()
            uClient.close()
            page_soup = soup(page_html, "html.parser")
            containers = page_soup.findAll("tr", {"class":"data"})
            for container in containers:
                unform_name = container.findAll("th", {"width":"30%"})
                name = unform_name[0].text.strip()
                unform_delegate = container.findAll("td", {"id":"y000"})
                delegate = unform_delegate[0].text.strip()
                print(name)
                print(delegate)
Running the code with a text file containing the URLs listed above produces this output:
    http://www.thegreenpapers.com/PCC/AL-D.html
    ordinary string
    Gore, Al
    54. 84%
    Uncommitted
    10. 16%
    LaRouche, Lyndon
    http://www.thegreenpapers.com/PCC/AL-R.html
    ordinary string
    Bush, George W.
    44. 100%
    Keyes, Alan
    Uncommitted
    http://www.thegreenpapers.com/PCC/AK-D.html
    ordinary string
    Gore, Al
    13. 68%
    Uncommitted
    6. 32%
    Bradley, Bill
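For completeness, the same row extraction can be sketched without BeautifulSoup, using only the standard library's html.parser. This is a sketch, not the answer's bs4 code, and the sample HTML below is a hypothetical fragment shaped like the pages' tables, not actual page content:

```python
from html.parser import HTMLParser

# Collect the text of <th>/<td> cells inside rows with class="data".
class RowParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_data_row = False
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "tr" and attrs.get("class") == "data":
            self.in_data_row = True
        elif tag in ("th", "td") and self.in_data_row:
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr":
            self.in_data_row = False
        elif tag in ("th", "td"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell and data.strip():
            self.cells.append(data.strip())

# Hypothetical table fragment resembling the pages being scraped.
html = ('<table><tr class="data">'
        '<th width="30%">Gore, Al</th>'
        '<td id="y000">54. 84%</td>'
        '</tr></table>')
p = RowParser()
p.feed(html)
print(p.cells)  # ['Gore, Al', '54. 84%']
```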