
Scrape data from multiple webpages using a .txt file that contains the URLs with Python and Beautiful Soup

I have a .txt file that contains full URLs to a number of pages, each of which contains a table I want to scrape data from. My code works for a single URL, but when I try to add a loop and read the URLs in from the .txt file I get the following error:

raise ValueError, "unknown url type: %s" % self.__original
ValueError: unknown url type: ?

Here is my code:

from urllib2 import urlopen
from bs4 import BeautifulSoup as soup

with open('urls.txt', 'r') as f:
    urls = f.read()

for url in urls:

    uClient = urlopen(url)
    page_html = uClient.read()
    uClient.close()

    page_soup = soup(page_html, "html.parser")

    containers = page_soup.findAll("tr", {"class":"data"})


    for container in containers:
        unform_name = container.findAll("th", {"width":"30%"})
        name = unform_name[0].text.strip()

        unform_delegate = container.findAll("td", {"id":"y000"})
        delegate = unform_delegate[0].text.strip()

        print(name)
        print(delegate)

f.close()

I've checked the .txt file and all the entries look fine. They start with HTTP: and end with .html. There are no apostrophes or quotation marks around them. Have I coded the for loop incorrectly?

Using

with open('urls.txt', 'r') as f:
    for url in f:
        print(url)

I get the following:

??http://www.thegreenpapers.com/PCC/AL-D.html

http://www.thegreenpapers.com/PCC/AL-R.html

http://www.thegreenpapers.com/PCC/AK-D.html

And so on for 100 lines. Only the first line has the question marks. My .txt file contains those URLs, with only the state and party abbreviations changing.
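A stray `??` at the start of only the first line is, most likely, a UTF-8 byte-order mark (BOM) at the beginning of `urls.txt`, added by whatever editor saved the file. A minimal sketch of stripping it per line (Python 3 shown, though the question uses Python 2; the helper name `clean_url` is mine, not from the original code):

```python
def clean_url(line):
    # A UTF-8 BOM at the start of a file often prints as '??' or 'ï»¿'.
    # Strip the BOM character plus surrounding whitespace and the newline.
    return line.lstrip('\ufeff').strip()

print(clean_url('\ufeffhttp://www.thegreenpapers.com/PCC/AL-D.html\n'))
# http://www.thegreenpapers.com/PCC/AL-D.html
```

Alternatively, opening the file with the `utf-8-sig` codec (`open('urls.txt', encoding='utf-8-sig')` in Python 3, or `codecs.open('urls.txt', encoding='utf-8-sig')` in Python 2) removes the BOM automatically.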

The approach you tried can be fixed by tweaking two different lines in your code.

Try this:

with open('urls.txt', 'r') as f:
    urls = f.readlines()   #make sure this line is properly indented.
for url in urls:
    uClient = urlopen(url.strip())

You can't read the whole file into a single string with 'f.read()' and then iterate over that string. To fix the problem, see the changes below. I also removed your last line: when you use a 'with' statement, it closes the file for you when the block completes.
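To see concretely why the original loop fails: iterating over the string returned by `f.read()` yields one character at a time, so `urlopen()` was handed `'h'`, then `'t'`, and so on, which raises "unknown url type". Iterating the file object (or the list from `readlines()`) yields whole lines. A quick illustration:

```python
text = "http://example.com/a.html\nhttp://example.com/b.html\n"

# Iterating a string yields single characters -- this is what urlopen() saw.
chars = [c for c in text][:4]
print(chars)  # ['h', 't', 't', 'p']

# splitlines() (or iterating the open file object) yields whole lines.
lines = text.splitlines()
print(lines)  # ['http://example.com/a.html', 'http://example.com/b.html']
```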

The code below uses Greg Hewgill's (Python 2) snippet to show whether each url string is of type 'str' or 'unicode':

from urllib2 import urlopen
from bs4 import BeautifulSoup as soup

# Code from Greg Hewgill
def whatisthis(s):
    if isinstance(s, str):
        print "ordinary string"
    elif isinstance(s, unicode):
        print "unicode string"
    else:
        print "not a string"

with open('urls.txt', 'r') as f:
    for url in f:
        print(url)
        whatisthis(url)
        uClient = urlopen(url)
        page_html = uClient.read()
        uClient.close()

        page_soup = soup(page_html, "html.parser")

        containers = page_soup.findAll("tr", {"class":"data"})

        for container in containers:
            unform_name = container.findAll("th", {"width":"30%"})
            name = unform_name[0].text.strip()

            unform_delegate = container.findAll("td", {"id":"y000"})
            delegate = unform_delegate[0].text.strip()

            print(name)
            print(delegate)

Running this against a text file containing the URLs listed above produces the following output:

http://www.thegreenpapers.com/PCC/AL-D.html

ordinary string
Gore, Al
54.   84%
Uncommitted
10.   16%
LaRouche, Lyndon

http://www.thegreenpapers.com/PCC/AL-R.html

ordinary string
Bush, George W.
44.  100%
Keyes, Alan

Uncommitted

http://www.thegreenpapers.com/PCC/AK-D.html
ordinary string
Gore, Al
13.   68%
Uncommitted
6.   32%
Bradley, Bill
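As a side note, the table-parsing portion of the loop can be pulled into a function and checked against a static HTML fragment, independent of the network. This is a sketch assuming the row structure implied by the question's selectors (`tr.data` rows containing a `th width="30%"` name cell and a `td id="y000"` delegate cell); the function name `parse_delegates` is mine:

```python
from bs4 import BeautifulSoup

def parse_delegates(page_html):
    # Return (name, delegate) pairs from rows matching the question's selectors.
    page_soup = BeautifulSoup(page_html, "html.parser")
    results = []
    for container in page_soup.find_all("tr", {"class": "data"}):
        names = container.find_all("th", {"width": "30%"})
        delegates = container.find_all("td", {"id": "y000"})
        if names and delegates:
            results.append((names[0].text.strip(), delegates[0].text.strip()))
    return results

sample = ('<table><tr class="data"><th width="30%">Gore, Al</th>'
          '<td id="y000">54. 84%</td></tr></table>')
print(parse_delegates(sample))  # [('Gore, Al', '54. 84%')]
```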
