
Scrape data from multiple webpages using a .txt file that contains the URLs with Python and Beautiful Soup

I have a .txt file that contains the complete URLs to a number of pages, each of which contains a table I want to scrape data from. My code works for one URL, but when I try to add a loop and read in the URLs from the .txt file, I get the following error:

raise ValueError, "unknown url type: %s" % self.__original
ValueError: unknown url type: ?

Here's my code:

from urllib2 import urlopen
from bs4 import BeautifulSoup as soup

with open('urls.txt', 'r') as f:
    urls = f.read()

for url in urls:

    uClient = urlopen(url)
    page_html = uClient.read()
    uClient.close()

    page_soup = soup(page_html, "html.parser")

    containers = page_soup.findAll("tr", {"class":"data"})


    for container in containers:
        unform_name = container.findAll("th", {"width":"30%"})
        name = unform_name[0].text.strip()

        unform_delegate = container.findAll("td", {"id":"y000"})
        delegate = unform_delegate[0].text.strip()

        print(name)
        print(delegate)

f.close()

I've checked my .txt file and all the entries are normal. They start with HTTP: and end with .html. There are no apostrophes or quotes around them. Am I coding the for loop incorrectly?

Using

with open('urls.txt', 'r') as f:
    for url in f:
        print(url)

I get the following:

??http://www.thegreenpapers.com/PCC/AL-D.html

http://www.thegreenpapers.com/PCC/AL-R.html

http://www.thegreenpapers.com/PCC/AK-D.html

And so forth for 100 lines. Only the first line has question marks. My .txt file contains those URLs, with only the state and party abbreviations changing.

The way you have tried can be fixed by tweaking two different lines in your code.

Try this:

with open('urls.txt', 'r') as f:
    urls = f.readlines()   #make sure this line is properly indented.
for url in urls:
    uClient = urlopen(url.strip())
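
As a quick illustration of why the .strip() matters: readlines() keeps the trailing newline on every entry, and urlopen would otherwise receive it as part of the URL. In this sketch, io.StringIO stands in for the real urls.txt file:

```python
import io

# io.StringIO stands in for open('urls.txt', 'r') in this sketch
f = io.StringIO(u"http://www.thegreenpapers.com/PCC/AL-D.html\n"
                u"http://www.thegreenpapers.com/PCC/AL-R.html\n")
urls = f.readlines()

print(repr(urls[0]))    # the trailing '\n' is still attached
print(urls[0].strip())  # clean URL, safe to hand to urlopen
```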

You can't read the whole file into a single string with f.read() and then iterate over that string: iterating over a string yields individual characters, not lines. To resolve this, see the change above. I also removed your last line; when you use the with statement, the file is closed automatically when the block finishes.
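
A minimal sketch of the difference: the single characters below are what your loop was feeding to urlopen one at a time, which is why it complained about an unknown url type.

```python
text = ("http://www.thegreenpapers.com/PCC/AL-D.html\n"
        "http://www.thegreenpapers.com/PCC/AL-R.html\n")

# Iterating over the string returned by f.read() yields one character
# at a time -- the first "URL" passed to urlopen is just 'h' (or a
# stray invisible character at the start of the file).
print([c for c in text][:4])  # ['h', 't', 't', 'p']

# Iterating over the file object itself (or over readlines()) yields
# whole lines instead.
print(text.splitlines())
```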

Code from Greg Hewgill (for Python 2) shows whether the url string is of type 'str' or 'unicode':

from urllib2 import urlopen
from bs4 import BeautifulSoup as soup

# Code from Greg Hewgill
def whatisthis(s):
    if isinstance(s, str):
        print "ordinary string"
    elif isinstance(s, unicode):
        print "unicode string"
    else:
        print "not a string"

with open('urls.txt', 'r') as f:
    for url in f:
        print(url)
        whatisthis(url)
        uClient = urlopen(url)
        page_html = uClient.read()
        uClient.close()

        page_soup = soup(page_html, "html.parser")

        containers = page_soup.findAll("tr", {"class":"data"})

        for container in containers:
            unform_name = container.findAll("th", {"width":"30%"})
            name = unform_name[0].text.strip()

            unform_delegate = container.findAll("td", {"id":"y000"})
            delegate = unform_delegate[0].text.strip()

            print(name)
            print(delegate)

Running the code with a text file containing the URLs listed above produces this output:

http://www.thegreenpapers.com/PCC/AL-D.html

ordinary string
Gore, Al
54.   84%
Uncommitted
10.   16%
LaRouche, Lyndon

http://www.thegreenpapers.com/PCC/AL-R.html

ordinary string
Bush, George W.
44.  100%
Keyes, Alan

Uncommitted

http://www.thegreenpapers.com/PCC/AK-D.html
ordinary string
Gore, Al
13.   68%
Uncommitted
6.   32%
Bradley, Bill
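
One loose end: the ?? shown before the first URL in the question is most likely a UTF-8 byte order mark (BOM) that some editors write at the start of a file. That is an assumption, since only the raw bytes would confirm it, but it costs nothing to strip it defensively along with the newline:

```python
def clean_url(raw_line):
    # Drop surrounding whitespace (including the trailing newline)
    line = raw_line.strip()
    # Drop a UTF-8 BOM if the file was saved with one; decoded, the
    # BOM is u'\ufeff', which often renders as '??' or similar garbage.
    return line.lstrip(u"\ufeff")

print(clean_url(u"\ufeffhttp://www.thegreenpapers.com/PCC/AL-D.html\n"))
```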
