How to extract a specific CSV from a web page's HTML that contains links to multiple CSV files

Question

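I need to extract a CSV file from an HTML page (see below); once I have it, I know how to process it. The code below, adapted from a previous assignment, pulls out one particular line of HTML. The URL is 'https://vincentarelbundock.github.io/Rdatasets/datasets.html'. This is test code, so for now it simply breaks out of the loop when it finds the line. The relevant part of my CSV line is the href, which is csv/datasets/co2.csv (of type unicode, I think).

How do I open co2.csv? Apologies for any formatting problems; the editor has sliced and diced the code.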

import urllib
url = 'https://vincentarelbundock.github.io/Rdatasets/datasets.html'
from BeautifulSoup import *

def scrapper(url,k):
    c=0
    html = urllib.urlopen(url).read() 
    soup = BeautifulSoup(html)
#.    Retrieve all of the anchor tags
    tags = soup('a')
    for tag in tags:
        y= (tag.get('href', None))
        #print ((y))
        if y == 'csv/datasets/co2.csv':
            print y
            break
        c= c+ 1

        if c is k:
            return y
            print(type(y))

for w in range(29):
    print(scrapper(url,w))

Answer 1

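You are re-downloading and re-parsing the full HTML page on every one of the 29 iterations of your loop (range(29)), just to get the next CSV link and see whether it is the one you want. That is very inefficient, and not very polite to the server. Read the HTML page once, then loop over the tags you already have and check whether each one is the one you want. If it is, do something with it and stop the loop to avoid needless further processing, since you said you only need one particular file.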

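The other issue, which does relate to your question, is that the CSV hrefs in the HTML file are relative URLs. You therefore have to join them onto the base URL of the document they appear in; urlparse.urljoin() does exactly that.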
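As a small illustration of that joining step, here is a sketch in Python 3, where the same function lives in urllib.parse rather than in the Python 2 urlparse module used in this answer. The variable names are only illustrative.

```python
from urllib.parse import urljoin  # Python 3 home of urlparse.urljoin

page_url = 'https://vincentarelbundock.github.io/Rdatasets/datasets.html'
relative_href = 'csv/datasets/co2.csv'  # the href reported in the question

# A relative href is resolved against the directory of the page URL,
# so the final path segment "datasets.html" is replaced.
csv_url = urljoin(page_url, relative_href)
print(csv_url)
```

The joined URL is what you would pass to urlopen() to download the file.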

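Not directly related to the question, but you should also try to clean up your code:

Fix your indentation (the comment on line 9).
Choose better variable names; y/c/k/w are meaningless.

This results in: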

import urllib
import urlparse

from BeautifulSoup import *  # BeautifulSoup 3 (Python 2), as in the question

url = 'https://vincentarelbundock.github.io/Rdatasets/datasets.html'


def scraper(url):
    html = urllib.urlopen(url).read() 
    soup = BeautifulSoup(html)
    # Retrieve all of the anchor tags
    tags = soup('a')
    for tag in tags:
        # Default to '' so that anchors without an href do not crash endswith().
        href = tag.get('href', '')
        if href.endswith("/co2.csv"):
            csv_url = urlparse.urljoin(url, href)
            # ... do something with the csv file....
            contents = urllib.urlopen(csv_url).read()
            print "csv file size=", len(contents)
            break   # we only needed this one file, so we end the loop.

scraper(url)
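
The code above is Python 2 with BeautifulSoup 3. For readers on Python 3, the same idea might be sketched with only the standard library as shown below. The names HrefCollector, find_csv_url and fetch_csv_rows are hypothetical helpers invented for this sketch, the inline HTML snippet is a made-up stand-in for the real page, and decoding as UTF-8 is an assumption about the server's encoding.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class HrefCollector(HTMLParser):
    """Collects the href attribute of every <a> tag it is fed."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:  # skip anchors that have no href
                self.hrefs.append(href)


def find_csv_url(html, page_url, suffix='/co2.csv'):
    """Return the absolute URL of the first link ending in suffix, or None."""
    collector = HrefCollector()
    collector.feed(html)
    for href in collector.hrefs:
        if href.endswith(suffix):
            return urljoin(page_url, href)  # relative href -> absolute URL
    return None


def fetch_csv_rows(csv_url):
    """Download a CSV file and return its rows as lists of strings.

    Assumes the file is UTF-8 encoded. Not called in the demo below,
    because it needs network access.
    """
    text = urlopen(csv_url).read().decode('utf-8')
    return list(csv.reader(io.StringIO(text)))


# Offline demo on a made-up snippet that mimics the page's link layout.
url = 'https://vincentarelbundock.github.io/Rdatasets/datasets.html'
demo_html = ('<a href="csv/datasets/other.csv">CSV</a>'
             '<a href="csv/datasets/co2.csv">CSV</a>')
print(find_csv_url(demo_html, url))
```

Against the live page you would download the HTML once with urlopen(url).read().decode('utf-8'), pass it to find_csv_url, and hand the result to fetch_csv_rows if it is not None.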

Solution 1
0 Accepted 2016-11-09 23:22:02