I need help with web scraping

Question

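I want to scrape visualizations from visual.ly, but I don't yet understand how the "show more" button works. So far my code gets the image link, the text next to the image, and the link to the page. I'd like to know what the "show more" button actually does, because I plan to loop over the pages, and right now I don't know how to step through them one at a time. Any ideas on how to loop through and keep fetching more images than the ones shown initially?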

from BeautifulSoup import BeautifulSoup
import re
import urllib
import urllib2

# Explore page (stray spaces removed from the query string).
url = 'http://visual.ly/?view=explore&type=static#v2_filter'
soup = BeautifulSoup(urllib2.urlopen(url).read())

# The wrapper div holds the grid of visualizations.
wrapper = soup.find("div", attrs={'class': 'view-mode-wrapper'})

# Match both 'v2_grid_column' and 'v2_grid_column last', so the last
# column no longer needs a separate branch.
columns = wrapper.findAll("div", attrs={'class': re.compile(r'^v2_grid_column\b')})

actually_download = False
counter = 1

for column in columns:
    # The first item has class '0 v2_grid_item viewmode-item'; the leading
    # number presumably varies, so match on the 'v2_grid_item' part only.
    items = column.findAll("div", attrs={'class': re.compile(r'\bv2_grid_item\b')})
    for item in items:
        page_link = item.find("a")['href']
        print counter
        print page_link

        # Follow the link to the visualization's own page.
        soup1 = BeautifulSoup(urllib2.urlopen(page_link).read())
        graphic = soup1.find("div", attrs={'class': 'ig-graphic-wrapper'})
        content = soup1.find("div", attrs={'class': 'ig-content-right'})
        description = content.findAll("div", attrs={'class': 'ig-description right-section first'})

        # Link to the full-size image, plus the description next to it.
        link = graphic.find("a")['href']
        print link
        print description

        if actually_download:
            filename = link.split('/')[-1]
            urllib.urlretrieve(link, filename)

        counter += 1

Answer 1

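You can't use the urllib-plus-parser combination here, because the site loads the additional content with JavaScript. For that you would need a full browser emulator (with JavaScript support). I have never used Selenium myself, but I've heard it can do this, and it has Python bindings.

However, I noticed that the site uses a very predictable form

http://visual.ly/?page=<page_number>

for its GET requests. Perhaps an easier way would be to go into

<div class="view-mode-wrapper">...</div>

and parse the data (using the URL format above). After all, the AJAX requests have to go somewhere.

Then you can do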

for i in xrange(<whatever>):
    url = r'http://visual.ly/?page={pagenum}'.format(pagenum=i)
    #do whatever you want from here
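
To make the loop above a bit more concrete, here is a minimal sketch of one way to fill it in. It is written for Python 3 using only the standard library (unlike the Python 2 code in the question), and the names `GridLinkParser`, `extract_links`, `fetch`, and `scrape` are hypothetical helpers made up for this illustration. It assumes that each page still wraps its items in `<div class="view-mode-wrapper">`, that the pages are UTF-8, and that a page with no links inside the wrapper means there is nothing more to fetch. None of this has been checked against the live site.

```python
from html.parser import HTMLParser
from urllib.request import urlopen

# URL pattern taken from the answer above.
BASE_URL = "http://visual.ly/?page={pagenum}"


class GridLinkParser(HTMLParser):
    """Collect href values of <a> tags inside <div class="view-mode-wrapper">."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # div nesting depth inside the wrapper (0 = outside)
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.depth:
                self.depth += 1
            elif "view-mode-wrapper" in (attrs.get("class") or "").split():
                self.depth = 1
        elif tag == "a" and self.depth and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1


def extract_links(html):
    """Return the links found inside the wrapper div, in document order."""
    parser = GridLinkParser()
    parser.feed(html)
    return parser.links


def fetch(url):
    """Download a page and decode it (UTF-8 is an assumption)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape(page_numbers, fetch_page=fetch):
    """Visit the given pages in order and return the unique links found.

    Stops early at the first page whose wrapper contains no links.
    """
    seen = []
    for pagenum in page_numbers:
        links = extract_links(fetch_page(BASE_URL.format(pagenum=pagenum)))
        if not links:
            break
        for link in links:
            if link not in seen:
                seen.append(link)
    return seen
```

Calling `scrape(range(1, 6))` would request pages 1 through 5. Whether numbering starts at 0 or 1 is something to check in the browser's network tab first. The `fetch_page` parameter is only there so the parsing logic can be tried on saved HTML without hitting the site.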

Answer 1, 2012-08-03 18:32:53
