How do I iterate through the pages of a website using Python?

Question

I'm new to software development, and I'm not sure how to go about this. I want to visit every page of a website and grab a specific bit of data from each one. My problem is that I don't know how to iterate through all of the existing pages without knowing the individual URLs ahead of time. For example, I want to visit every page whose URL starts with

"http://stackoverflow.com/questions/"

Is there a way to compile a list and then iterate through that, or is it possible to do this without creating a giant list of URLs?

Answer 1

Try Scrapy.

It handles all of the crawling for you and lets you focus on processing the data rather than on retrieving it. Instead of copy-pasting the code that is already in the tutorial, I'll leave it to you to read it.

Answer 2

To grab a specific bit of data from a website, you could use a web scraping tool, e.g. Scrapy.

If the required data is generated by JavaScript, you might need a browser-like tool such as Selenium WebDriver, and you would then implement the crawling of the links by hand.
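As a rough illustration of what "crawling the links by hand" can look like, here is a minimal sketch that uses only the standard library. It is not how Scrapy or Selenium work internally. The names `LinkCollector`, `fetch_html`, `crawl`, and the `max_pages` limit are hypothetical and chosen for this example. It assumes pages are plain HTML decodable as UTF-8, and it keeps only links that start with a given prefix. Because `crawl` is a generator, pages are visited one at a time and no full URL list is built up front.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_html(url):
    """Download a page and return its body as text (assumes UTF-8)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, prefix, fetch=fetch_html, max_pages=50):
    """Yield (url, html) for pages reachable from start_url under prefix."""
    queue = deque([start_url])
    seen = {start_url}  # URLs already queued, so no page is fetched twice
    visited = 0
    while queue and visited < max_pages:
        url = queue.popleft()
        html = fetch(url)
        visited += 1
        yield url, html

        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            # Resolve relative links and drop any "#fragment" part
            link, _ = urldefrag(urljoin(url, href))
            if link.startswith(prefix) and link not in seen:
                seen.add(link)
                queue.append(link)
```

A caller would loop over it with something like `for url, html in crawl(start, prefix): ...` and extract the data it needs from `html`. A real crawler should also respect the site's robots.txt, pause between requests, and handle download errors, all of which this sketch leaves out.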

Answer 3

For example, you can write a simple for loop, like this:

def web_iterate():
    base_link = "http://stackoverflow.com/questions/"
    # Build one URL per question number by appending it to the base link
    for i in range(24):
        print("%s%d" % (base_link, i))

web_iterate()

The output will be:

http://stackoverflow.com/questions/0
http://stackoverflow.com/questions/1
http://stackoverflow.com/questions/2
...
http://stackoverflow.com/questions/23

This is just an example. You can pass in the question numbers and do whatever you want with them.
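To address the "without creating a giant list" part of the question, the same idea can be written as a generator, which produces each URL only when the loop asks for it. This is a minimal sketch. The function name `question_urls` and its `start` and `stop` parameters are hypothetical, and it assumes pages are addressed by consecutive integers. Not every number necessarily corresponds to an existing page, so real code would need to handle missing pages.

```python
from itertools import islice


def question_urls(base_link, start=0, stop=None):
    """Lazily yield base_link + number, one URL at a time."""
    i = start
    # With stop=None the generator never ends; the caller decides when to quit
    while stop is None or i < stop:
        yield "%s%d" % (base_link, i)
        i += 1


base_link = "http://stackoverflow.com/questions/"
# islice takes only the first three URLs from the unbounded generator
for url in islice(question_urls(base_link), 3):
    print(url)
```

Only one URL exists in memory at any moment, so the range can be as large as needed.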


Answer 1
4 2012-06-14 06:18:30

Answer 2
0 2012-06-14 06:21:15

Answer 3
-2 2012-06-14 06:17:04
