
How can I iterate through the pages of a website using Python?

I'm new to software development, and I'm not sure how to go about this. I want to visit every page of a website and grab a specific bit of data from each one. My problem is that I don't know how to iterate through all of the existing pages without knowing the individual URLs ahead of time. For example, I want to visit every page whose URL starts with

"http://stackoverflow.com/questions/"

Is there a way to compile a list and then iterate through it, or is it possible to do this without creating a giant list of URLs?

Try Scrapy.

It handles all of the crawling for you and lets you focus on processing the data, not extracting it. Rather than copy-paste the code that is already in the tutorial, I'll leave it to you to read it.

To grab a specific bit of data from a website, you could use a web scraping tool such as Scrapy.

If the required data is generated by JavaScript, you might need a browser-like tool such as Selenium WebDriver, and you would have to implement the crawling of links by hand.
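As a rough illustration of what crawling links by hand could look like, here is a minimal sketch that uses only the standard library. The names `LinkParser`, `fetch_with_urllib`, `crawl`, and the `max_pages` cap are made up for this example and are not part of Scrapy or Selenium. The default fetcher does a plain HTTP GET and does not run JavaScript.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_with_urllib(url):
    """Hypothetical default fetcher: plain HTTP GET, no JavaScript execution."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, prefix, fetch=fetch_with_urllib, max_pages=50):
    """Yield (url, html) for pages reachable from start_url whose URL starts with prefix."""
    queue = deque([start_url])
    seen = {start_url}  # URLs already queued, so no page is fetched twice
    visited = 0
    while queue and visited < max_pages:
        url = queue.popleft()
        html = fetch(url)
        visited += 1
        yield url, html

        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any "#fragment" part.
            link, _fragment = urldefrag(urljoin(url, href))
            if link.startswith(prefix) and link not in seen:
                seen.add(link)
                queue.append(link)
```

Because `crawl` is a generator, pages are fetched one at a time as you loop over it, and the URLs are discovered along the way instead of being listed up front. For JavaScript-generated pages, one option would be to pass a `fetch` function backed by Selenium that calls `driver.get(url)` and returns `driver.page_source`.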

For example, you can make a simple for loop, like this:

def webIterate():
    base_link = "http://stackoverflow.com/questions/"
    # Python 3: range() and print() replace xrange and the print statement.
    for i in range(24):
        print("%s%d" % (base_link, i))

webIterate()

The output will be:

http://stackoverflow.com/questions/0
http://stackoverflow.com/questions/1
http://stackoverflow.com/questions/2
...
http://stackoverflow.com/questions/23

It's just an example. You can pass in the question numbers and do whatever you want with them.
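If the concern is holding a giant list of URLs in memory, the same idea can be written as a generator. This is only a sketch: `question_urls` and its parameters are names invented for this example, and it assumes that question pages can be addressed by appending a number to the base link.

```python
def question_urls(question_ids, base_link="http://stackoverflow.com/questions/"):
    """Yield one URL per question id, building each string only when requested."""
    for question_id in question_ids:
        yield f"{base_link}{question_id}"


# Each URL is produced on demand; no full list is ever stored.
for url in question_urls(range(3)):
    print(url)
```

Any iterable of ids works here, so the numbers could also come from a file or another generator.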
