
Trying to save information on different webpages

I have a website with information about topics (explaining what each one is). Each topic has its own webpage, and each webpage is laid out the same way. I want to retrieve this information automatically. I was thinking of using something like wget to grab the info, but I'm new to wget, so I don't know if it will work, nor how I would run it to go to each page and get the information I want.

I hope I've made a little sense here. Like I said, my attempt at the problem is using wget and maybe a Python script? I'm not asking for a ready-made script, just looking for some direction.

Every once in a while I have the same problem; what I usually do is write a small script like this:

import re
import urllib2

url = "http://www.yoursite.com/topics"
custom_regex = re.compile("insert your regex here")  # pattern matching the links you want
req = urllib2.Request(url, headers={"User-Agent": "Magic Browser"})
text = urllib2.urlopen(req).read()
for link in custom_regex.findall(text):
    print link

And then use it like this:

python script.py > urls.txt
wget -i urls.txt

The -i option tells wget to download all the URLs listed in a file, one URL per line.

To retrieve a web page in Python, rather than using wget, I would recommend using Python's urllib2 - https://docs.python.org/2/howto/urllib2.html
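The linked HOWTO covers Python 2's urllib2; on Python 3 the same calls live in urllib.request. A minimal sketch of the retrieval step (the URL and User-Agent value are just placeholders):

```python
import urllib.request

def fetch(url, user_agent="Magic Browser"):
    """Fetch a URL and return its body as text, sending a custom User-Agent header."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")

# e.g. text = fetch("http://www.yoursite.com/topics")
```

The custom User-Agent matters because some sites refuse requests from the default Python one.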

Once you have retrieved the web page, you can parse it using BeautifulSoup - http://www.crummy.com/software/BeautifulSoup/bs4/doc/ - it will parse the HTML for you, and you can go right to the pieces of the page you want.
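As a sketch of what the BeautifulSoup step looks like - the tag and class names here are made up; substitute whatever markup your topic pages actually use:

```python
from bs4 import BeautifulSoup

# Stand-in for the HTML you fetched; real pages would come from urllib/wget.
html = """<html><body>
<h1 class="topic-title">Example Topic</h1>
<div class="topic-description"><p>Explanation of what it is.</p></div>
</body></html>"""

soup = BeautifulSoup(html, "html.parser")
# Since every topic page is laid out the same, the same selectors work on all of them.
title = soup.find("h1", class_="topic-title").get_text(strip=True)
description = soup.find("div", class_="topic-description").get_text(strip=True)
```

Because all the topic pages share one layout, you write these selectors once and loop them over every saved page.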
