
Trying to save information on different webpages

I have a website with information about topics (explaining what each one is). Each topic has its own webpage, and each webpage is laid out the same way. I want to retrieve this information automatically. I was thinking of using something like wget to grab the info, but I'm new to wget, so I don't know if it will work, nor how I would run it to go to each page and get the information I want.

I hope I've made a little sense here. Like I said, my attempt at the problem is using wget and maybe a Python script? I'm not asking for a ready-made script, just looking for some direction.

Every once in a while I have the same problem; what I usually do is write a small script like this:

import re
import urllib2

url = "http://www.yoursite.com/topics"
custom_regex = re.compile("insert your regex here")  # pattern matching the links you want
req = urllib2.Request(url, headers={"User-Agent": "Magic Browser"})
text = urllib2.urlopen(req).read()
for link in custom_regex.findall(text):
    print link

And then use it like this:

python script.py > urls.txt
wget -i urls.txt

The -i option tells wget to download all the URLs listed in a file, one URL per line.

To retrieve a web page in Python, rather than using wget, I would recommend using Python's urllib2 - https://docs.python.org/2/howto/urllib2.html
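The linked HOWTO covers Python 2's urllib2; on Python 3 the same calls live in urllib.request. A minimal sketch of the retrieval step (the URL and User-Agent value are just placeholders):

```python
import urllib.request

def fetch(url, user_agent="Magic Browser"):
    """Fetch a URL and return its body as text, sending a custom User-Agent header."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")

# e.g. text = fetch("http://www.yoursite.com/topics")
```

The custom User-Agent matters because some sites refuse requests from the default Python one.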

Once you have retrieved the web page, you can parse it using BeautifulSoup - http://www.crummy.com/software/BeautifulSoup/bs4/doc/ - it will parse the HTML for you, and you can go right to the pieces of the page you want.
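As a sketch of what the BeautifulSoup step looks like - the tag and class names here are made up; substitute whatever markup your topic pages actually use:

```python
from bs4 import BeautifulSoup

# Stand-in for the HTML you fetched; real pages would come from urllib/wget.
html = """<html><body>
<h1 class="topic-title">Example Topic</h1>
<div class="topic-description"><p>Explanation of what it is.</p></div>
</body></html>"""

soup = BeautifulSoup(html, "html.parser")
# Since every topic page is laid out the same, the same selectors work on all of them.
title = soup.find("h1", class_="topic-title").get_text(strip=True)
description = soup.find("div", class_="topic-description").get_text(strip=True)
```

Because all the topic pages share one layout, you write these selectors once and loop them over every saved page.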
