
Python Web Crawlers and "getting" html source code

So my brother wanted me to write a web crawler in Python (self-taught), and I know C++, Java, and a bit of HTML. I'm using version 2.7 and reading the Python library docs, but I have a few problems: 1. httplib.HTTPConnection and the request concept are new to me, and I don't understand whether it downloads an HTML script, like a cookie, or an instance. If you do both of those, do you get the source for a website page? And what are some terms I would need to know to modify the page and return the modified page?

Just for background, I need to download a page and replace any img with ones I have.

And it would be nice if you guys could tell me your opinion of 2.7 and 3.1.

Use Python 2.7, it has more 3rd party libs at the moment. (Edit: see below.)

I recommend using the stdlib module urllib2; it will allow you to comfortably get web resources. Example:

import urllib2

response = urllib2.urlopen("http://google.de")
page_source = response.read()

For parsing the code, have a look at BeautifulSoup.
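For the image-replacement goal from the question, here is a minimal sketch in Python 3. BeautifulSoup would likely be shorter, but this version swaps in the standard library's html.parser so it runs without extra installs. The names ImgSrcRewriter and replace_images are made up for illustration, and the sketch assumes every img should point at the same replacement file.

```python
import html
from html.parser import HTMLParser


class ImgSrcRewriter(HTMLParser):
    """Re-emit HTML as parsed, except that every <img> gets a new src."""

    def __init__(self, new_src):
        # convert_charrefs=False keeps entities such as &amp; as written
        super().__init__(convert_charrefs=False)
        self.new_src = new_src
        self.out = []

    def _img(self, attrs):
        # Drop any existing src, keep the other attributes, add the new src
        kept = [(n, v) for n, v in attrs if n != "src"]
        kept.append(("src", self.new_src))
        rendered = "".join(
            " %s" % n if v is None else ' %s="%s"' % (n, html.escape(v, quote=True))
            for n, v in kept
        )
        return "<img%s>" % rendered

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.out.append(self._img(attrs))
        else:
            self.out.append(self.get_starttag_text())  # original tag text

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>; emit it once, no separate end tag
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)


def replace_images(page_source, new_src):
    """Return page_source with the src of every <img> set to new_src."""
    parser = ImgSrcRewriter(new_src)
    parser.feed(page_source)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.out)
```

Calling replace_images(page_source, "mine.png") on the text you downloaded gives back the modified HTML as a string, which you can then save to a file or serve yourself.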

BTW: what exactly do you want to do:

Just for background, I need to download a page and replace any img with ones I have

Edit: It's 2014 now, most of the important libraries have been ported, and you should definitely use Python 3 if you can. python-requests is a very nice high-level library which is easier to use than urllib2.

An example with python3 and the requests library, as mentioned by @leoluk:

pip install requests

Script req.py:

import requests

url='http://localhost'

# in case you need a session
cd = { 'sessionid': '123..'}

r = requests.get(url, cookies=cd)
# or without a session: r = requests.get(url)
# r.content holds the raw bytes; r.text is the decoded HTML
print(r.text)

Now execute it and you will get the HTML source of localhost!

python3 req.py

If you are using Python 3.x you don't need to install any libraries; this is built directly into the Python standard library. The old urllib2 module has been split across urllib.request and urllib.error in Python 3:

from urllib import request

response = request.urlopen("https://www.google.com")
# set the correct charset below
page_source = response.read().decode('utf-8')
print(page_source)
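The "set the correct charset" comment can often be handled automatically, because the server usually names the charset in its Content-Type header. A minimal sketch, assuming the header is present and names a codec Python knows; decode_body is a made-up helper name, and the UTF-8 fallback is my assumption, not something the server guarantees:

```python
from email.message import Message


def decode_body(raw_bytes, headers, fallback="utf-8"):
    """Decode a response body using the charset named in Content-Type.

    headers is any email.message.Message-like object; the headers
    attribute of a urllib response is one. Falls back to `fallback`
    (an assumption) when the server names no charset.
    """
    charset = headers.get_content_charset() or fallback
    return raw_bytes.decode(charset, errors="replace")


# Offline demonstration with a hand-built header object (no network needed)
demo_headers = Message()
demo_headers["Content-Type"] = "text/html; charset=ISO-8859-1"
print(decode_body(b"caf\xe9", demo_headers))
```

With a real response this would be called as decode_body(response.read(), response.headers).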

The first thing you need to do is read the HTTP spec, which will explain what you can expect to receive over the wire. The data returned inside the content will be the "rendered" web page, that is, the HTML the server produced, not the source. The source could be a JSP, a servlet, a CGI script, in short, just about anything, and you have no access to that. You only get the HTML that the server sent you. In the case of a static HTML page, then yes, you will be seeing the "source". But for anything else you see the generated HTML, not the source.

When you say "modify the page and return the modified page", what do you mean?
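To make the source-versus-generated-HTML distinction concrete, here is a small Python 3 sketch that starts a throwaway server on your own machine and fetches from it with http.client (the Python 3 name for httplib). The handler class GeneratedPage and the function fetch_from_local_server are hypothetical names for this demo. The client only ever sees the status line, the headers, and the HTML bytes, never the Python code that built them.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class GeneratedPage(BaseHTTPRequestHandler):
    """Server-side 'source': Python code that builds HTML on each request."""

    def do_GET(self):
        page = "<html><body><p>You asked for %s</p></body></html>" % self.path
        body = page.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def fetch_from_local_server(path="/hello"):
    """Serve one generated page locally and return what the client receives."""
    server = HTTPServer(("127.0.0.1", 0), GeneratedPage)  # port 0: OS picks a free port
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
        conn.request("GET", path)       # send the HTTP request
        response = conn.getresponse()   # status line and headers
        status = response.status
        content_type = response.getheader("Content-Type")
        body = response.read().decode("utf-8")  # the HTML the server sent
        conn.close()
        return status, content_type, body
    finally:
        server.shutdown()
        server.server_close()
```

Calling fetch_from_local_server("/hello") returns the status code, the Content-Type header, and the HTML text. The do_GET code itself never appears in the body, which is the point made above.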
