
Scrape websites with Python

Original post: 2020-06-09 16:59:48 | Tags: python, web-scraping, beautifulsoup, scrapy, libraries

I have just started learning Python. I am trying to scrape a website to fetch the price and title from it. I have gone through multiple tutorials and blogs, and the most common libraries are Beautiful Soup and Scrapy. My question: is there any way to scrape a website without using a third-party library like beautifulsoup or scrapy? Built-in libraries are fine. Please suggest a blog, article, or tutorial so that I can learn.

2 answers

Instead of using scrapy you can use urllib.

Instead of beautifulsoup you can use regex.

But scrapy and beautifulsoup make your life easier.

Scrapy is not an easy library to start with, so you can use requests or urllib instead.
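A minimal sketch of the urllib plus regex approach, using only the standard library. The User-Agent header, the class="price" markup, and both patterns are assumptions for illustration; a real page needs patterns written against its own HTML, and regex matching breaks easily when the markup changes.

```python
import re
import urllib.request


def fetch_html(url, timeout=10):
    """Download a page with the standard library and return it as text."""
    # Some sites reject urllib's default User-Agent; this header value is an assumption.
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Hypothetical patterns: they assume a <title> tag and an element with class="price".
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)
PRICE_RE = re.compile(r'class="price"[^>]*>\s*([^<]+?)\s*<', re.IGNORECASE)


def extract_title_and_price(html):
    """Return (title, price) strings, with None for any field that was not found."""
    title = TITLE_RE.search(html)
    price = PRICE_RE.search(html)
    return (
        title.group(1).strip() if title else None,
        price.group(1).strip() if price else None,
    )


if __name__ == "__main__":
    # Inline sample so the sketch runs without network access.
    sample = (
        "<html><head><title> Blue Mug </title></head>"
        '<body><span class="price"> $12.50 </span></body></html>'
    )
    print(extract_title_and_price(sample))
```

To use it on a live page, pass the result of fetch_html(url) to extract_title_and_price.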

I think the best, most popular, and easiest-to-learn libraries for web scraping in Python are requests, lxml, and BeautifulSoup (the current version is bs4). In summary, Requests lets us make HTTP requests to the website's server to retrieve the data on its page. Getting the HTML content of a web page is the first and foremost step of web scraping.

Let's take a look at the advantages and disadvantages of the Requests library.

Advantages:

Simple
Basic/Digest authentication
International domains and URLs
Chunked requests
HTTP(S) proxy support

Disadvantages:

Retrieves only the static content of a page
Can't be used for parsing HTML
Can't handle websites that render their content purely with JavaScript

Requests cannot parse the HTML it retrieves from a web page. For that we need a parser such as lxml, a fast, production-quality Python library for parsing HTML and XML.

Let's take a look at the advantages and disadvantages of the lxml library.

Advantages:

Faster than most other parsers
Lightweight
Uses element trees
Pythonic API

Disadvantages:

Does not work well with poorly designed HTML
The official documentation is not very beginner-friendly

BeautifulSoup is perhaps the most widely used Python library for web scraping. It creates a parse tree for parsing HTML and XML documents. Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8.

One major advantage of the Beautiful Soup library is that it works very well with poorly designed HTML and offers many features. The combination of Beautiful Soup and Requests is quite common in the industry.

Advantages:

Requires only a few lines of code
Great documentation
Easy for beginners to learn
Robust
Automatic encoding detection

Disadvantages:

Slower than lxml
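Since the question asks about built-in libraries, here is a rough standard-library counterpart to what a parser library does, sketched with html.parser. This is not Beautiful Soup's API; the TitlePriceParser class, the parse_title_and_prices helper, and the "price" class name are all hypothetical, and the sketch handles far fewer edge cases than Beautiful Soup or lxml.

```python
from html.parser import HTMLParser


class TitlePriceParser(HTMLParser):
    """Collect the <title> text and the text of elements whose class includes "price"."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.prices = []
        self._in_title = False
        self._price_tag = None  # tag name of the price element being read, if any

    def handle_starttag(self, tag, attrs):
        # Attribute values can be None (e.g. a bare "class"), hence the "or".
        classes = (dict(attrs).get("class") or "").split()
        if tag == "title":
            self._in_title = True
        elif "price" in classes:  # "price" is an assumed class name
            self._price_tag = tag
            self.prices.append("")  # start a new buffer for this element

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == self._price_tag:
            self._price_tag = None

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._price_tag is not None:
            self.prices[-1] += data


def parse_title_and_prices(html):
    """Return (title, [price, ...]) extracted from an HTML string."""
    parser = TitlePriceParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), [price.strip() for price in parser.prices]


if __name__ == "__main__":
    sample = (
        "<html><head><title>Shop</title></head><body>"
        '<span class="price">$3.00</span>'
        '<div class="item price"> $4.50 </div>'
        "<p>other</p></body></html>"
    )
    print(parse_title_and_prices(sample))
```

Compared with regex, an event-based parser like this reads attributes properly and does not depend on attribute order, at the cost of more code than a library such as Beautiful Soup would need.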

If you want to learn how to scrape web pages using Beautiful Soup, this tutorial is for you:

tutorial

By the way, there are many other options you can try, such as Scrapy, Selenium, regex, and urllib.