
Scrape websites with Python

Original post 2020-06-09 16:59:48 6 2 python/ web-scraping/ beautifulsoup/ scrapy/ libraries

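I have just started with Python. I am trying to scrape a website to get prices and titles from it. I have gone through multiple tutorials and blogs, and the most common libraries are Beautiful Soup and Scrapy. My question: is there any way to scrape a website without using any third-party library such as BeautifulSoup or Scrapy? Built-in libraries are fine. Please recommend a blog, article, or tutorial so that I can learn.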

2 answers

You can use urllib instead of Scrapy.

You can use regex instead of BeautifulSoup.

But Scrapy and BeautifulSoup make your life easier.

Scrapy is not an easy library, so you can use requests or urllib instead.
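
As a rough sketch of the standard-library route described above, the following uses urllib.request to download a page and re to pull out titles and prices. The URL in the comment, the `product-title` and `price` class names, the sample HTML, and the helper names are all hypothetical; a real page will need its own patterns, and regular expressions tend to break when the markup changes.

```python
import re
from urllib.request import Request, urlopen


def fetch_html(url, timeout=10):
    """Download a page and return its text (hypothetical helper)."""
    # Some sites reject urllib's default User-Agent, so send a generic one.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Assumed markup: <h2 class="product-title">...</h2> and <span class="price">...</span>
TITLE_RE = re.compile(r'<h2 class="product-title">(.*?)</h2>', re.S)
PRICE_RE = re.compile(r'<span class="price">(.*?)</span>', re.S)


def extract_items(html):
    """Return (title, price) pairs found in the HTML string."""
    titles = [t.strip() for t in TITLE_RE.findall(html)]
    prices = [p.strip() for p in PRICE_RE.findall(html)]
    return list(zip(titles, prices))


# Made-up page content so the example runs without network access.
SAMPLE_HTML = """
<div><h2 class="product-title">Blue Mug</h2><span class="price">$4.50</span></div>
<div><h2 class="product-title">Red Mug</h2><span class="price">$5.00</span></div>
"""

if __name__ == "__main__":
    # For a live page you would call: html = fetch_html("https://example.com/products")
    print(extract_items(SAMPLE_HTML))
```

Pairing titles and prices with zip assumes every product has exactly one of each, in the same order; if that does not hold, a single pattern that captures both fields per product block is safer.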

I think the best, most popular, and easiest-to-learn libraries for web scraping in Python are requests, lxml, and BeautifulSoup (whose latest version is bs4). In summary, Requests lets us make HTTP requests to the website's server to retrieve the data on its pages. Getting the HTML content of a web page is the first step of web scraping.

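Let's look at the pros and cons of the Requests Python library.

Pros:

Simple
Basic/Digest authentication
International domains and URLs
Chunked requests
HTTP(S) proxy support

Cons:

Retrieves only the static content of a page
Cannot be used to parse HTML
Cannot handle websites built purely with JavaScript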
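
The last drawback can be illustrated without any network access. The sketch below runs the standard library's html.parser over two made-up HTML strings: one with the product text in the markup, and one where a script would fill an empty container in the browser. An HTTP client such as Requests or urllib only ever receives the second string as-is, so there is no text to extract. The class name `VisibleTextParser`, the `render` call inside the script, and both sample pages are hypothetical.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def visible_text(html):
    """Return the non-script text fragments found in an HTML string."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Content present in the HTML the server sends.
STATIC_PAGE = '<div id="app"><p>Blue Mug</p><p>$4.50</p></div>'
# Content that only appears after a browser runs the script.
JS_PAGE = '<div id="app"></div><script>render("Blue Mug", "$4.50")</script>'

if __name__ == "__main__":
    print(visible_text(STATIC_PAGE))
    print(visible_text(JS_PAGE))
```

For pages like the second one, the usual options are to drive a real browser or to find the underlying data endpoint the script calls.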

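We know that the Requests library cannot parse the HTML retrieved from a web page. Therefore, we need lxml, a high-performance, very fast, production-quality HTML and XML parsing library for Python.

Now let's look at the pros and cons of the lxml Python library.

Pros:

Faster than most other parsers
Lightweight
Uses element trees
Pythonic API

Cons:

Does not work well with poorly designed HTML
The official documentation is not very beginner-friendly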
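
To show what "uses element trees" means in practice, here is a small sketch using the standard library's xml.etree.ElementTree, whose API lxml.etree largely mirrors. It is a stand-in rather than lxml itself, and it requires well-formed markup; the sample document, tag names, and class names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed markup standing in for a downloaded page.
SAMPLE = """
<html><body>
  <div class="item"><h2>Blue Mug</h2><span class="price">$4.50</span></div>
  <div class="item"><h2>Red Mug</h2><span class="price">$5.00</span></div>
</body></html>
"""


def parse_items(markup):
    """Walk the element tree and return (title, price) pairs."""
    root = ET.fromstring(markup.strip())
    items = []
    # ElementTree supports a limited XPath subset, including attribute tests.
    for div in root.iterfind(".//div[@class='item']"):
        title = div.findtext("h2")
        price = div.findtext("span[@class='price']")
        items.append((title, price))
    return items


if __name__ == "__main__":
    print(parse_items(SAMPLE))
    try:
        ET.fromstring("<p>unclosed paragraph")
    except ET.ParseError as exc:
        # The strict XML parser rejects markup with missing closing tags.
        print("parse failed:", exc)
```

The strict parser used here rejects broken markup outright, which is one reason lenient HTML parsers exist.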

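BeautifulSoup is probably the most widely used Python library for web scraping. It creates a parse tree for parsing HTML and XML documents. Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8.

A major advantage of the Beautiful Soup library is that it works well with poorly designed HTML and has a lot of features. The combination of Beautiful Soup and Requests is quite common in the industry.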

Pros:

Requires only a few lines of code
Great documentation
Easy for beginners to learn
Robust
Automatic encoding detection

Cons:

Slower than lxml

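If you want to learn how to scrape web pages with Beautiful Soup, this tutorial is for you:

By the way, there are many other tools you can try for web scraping, such as Scrapy and Selenium, as well as regular expressions and urllib.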


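Notice: the technical posts on this site are licensed under CC BY-SA 4.0. If you repost, please credit this site's URL or the original address. For any questions, contact yoyou2525@163.com.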
