Scrape websites with Python
I have just started learning Python. I am trying to scrape a website to fetch the price and title from it. I have gone through multiple tutorials and blogs, and the most common libraries are beautifulsoup and scrapy. My question is: is there any way to scrape a website without using any third-party library like beautifulsoup or scrapy? It can use built-in libraries.
Please suggest a blog, article, or tutorial so that I can learn.
Instead of using scrapy, you can use urllib.
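As a rough sketch of the urllib route: the standard library's urllib.request can download a page without any third-party package. The helper name fetch_html and the User-Agent string below are made up for illustration, and nothing here is specific to the site in the question.

```python
from urllib.request import Request, urlopen


def fetch_html(url, timeout=10):
    """Download a page and return its body as text (hypothetical helper)."""
    # Some sites reject urllib's default User-Agent, so send a custom one.
    # "my-scraper/0.1" is an arbitrary placeholder value.
    request = Request(url, headers={"User-Agent": "my-scraper/0.1"})
    with urlopen(request, timeout=timeout) as response:
        # Use the charset declared in the response, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling fetch_html("https://example.com") (a placeholder URL) would return the page source as a string, which you can then search for the title and price.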
Instead of beautifulsoup, you can use regex.
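A minimal sketch of the regex approach, assuming the title sits in an h1 tag with class "title" and the price in a span with class "price". That markup is invented for the example; a real page will need its own patterns, and regexes tend to break as soon as the HTML changes shape.

```python
import re

# Hypothetical markup standing in for a downloaded page.
SAMPLE_HTML = """
<html><body>
  <h1 class="title">Blue Widget</h1>
  <span class="price">$19.99</span>
</body></html>
"""

# re.S lets "." match newlines, in case a tag's content spans several lines.
TITLE_RE = re.compile(r'<h1[^>]*class="title"[^>]*>(.*?)</h1>', re.S)
PRICE_RE = re.compile(
    r'<span[^>]*class="price"[^>]*>\s*\$?([\d.,]+)\s*</span>', re.S
)


def extract_title_and_price(html):
    """Return (title, price) as strings, or None for whichever is missing."""
    title = TITLE_RE.search(html)
    price = PRICE_RE.search(html)
    return (
        title.group(1).strip() if title else None,
        price.group(1) if price else None,
    )
```

This is fine for a quick one-off script, but it only matches the exact attribute layout the patterns expect.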
But scrapy and beautifulsoup make your life easier.
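If you want to stay with built-in libraries but avoid hand-written regexes, the standard library also ships html.parser. The sketch below is only an illustration: the class names "title" and "price" are assumptions about the page, and the parser deliberately ignores nested markup inside the target elements.

```python
from html.parser import HTMLParser


class TitlePriceParser(HTMLParser):
    """Collect the text of the first elements with class "title" and "price"."""

    WANTED = {"title", "price"}  # assumed class names, adjust for the real page

    def __init__(self):
        super().__init__()
        self.found = {}
        self._current = None  # name of the field currently being read, if any

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for name in classes:
            if name in self.WANTED and name not in self.found:
                self._current = name
                self.found[name] = ""
                break

    def handle_data(self, data):
        if self._current:
            self.found[self._current] += data

    def handle_endtag(self, tag):
        # Simplification: stop at the first closing tag, so nested tags
        # inside the target element are not supported.
        if self._current:
            self.found[self._current] = self.found[self._current].strip()
            self._current = None


def extract_title_and_price(html):
    """Return (title, price) as strings, or None for whichever is missing."""
    parser = TitlePriceParser()
    parser.feed(html)
    parser.close()
    return parser.found.get("title"), parser.found.get("price")
```

Compared with beautifulsoup this is noticeably more code for the same result, which is the trade-off the answer above is pointing at.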
Scrapy is not an easy library, so you can use requests or urllib instead.
I think the best, most popular, and easiest-to-learn libraries for web scraping in Python are requests, lxml, and BeautifulSoup (whose current version is bs4). In summary, Requests lets us make HTTP requests to the website's server to retrieve the data on its page. Getting the HTML content of a web page is the first and foremost step of web scraping.
Let's take a look at the advantages and disadvantages of the Requests Python library.
Advantages:
Disadvantages:
The requests library cannot parse the HTML it retrieves from a web page. Therefore, we need lxml, a high-performance, production-quality Python library for parsing HTML and XML.
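To give an idea of what parsing with lxml looks like, here is a small sketch using lxml.html and XPath. It assumes lxml is installed and that the page marks the title with an h1 of class "title" and the price with a span of class "price"; both are invented for this example.

```python
from lxml import html as lxml_html


def extract_title_and_price(page_source):
    """Return (title, price) as strings, or None for whichever is missing."""
    tree = lxml_html.fromstring(page_source)
    # XPath expressions for the assumed markup; text() selects the text nodes.
    titles = tree.xpath('//h1[@class="title"]/text()')
    prices = tree.xpath('//span[@class="price"]/text()')
    return (
        titles[0].strip() if titles else None,
        prices[0].strip() if prices else None,
    )
```

The page_source argument would normally be the text of a response you fetched with requests or urllib.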
Let's take a look at the advantages and disadvantages of the lxml Python library.
Advantages:
Disadvantages:
BeautifulSoup is perhaps the most widely used Python library for web scraping. It builds a parse tree from HTML and XML documents. Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8.
One major advantage of the Beautiful Soup library is that it works very well with poorly formed HTML and offers a lot of functions. The combination of Beautiful Soup and Requests is quite common in the industry.
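A minimal sketch of that Requests plus Beautiful Soup combination, assuming both packages are installed. The helper names and the "title" and "price" class names are hypothetical; inspect the real page to find the right tags.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url):
    """Download a page with Requests and return its HTML as text."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.text


def extract_title_and_price(html):
    """Return (title, price) as strings, or None for whichever is missing."""
    # "html.parser" is the parser bundled with Python; lxml could be used too.
    soup = BeautifulSoup(html, "html.parser")
    title = soup.find("h1", class_="title")
    price = soup.find("span", class_="price")
    return (
        title.get_text(strip=True) if title else None,
        price.get_text(strip=True) if price else None,
    )
```

Typical use would be extract_title_and_price(fetch_html(url)), where url is the product page you want to scrape.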
Advantages:
Disadvantages:
If you want to learn how to scrape web pages using Beautiful Soup, this tutorial is for you:
By the way, there are many other libraries you can try, such as Scrapy, Selenium, regex, and urllib.