
Scrape websites with Python

I have just started with Python. I am trying to scrape a website to fetch the price and title from it. I have gone through multiple tutorials and blogs, and the most common libraries are beautiful soup and scrapy . My question is: is there any way to scrape a website without using any 3rd party library like beautifulsoup or scrapy , using only built-in libraries? Please suggest a blog, article, or tutorial so that I can learn.

Instead of using scrapy you can use urllib .

Instead of beautifulsoup you can use regex .

But scrapy and beautifulsoup make your life easier.
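A minimal sketch of that approach using only the standard library ( urllib.request to download the page and re to pull out a title and a price); the URL and the price pattern are placeholders, not taken from the question:

```python
import re
import urllib.request

url = "https://example.com/product"  # placeholder URL

# Fetch the raw HTML with the built-in urllib module.
with urllib.request.urlopen(url) as response:
    html = response.read().decode("utf-8", errors="replace")

# Extract the <title> tag contents with a regular expression.
title_match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
title = title_match.group(1).strip() if title_match else None

# Hypothetical price pattern: matches something like "$19.99" in the page text.
price_match = re.search(r"\$\d+(?:\.\d{2})?", html)
price = price_match.group(0) if price_match else None

print(title, price)
```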

Scrapy is not an easy library, so you can use requests or urllib instead.

I think the best, most popular, and easiest to learn and use libraries for web scraping in Python are requests, lxml and BeautifulSoup (whose latest version is bs4). In summary, Requests lets us make HTTP requests to the website's server to retrieve the data on its page. Getting the HTML content of a web page is the first and foremost step of web scraping; a short sketch of this step follows the list below.

Let's take a look at the advantages and disadvantages of the Requests Python library.

Advantages:

  • Simple
  • Basic/Digest Authentication
  • International Domains and URLs
  • Chunked Requests
  • HTTP(S) Proxy Support

Disadvantages:

  • Retrieves only static content of a page
  • Can't be used for parsing HTML
  • Can't handle websites made purely with JavaScript
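A short sketch of that first step with requests (a third-party package installed with pip install requests); the URL below is a placeholder:

```python
import requests

url = "https://example.com/product"  # placeholder URL

response = requests.get(url, timeout=10)
response.raise_for_status()          # raise an error on 4xx/5xx responses

html = response.text                 # the page's HTML as a string
print(html[:200])                    # preview the first 200 characters
```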

We know the requests library cannot parse the HTML retrieved from a web page. Therefore, we require lxml, a high-performance, blazingly fast, production-quality HTML and XML parsing Python library.

Let's take a look at the advantages and disadvantages of the lxml Python library.

Advantages:

  • Faster than most of the parsers out there
  • Light-weight
  • Uses element trees
  • Pythonic API

Disadvantages:

  • Does not work well with poorly designed HTML
  • The official documentation is not very beginner-friendly
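A minimal sketch of parsing a fetched page with lxml; the URL and the XPath expressions are assumptions about the target page, not part of the original answer:

```python
import requests
from lxml import html

url = "https://example.com/product"  # placeholder URL

page = requests.get(url, timeout=10)
tree = html.fromstring(page.content)  # build an element tree from the HTML

# Hypothetical XPath expressions -- adjust them to the page's real structure.
titles = tree.xpath("//title/text()")
prices = tree.xpath('//span[@class="price"]/text()')

print(titles, prices)
```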

BeautifulSoup is perhaps the most widely used Python library for web scraping. It creates a parse tree for parsing HTML and XML documents. Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8.

One major advantage of the Beautiful Soup library is that it works very well with poorly designed HTML and has a lot of functions. The combination of Beautiful Soup and Requests is quite common in the industry.

Advantages:

  • Requires a few lines of code
  • Great documentation
  • Easy to learn for beginners
  • Robust
  • Automatic encoding detection

Disadvantages:

  • Slower than lxml
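A minimal sketch of the requests + Beautiful Soup combination mentioned above; the URL and the price tag's class name are hypothetical and need to be adapted to the real page:

```python
import requests
from bs4 import BeautifulSoup

url = "https://example.com/product"  # placeholder URL

page = requests.get(url, timeout=10)
soup = BeautifulSoup(page.text, "html.parser")

# Title from the <title> tag, if present.
title = soup.title.string if soup.title else None

# Hypothetical price element: a <span class="price"> somewhere on the page.
price_tag = soup.find("span", class_="price")
price = price_tag.get_text(strip=True) if price_tag else None

print(title, price)
```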

If you want to learn how to scrape web pages using Beautiful Soup, this tutorial is for you:

tutorial

By the way, there are many other libraries you can try, such as Scrapy, Selenium, regex, and urllib.


 