
Unable to scrape websites using python

I am practicing scraping websites, so I chose this site: https://www.dunzo.com/bangalore/nilgiris-supermarket-koramangala-ejipura

Here is the code I am using:

import requests
from bs4 import BeautifulSoup

url = "https://www.dunzo.com/bangalore/nilgiris-supermarket-koramangala-ejipura"
r = requests.get(url)
htmlcontent = r.content
soup = BeautifulSoup(htmlcontent, 'html.parser')
elem = soup.select('.hozIhp')
print(elem)

Now I am getting the output as:

[<p class="sc-1gu8y64-0 dlNpIS sc-1twyv6b-1 hozIhp">Britannia Sweet Bread</p>, <p class="sc-1gu8y64-0 dlNpIS sc-1twyv6b-1 hozIhp">Britannia Sweet Bun</p>, <p class="sc-1gu8y64-0 dlNpIS sc-1twyv6b-1 hozIhp">Nilgiris Cheese Garlic Bread</p>, <p class="sc-1gu8y64-0 dlNpIS sc-1twyv6b-1 hozIhp">Nilgiris Fruit Bread</p>, <p class="sc-1gu8y64-0 dlNpIS sc-1twyv6b-1 hozIhp">Nilgiris Pav Bun</p>, <p class="sc-1gu8y64-0 dlNpIS sc-1twyv6b-1 hozIhp">Nilgiri's Broken Wheat Bread</p>, <p class="sc-1gu8y64-0 dlNpIS sc-1twyv6b-1 hozIhp">Nilgiri's Garlic Bread</p>, <p class="sc-1gu8y64-0 dlNpIS sc-1twyv6b-1 hozIhp">Nilgiri's Multi Grain Bread</p>, <p class="sc-1gu8y64-0 dlNpIS sc-1twyv6b-1 hozIhp">Nilgiri's Whole Wheat Bread</p>, <p class="sc-1gu8y64-0 dlNpIS sc-1twyv6b-1 hozIhp">Nilgiri's Whole Wheat Brown Bread</p>]

So the output came in the form of a list. Now I want to extract just the item names, such as Britannia Sweet Bread, Britannia Sweet Bun, Nilgiris Cheese Garlic Bread, etc. I tried some methods, such as adding .text to the soup, but it didn't work. Can someone please help me with how to do that?

Try this:

import requests
from bs4 import BeautifulSoup

url = "https://www.dunzo.com/bangalore/nilgiris-supermarket-koramangala-ejipura"
r = requests.get(url)
htmlcontent = r.content
soup = BeautifulSoup(htmlcontent, 'html.parser')
elem = soup.select('.hozIhp')
print(*[el.text for el in elem], sep="\n")

Output:

Britannia Sweet Bread
Britannia Sweet Bun
Nilgiris Cheese Garlic Bread
Nilgiris Fruit Bread
Nilgiris Pav Bun
Nilgiri's Broken Wheat Bread
Nilgiri's Garlic Bread
Nilgiri's Multi Grain Bread
Nilgiri's Whole Wheat Bread
Nilgiri's Whole Wheat Brown Bread
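
As a side note on the answer above: if you want the names as a plain Python list rather than printed lines, a list comprehension over the selected tags works. A minimal self-contained sketch on an inline HTML fragment (the fragment mirrors the page's markup so the snippet runs without network access):

```python
from bs4 import BeautifulSoup

# A small stand-in for the page's markup, so the sketch runs offline
html = """
<p class="sc-1gu8y64-0 dlNpIS sc-1twyv6b-1 hozIhp">Britannia Sweet Bread</p>
<p class="sc-1gu8y64-0 dlNpIS sc-1twyv6b-1 hozIhp">Britannia Sweet Bun</p>
"""

soup = BeautifulSoup(html, 'html.parser')
# get_text(strip=True) pulls the text of each tag and trims whitespace
names = [el.get_text(strip=True) for el in soup.select('.hozIhp')]
print(names)  # ['Britannia Sweet Bread', 'Britannia Sweet Bun']
```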
Add a loop to your code:

import requests
from bs4 import BeautifulSoup

url = "https://www.dunzo.com/bangalore/nilgiris-supermarket-koramangala-ejipura"
r = requests.get(url)
htmlcontent = r.content
soup = BeautifulSoup(htmlcontent, 'html.parser')
elem = soup.select('.hozIhp')
# add to your code
for item in elem:
    print(item.text)

The issue you are having: the page loads its content dynamically, so requests cannot fetch the full page.

To fix this, you'll need a little more setup first: install Selenium with pip install selenium, download a compatible Google Chrome webdriver from https://chromedriver.chromium.org/downloads (you must have Google Chrome installed on your computer), and extract the webdriver into the same folder as your Python script.

Then run this code:

from bs4 import BeautifulSoup
from selenium import webdriver
import time

browser = webdriver.Chrome(executable_path="chromedriver")

url = "https://www.dunzo.com/bangalore/nilgiris-supermarket-koramangala-ejipura"
browser.get(url)

# the browser will scroll down 6 times to load the remaining content
for i in range(6):
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    # wait 5 seconds for the content to load (adjust for your internet speed)
    time.sleep(5)

# page_source replaces the requests call from your original code
html = browser.page_source
soup = BeautifulSoup(html, 'html.parser')
elem = soup.select('.hozIhp')
for item in elem:
    print(item.text)

browser.close()

Output:

Britannia Sweet Bread
Britannia Sweet Bun
Nilgiris Cheese Garlic Bread
Nilgiris Fruit Bread
Nilgiris Pav Bun
Nilgiri's Broken Wheat Bread
Nilgiri's Garlic Bread
Nilgiri's Multi Grain Bread
Nilgiri's Whole Wheat Bread
Nilgiri's Whole Wheat Brown Bread
Nilgiri's Milk Bread
Nilgiri's Sandwich Bread
Bajaj White Eggs Gold Pack
Suguna Healthy Eggs
Eggs
Nandini - Shubham Pasteurized Standardized Milk
Nandini Good Life Slim Milk
Nilgiris Lite Milk
Nilgiris Double Toned Milk
Nilgiris Full Cream Milk
Nilgiri's Rich Milk
Amul Premium Dahi
Amul Cheese Slices A+
Cavin's Curd Pouch
Epigamia Mishti Doi
Id Natural Curd
Milky Mist Mango Yogurt
Nilgiris Curd Lite
Nilgiris Low Fat Probiotic Curd
Nilgiris Paneer
Nestle A+ Nourish Dahi
Nilgiris Natural Curd Set
Nilgiris Butter Milk
Nilgiri's Toned Milk Curd Pouch
Nilgiri's Lite Curd Pouch
Nilgiri's Malai Paneer
Soulfull Choco And Vanilla Fills - Ragi Bites
Soulfull Choco Fills - Ragi Bites
Soulfull Vanilla Fills - Ragi Bites
Soulfull Strawberry Fills - Ragi Bites
Soulfull Diet Millet Muesli
Soulfull Fruit & Nut Millet Muesli
Soulfull Crunchy Millet Muesli
Soulfull Baked Desi Muesli - Chatpata
Soulfull Baked Desi Muesli - Masala
Kellogg's Corn Flakes
Fortune Mini Soya Chunks
Kellogg's Chocos Moon And Stars
Soulfull Millet Smoothix - Cocoa Lite Protein Drink Sachets
Soulfull Millet Smoothix - Almond Protein Drink Sachets
Soulfull Millet Smoothix - Almond Protein Drink Sachets
Soulfull Millet Smoothix - Cocoa Lite Protein Drink Sachets

As explained in the documentation, you can use get_text() to extract the text from a document or a tag.
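
For example (a minimal sketch, not from the original answer): on a tag, .text and get_text() return the same string, but get_text() also accepts strip and separator arguments for cleaning up the result:

```python
from bs4 import BeautifulSoup

# Parse a single <p> tag styled like the ones on the page
tag = BeautifulSoup('<p class="hozIhp">  Britannia Sweet Bread  </p>',
                    'html.parser').p

print(tag.text)                  # '  Britannia Sweet Bread  ' (whitespace kept)
print(tag.get_text(strip=True))  # 'Britannia Sweet Bread'
```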

