
How to extract content from tags with Beautiful Soup

I have been practicing web scraping with Beautiful Soup, but every time I switch to a different website the tag structure is so different that it confuses me. This time I am trying to scrape the Amazon best sellers page ( https://www.amazon.com/Best-Sellers-Appstore-Android/zgbs/mobile-apps/ref=zg_bs_unv_mas_1_9408444011_1 ) for the ranking, name, rating, and number of reviews of each item (circled in the picture below). [image]

My idea is to find the "main" tag for each item and then dig into the tags that hold the information I want. So I used .select() and started with the li class. But when I add another tag after "span.a-list-item", I get an empty result with the following code:

container = page.select('li.zg-item-immersion > span.a-list-item > div.a-section a-spacing-none aok-relative' )

Is there a limit on how many tags I can put into .select(), or am I doing something wrong?

So I stopped at "span.a-list-item" and tried the following approach, but I don't understand why my code sometimes gives me an empty result and sometimes returns what I want. I guess this is related to the connection to the website?

from bs4 import BeautifulSoup
import requests
url = "https://www.amazon.com/Best-Sellers-Appstore-Android/zgbs/mobile-apps/ref=zg_bs_unv_mas_1_9408444011_1"
page = BeautifulSoup(requests.get(url).content,'lxml')    
containers = page.select('li.zg-item-immersion > span.a-list-item')
ranking = (containers[1].find("span",class_="zg-badge-text").text)[1:]

With the last line I was able to get the ranking number, but when I try to append the rankings to a list in a loop,

for item in range(50):
   ranking.append((containers[item].find("span",class_="zg-badge-text").text)[1:])

I keep getting a "list index out of range" error, and I don't understand why the index is out of range when there are 50 items on a single page.

Last but not least, can I please get some advice on learning to scrape different websites? I have read the Beautiful Soup documentation and followed its instructions on using the different functions to get to a specific tag, but I am still not getting what I want.

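Rather than indexing into the list with range(50), loop over the containers directly and read the text of each one. You also need to send a User-Agent header with the request; without a browser-like User-Agent the site may return a different page that contains none of the expected items, which would explain the empty results.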

Code:

from bs4 import BeautifulSoup
import requests

# Send a browser-like User-Agent header with the request
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36'}
url = "https://www.amazon.com/Best-Sellers-Appstore-Android/zgbs/mobile-apps/ref=zg_bs_unv_mas_1_9408444011_1"
r = requests.get(url, headers=headers)
page = BeautifulSoup(r.content, 'lxml')

# Iterate over whatever was found instead of assuming there are 50 items
containers = page.select('li.zg-item-immersion > span.a-list-item')
for container in containers:
    ranking = container.find("span", class_="zg-badge-text").text
    print(ranking)

Output:

#1
#2 
#3 
#4 
#5 
#6 
#7 
#8 
#9 
#10
#11
#12
#13
#14
#15
#16
#17
#18
#19
#20
#21
#22
#23
#24
#25
#26
#27
#28
#29
#30
#31
#32
#33
#34
#35
#36
#37
#38
#39
#40
#41
#42
#43
#44
#45
#46
#47
#48
#49
#50
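
As for the empty result from the longer selector in the question: there is no limit on how many tags .select() accepts. The likely cause is the spaces in "div.a-section a-spacing-none aok-relative". In a CSS selector a space means "descendant", so that string searches for tags named a-spacing-none and aok-relative; to match one element that carries several classes, chain the classes with dots. The sketch below illustrates this on a small made-up HTML fragment that only reuses the class names from the question (the real Amazon markup may differ or change), and it also builds the ranking list by looping over the matches and skipping any item without a badge, which avoids the "list index out of range" error.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that reuses the class names quoted in the question;
# it is not the real Amazon page. The third item has no ranking badge on purpose.
SAMPLE_HTML = """
<ol>
  <li class="zg-item-immersion"><span class="a-list-item">
    <div class="a-section a-spacing-none aok-relative">
      <span class="zg-badge-text">#1</span>
    </div></span></li>
  <li class="zg-item-immersion"><span class="a-list-item">
    <div class="a-section a-spacing-none aok-relative">
      <span class="zg-badge-text">#2</span>
    </div></span></li>
  <li class="zg-item-immersion"><span class="a-list-item">
    <div class="a-section a-spacing-none aok-relative"></div></span></li>
</ol>
"""

page = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Selector from the question: the spaces make it look for <a-spacing-none>
# and <aok-relative> tags, which do not exist, so nothing matches.
broken = page.select(
    "li.zg-item-immersion > span.a-list-item > "
    "div.a-section a-spacing-none aok-relative"
)

# Classes chained with dots: matches a <div> that has all three classes.
containers = page.select(
    "li.zg-item-immersion > span.a-list-item > "
    "div.a-section.a-spacing-none.aok-relative"
)

rankings = []
for container in containers:  # loop over what was found, not range(50)
    badge = container.select_one("span.zg-badge-text")
    if badge is None:  # skip items that have no ranking badge
        continue
    rankings.append(badge.get_text(strip=True).lstrip("#"))

print(len(broken), len(containers), rankings)
# prints: 0 3 ['1', '2']
```

The same None check is worth keeping when running against the live page, since find() and select_one() return None whenever an item lacks the expected tag.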
