I'm trying to extract data from a webpage which has many of the following divs (obviously all with different data, except for the initial part):
<div data-asin="B007R2E578" data-index="0"
class="sg-col-20-of-24 s-result-item sg-col-0-of-12 sg-col-28-of-32 sg-col-16-of-20 AdHolder sg-col sg-col-32-of-36 sg-col-12-of-16 sg-col-24-of-28">
<div class="sg-col-inner">
All of those divs start identically with: <div data-asin=
I'm trying to extract all of them with the find_all function from BeautifulSoup:
structure = soup.find_all('div','data-asin=')
However, it always returns an empty list.
I don't want to use regex.
Is there any function in BeautifulSoup that can get all those divs?
You could use the CSS selector div[data-asin] (select all <div> elements where the data-asin attribute is present):
data = '''<div data-asin="B007R2E578" data-index="0"
class="sg-col-20-of-24 s-result-item sg-col-0-of-12 sg-col-28-of-32 sg-col-16-of-20 AdHolder sg-col sg-col-32-of-36 sg-col-12-of-16 sg-col-24-of-28">
<div class="sg-col-inner">
SOME DATA
</div>
</div>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'lxml')
for div in soup.select('div[data-asin]'):
    print(div['data-asin'], div.get_text(strip=True))
Prints:
B007R2E578 SOME DATA
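As for why the call in the question returns an empty list: when the second positional argument of find_all is a string, BeautifulSoup matches it against the class attribute, so find_all('div', 'data-asin=') looks for divs whose class is data-asin=. A minimal sketch of the difference, assuming a made-up two-div snippet (the sample markup and variable names are only for illustration) and the built-in html.parser:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, only for illustration.
sample = '''<div data-asin="B007R2E578" class="s-result-item">A</div>
<div class="s-result-item">B</div>'''

soup = BeautifulSoup(sample, 'html.parser')

# A string passed as the second positional argument is matched against
# the class attribute, so this finds nothing.
by_class = soup.find_all('div', 'data-asin=')

# attrs with a value of True matches any div that has the attribute,
# whatever its value. No regex is involved.
by_attr = soup.find_all('div', attrs={'data-asin': True})

print(len(by_class), len(by_attr))
print([d['data-asin'] for d in by_attr])
```

The attrs dictionary is needed here because data-asin contains a hyphen and so cannot be written as a Python keyword argument.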
EDIT: To get some data from Amazon:
from bs4 import BeautifulSoup
import requests
url = 'https://www.amazon.com/s?k=iron&ref=nb_sb_noss_2'
headers = {'User-Agent':'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:67.0) Gecko/20100101 Firefox/67.0'}
soup = BeautifulSoup(requests.get(url, headers=headers).text, 'lxml')
for div in soup.select('div[data-asin]'):
    print(div['data-asin'])
    price = div.select_one('.a-price')
    if price:
        # join the text fragments with '|' and keep only the first one
        print(price.get_text('|', strip=True).split('|')[0])
    title = div.select_one('.a-text-normal')
    if title:
        print(title.text)
Prints:
B004ILTH1K
$62.81
Rowenta DW5080 1700-Watt Micro Steam Iron Stainless Steel Soleplate with Auto-Off, 400-Hole, Brown
B00OL5P1G8
$21.99
Sunbeam Steam Master 1400 Watt Mid-size Anti-Drip Non-Stick Soleplate Iron with Variable Steam control and 8' Retractable Cord, Black/Blue, GCSBCL-202-000
...etc.
Find all the div tags, then use a list comprehension to collect the attribute value from each div that has that attribute:
html = '''<div data-asin="B007R2E578" data-index="0"
class="sg-col-20-of-24 s-result-item sg-col-0-of-12 sg-col-28-of-32 sg-col-16-of-20 AdHolder sg-col sg-col-32-of-36 sg-col-12-of-16 sg-col-24-of-28">
<div class="sg-col-inner">'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
divs = soup.find_all('div')
a_list = [ div['data-asin'] for div in divs if div.has_attr('data-asin')]
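The same check can also be pushed into find_all itself by passing a function, which BeautifulSoup calls for every tag and keeps those for which it returns True. A small sketch under assumptions: the sample markup and the helper name is_asin_div are hypothetical, chosen only to illustrate the idea.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: one div with data-asin, two without.
sample = '''<div data-asin="B007R2E578" data-index="0">
<div class="sg-col-inner">SOME DATA</div>
</div>
<div data-index="1">no asin here</div>'''

soup = BeautifulSoup(sample, 'html.parser')

def is_asin_div(tag):
    """Return True for <div> tags that carry a data-asin attribute."""
    return tag.name == 'div' and tag.has_attr('data-asin')

# find_all accepts a function as its filter and keeps the tags it accepts.
asins = [div['data-asin'] for div in soup.find_all(is_asin_div)]
print(asins)
```

This avoids building the full list of divs first, at the cost of a slightly less obvious call.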
This gives you all the divs, which you can then filter:
$('div').each(function () {
    var ele = $(this);
    // filter here, e.g. by checking ele.attr('data-asin')
});