简体   繁体   中英

Using BeautifulSoup to extract text from div

I am using the following snippet and attempting to parse a section of html from the link below, namely the div appears like:

<div id="avg-price" class="price big-price">4.02</div>
<div id="best-price" class="price big-price">0.20</div>
<div id="worst-price" class="price big-price">15.98</div>

This is the code that I am attempting to use

import requests, urllib.parse
from bs4 import BeautifulSoup, element
r = requests.get('https://herf.io/bids?search=tatuaje%20tattoo')
soup = BeautifulSoup(r.text, 'html.parser')

avgPrice = soup.find("div", {"id": "avg-price"})
lowPrice = soup.find("div", {"id": "best-price"})
highPrice = soup.find("div", {"id": "worst-price"})

print(avgPrice)
print(lowPrice)
print(highPrice)
print("Average Price: {}".format(avgPrice))
print("Low Price: {}".format(lowPrice))
print("High Price: {}".format(highPrice))

However, it does not include the price between the divs... the result looks like:

<div class="price big-price" id="avg-price"></div>
<div class="price big-price" id="best-price"></div>
<div class="price big-price" id="worst-price"></div>
Average Price: <div class="price big-price" id="avg-price"></div>
Low Price: <div class="price big-price" id="best-price"></div>
High Price: <div class="price big-price" id="worst-price"></div>

Any ideas? I'm sure i'm overlooking something small but i'm at wits end right now haha.

of course you can, but only when the data did not need to calculate by javascrip. IS NOW! In this website you can use fiddler to figure out which url did javascrip use to load data, then you can get json or other from it. This is an easy example, after i using fiddler to find out where data came from. Remember you need to set verify=False when you use fiddler cert.

import requests 

with requests.Session() as se:
    se.headers = {
        "X-Requested-With": "XMLHttpRequest",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36",
        "Referer": "https://herf.io/bids?search=tatuaje%20tattoo",
        "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
        "Accept-Encoding":"gzip, deflate, br",
        }
    data = [
        "search=tatuaje+tattoo",
        "types=",
        "sites=",
    ]

    cookies = {
        "Cookie": "connect.sid=s%3ANYNh5s6LzCVWY8yE9Gra8lxj9OGHPAK_.vGiBmTXvfF4iDScBF94YOXFDmC80PQxY%2FX9FLQ23hYI"}

    url = "https://herf.io/bids/search/open"

    price = "https://herf.io/bids/search/stats"

    req = se.post(price,data="&".join(data),cookies=cookies,verify=False)
    print(req.text)

Output

{"bottomQuarter":4.4,"topQuarter":3.31,"median":3.8,"mean":4.03,"stddev":1.44,"moe":0.08,"good":2.59,"great":1.14,"poor":5.47,"bad":6.91,"best":0.2,"worst":15.98,"count":1121}

You can strip out the text using the text attribute:

print("Average Price: {}".format(avgPrice.text))
print("Low Price: {}".format(lowPrice.text))
print("High Price: {}".format(highPrice.text))

Try

avgPrice[0].text 

For the rest, do the same.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM