
Selecting div with Beautiful Soup

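Hello, I have HTML like the one below, and when I parse it with Beautiful Soup I cannot select the text by class. I think the problem is that the nested tags are not recognized as children of the outer tag. How can I select the text of the span tag?

Thanks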

<div data-component="new_enquiry_form_app" data-props="{"isTelRequired":false,"placement":"top",}">
  <section class="enquiry-form-box__wrapper">
    <div class="enquiry-form-box enquiry-form-box--inverted"> 
      <form class="enquiry-form-box__form" tabindex="-1">
        <fieldset class="enquiry-form-box__wrapper">
          <div class="enquiry-form-box__fields">
            <div class="k-ns">
              <span class="text-gray block mt-3 font-bold text-sm">Property reference: 412</span>
            </div>
          </div>
        </fieldset>
      </form>
    </div>
  </section>

Try this:

from bs4 import BeautifulSoup

html = '''<div data-component="new_enquiry_form_app" data-props="{"isTelRequired":false,"placement":"top",}">
  <section class="enquiry-form-box__wrapper">
    <div class="enquiry-form-box enquiry-form-box--inverted"> 
      <form class="enquiry-form-box__form" tabindex="-1">
        <fieldset class="enquiry-form-box__wrapper">
          <div class="enquiry-form-box__fields">
            <div class="k-ns">
              <span class="text-gray block mt-3 font-bold text-sm">Property reference: 412</span>
            </div>
          </div>
        </fieldset>
      </form>
    </div>
  </section>'''
soup = BeautifulSoup(html, 'html.parser')
span = soup.select_one('span.text-gray.block.mt-3.font-bold.text-sm')
print(span.get_text())

Prints:

Property reference: 412
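
Nesting on its own does not stop Beautiful Soup from finding the span, and the full five-class selector is stricter than it needs to be. The following is a minimal sketch, run against a trimmed copy of the markup from the question, of three looser ways to reach the same element. The particular selectors are illustrative choices, not something the original answer prescribes.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question.
html = '''<fieldset class="enquiry-form-box__wrapper">
  <div class="enquiry-form-box__fields">
    <div class="k-ns">
      <span class="text-gray block mt-3 font-bold text-sm">Property reference: 412</span>
    </div>
  </div>
</fieldset>'''

soup = BeautifulSoup(html, 'html.parser')

# Descendant selector: matches the span however deeply it is nested.
by_descendant = soup.select_one('fieldset span.text-gray')

# Child combinator: the span must be a direct child of div.k-ns.
by_child = soup.select_one('div.k-ns > span')

# find() with class_ matches when any one of the element's classes is equal.
by_find = soup.find('span', class_='font-bold')

texts = [tag.get_text(strip=True) for tag in (by_descendant, by_child, by_find)]
print(texts)
```

If all of these return None on the live page, the span is probably not present in the HTML that was downloaded (for example because it is rendered by JavaScript), which is a different problem from nesting.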

Well, this is one way:

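from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.service import Service

# Selenium 4 style: the driver path goes through a Service object.
# The path is the one mentioned in the note below; adjust it for your machine.
driver = webdriver.Firefox(service=Service(executable_path='c:/program/geckodriver.exe'))
driver.get('https://www.kyero.com/en/property/7689206-villa-for-sale-sant-joan-de-labritja')

# find_element(By.CSS_SELECTOR, ...) replaces the older find_element_by_css_selector.
span = driver.find_element(By.CSS_SELECTOR, 'span.text-gray.block.mt-3.font-bold.text-sm')
print(span.text)
driver.quit()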

Prints:

Property reference: 412

Note that you need selenium and geckodriver; in this code, geckodriver is set to be loaded from c:/program/geckodriver.exe. The other answer, by @Andrej Kesely, was faster, so I gave a Selenium answer.

To print the reference label, you can use this script (the data is stored in a JavaScript variable inside the HTML document):

import re
import json
import requests


url = 'https://www.kyero.com/en/property/7689206-villa-for-sale-sant-joan-de-labritja'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0'}
html_text = requests.get(url, headers=headers).text
data = json.loads( re.search(r'window\.initialState = (.*);', html_text).group(1) )

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

print(data['property']['referenceLabel'])

Prints:

Property reference: 412
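
The pattern `(.*);` in the script above is greedy, so it captures up to the last semicolon on the same line. If the page ever places several statements on one line, `json.loads` would fail on the extra text. Below is a hedged sketch of a more tolerant variant that lets the JSON decoder decide where the object ends. The `html_text` fragment and the helper name `extract_initial_state` are made up for illustration, and the real page's structure may differ.

```python
import json
import re

# Hypothetical page fragment with two statements on one line.
html_text = '''<script>
window.initialState = {"property": {"referenceLabel": "Property reference: 412"}}; window.other = {"a": 1};
</script>'''


def extract_initial_state(text):
    """Return the object assigned to window.initialState, or None if absent."""
    match = re.search(r'window\.initialState\s*=\s*', text)
    if match is None:
        return None
    # raw_decode parses a single JSON value starting at the given index and
    # ignores whatever follows it (the ';' and the rest of the script).
    data, _end = json.JSONDecoder().raw_decode(text, match.end())
    return data


data = extract_initial_state(html_text)
print(data['property']['referenceLabel'])
```

This only works while the assigned value is valid JSON; a JavaScript object literal with unquoted keys or trailing commas would still need a different parser.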
