Selecting div with Beautiful Soup

Hello, I have the HTML below. When I parse it with Beautiful Soup, I am not able to select the text by class. I think the problem is that the nested tags are not recognized as children of the enclosing tag. How can I select the span tag's text?

Thanks

<div data-component="new_enquiry_form_app" data-props="{"isTelRequired":false,"placement":"top",}">
  <section class="enquiry-form-box__wrapper">
    <div class="enquiry-form-box enquiry-form-box--inverted"> 
      <form class="enquiry-form-box__form" tabindex="-1">
        <fieldset class="enquiry-form-box__wrapper">
          <div class="enquiry-form-box__fields">
            <div class="k-ns">
              <span class="text-gray block mt-3 font-bold text-sm">Property reference: 412</span>
            </div>
          </div>
        </fieldset>
      </form>
    </div>
  </section>

Try this:

from bs4 import BeautifulSoup

html = '''<div data-component="new_enquiry_form_app" data-props="{"isTelRequired":false,"placement":"top",}">
  <section class="enquiry-form-box__wrapper">
    <div class="enquiry-form-box enquiry-form-box--inverted"> 
      <form class="enquiry-form-box__form" tabindex="-1">
        <fieldset class="enquiry-form-box__wrapper">
          <div class="enquiry-form-box__fields">
            <div class="k-ns">
              <span class="text-gray block mt-3 font-bold text-sm">Property reference: 412</span>
            </div>
          </div>
        </fieldset>
      </form>
    </div>
  </section>'''
soup = BeautifulSoup(html, 'html.parser')
span = soup.select_one('span.text-gray.block.mt-3.font-bold.text-sm')
print(span.get_text())

Prints:

Property reference: 412
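
As a small follow-up to the snippet above, here is a minimal sketch of two related points: nested tags are matched at any depth, so a shorter selector can be enough, and `select_one` returns `None` when nothing matches, which is worth checking before calling `get_text()`. The trimmed markup and the helper name `first_text` are illustrative assumptions, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Trimmed-down copy of the markup from the question (illustrative only).
html = '''
<div class="enquiry-form-box__fields">
  <div class="k-ns">
    <span class="text-gray block mt-3 font-bold text-sm">Property reference: 412</span>
  </div>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')


def first_text(doc, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    tag = doc.select_one(selector)
    return tag.get_text(strip=True) if tag is not None else None


# The span is found through its parent; listing every class is not required.
print(first_text(soup, 'div.k-ns > span'))

# A selector with no match gives None instead of raising AttributeError.
print(first_text(soup, 'span.does-not-exist'))
```

If the second case is what happens against the live page, the span may simply be missing from the HTML that was downloaded, which would be consistent with the other answers here that use a browser or read the embedded JavaScript data.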

Then this is one way:

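from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.service import Service

# Selenium 4 style: the geckodriver path is passed through a Service object.
driver = webdriver.Firefox(service=Service(executable_path='c:/program/geckodriver.exe'))
driver.get('https://www.kyero.com/en/property/7689206-villa-for-sale-sant-joan-de-labritja')

# find_element_by_css_selector was removed in Selenium 4; use find_element with By.
span = driver.find_element(By.CSS_SELECTOR, 'span.text-gray.block.mt-3.font-bold.text-sm')
print(span.text)
driver.quit()  # close the browser and stop the driver process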

Prints:

Property reference: 412

Note that you need selenium and geckodriver; in this code, geckodriver is loaded from c:/program/geckodriver.exe. @Andrej Kesely was faster with the other answer, so I am giving a Selenium answer.

To print the reference label, you can use this script (the data is stored in a JavaScript variable inside the HTML document):

import re
import json
import requests


url = 'https://www.kyero.com/en/property/7689206-villa-for-sale-sant-joan-de-labritja'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0'}
html_text = requests.get(url, headers=headers).text
data = json.loads(re.search(r'window\.initialState = (.*);', html_text).group(1))

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

print(data['property']['referenceLabel'])

Prints:

Property reference: 412
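
The same regex-plus-JSON idea can be tried offline. The sketch below runs the extraction against a small hand-written page instead of the live site; the sample `html_text` and the helper name `extract_initial_state` are assumptions made for illustration, and the real page's structure may differ. It also guards against the case where the variable is not present.

```python
import json
import re

# Hypothetical page source; the structure of the real page is assumed here.
html_text = '''
<html><body>
<script>window.initialState = {"property": {"referenceLabel": "Property reference: 412"}};</script>
</body></html>
'''


def extract_initial_state(page):
    """Return the parsed window.initialState object, or None if it is absent."""
    # '.' does not match newlines, so the greedy group stops at the last ';'
    # on the same line as the assignment.
    match = re.search(r'window\.initialState = (.*);', page)
    if match is None:
        return None
    return json.loads(match.group(1))


data = extract_initial_state(html_text)
print(data['property']['referenceLabel'])
```

Because the capture is greedy, this pattern relies on the JSON being the last statement on its line; if other statements follow on the same line, the pattern would likely need adjusting.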
