
Using Beautiful Soup to Extract Nested Data in HTML

I'm building a web scraper for housing prices in the United States. An example of the data that I'm using can be found here. I'm trying to extract the data for a specific zip code (Studio: $1420, 1 Bedroom: $1560).

Here is the portion of the HTML that I am trying to extract:

<tspan x="5" y="16" class="highcharts-text-outline" fill="#000000" stroke="#000000" stroke-width="2px" stroke-linejoin="round" style="">$1420</tspan>

When I try to use BeautifulSoup4, this is what I have:

import urllib.request as urllib2
from bs4 import BeautifulSoup

# specify the url
quote_page = 'https://www.bestplaces.net/cost_of_living/zip-code/california/san_diego/92128'

# query the website and return the html to the variable ‘page’
page = urllib2.urlopen(quote_page)


soup = BeautifulSoup(page, 'html.parser')
price = soup.find('tspan', attrs={'class': 'highcharts-text-outline'})

print(price)

But this returns nothing. I am wondering how I can change my code to extract this properly.

You are trying to parse dynamic content with the urllib library, which cannot do the job: urllib only downloads the raw HTML and does not run the page's JavaScript. You need a browser simulator such as Selenium to deal with that. Here is how you can do it with Selenium:

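from contextlib import closing

from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By

# closing() calls driver.close() when the block exits
with closing(Chrome()) as driver:
    quote_page = 'https://www.bestplaces.net/cost_of_living/zip-code/california/san_diego/92128'
    driver.get(quote_page)
    # Newer Selenium 4 releases dropped find_element_by_class_name; use find_element with By
    price = driver.find_element(By.CLASS_NAME, 'highcharts-text-outline').text
    print(price)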

Output:

$1420
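If you would rather keep the BeautifulSoup parsing from the question, one option is to let Selenium render the page and then hand the rendered HTML to BeautifulSoup. The following is only a minimal sketch under some assumptions: the helper names extract_labels and fetch_rendered_html are made up for illustration, the 10-second timeout is arbitrary, and it assumes the chart labels all carry the highcharts-text-outline class shown in the question.

```python
from bs4 import BeautifulSoup


def extract_labels(html, css_class="highcharts-text-outline"):
    """Return the text of every <tspan> with css_class in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return [t.get_text(strip=True) for t in soup.find_all("tspan", class_=css_class)]


def fetch_rendered_html(url, css_class="highcharts-text-outline", timeout=10):
    """Load url in Chrome, wait until a label exists, return the rendered HTML."""
    # Imported lazily so extract_labels() works without Selenium installed.
    from selenium.webdriver import Chrome
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = Chrome()
    try:
        driver.get(url)
        # Block until the JavaScript-drawn chart has added at least one label.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CLASS_NAME, css_class))
        )
        return driver.page_source
    finally:
        driver.quit()
```

Calling extract_labels(fetch_rendered_html(quote_page)) should then give a list of label strings. The explicit wait is there because the chart may be drawn some time after the initial page load.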

You can use the text attribute:

from bs4 import BeautifulSoup as soup
s = '<tspan x="5" y="16" class="highcharts-text-outline" fill="#000000" stroke="#000000" stroke-width="2px" stroke-linejoin="round" style="">$1420</tspan>'
result = soup(s, 'lxml').find('tspan').text

Output:

u'$1420'
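Since the goal is to collect prices, it may help to turn the label text into numbers. Here is a small sketch, assuming the second price ($1560 from the question) sits in a tspan with the same class; that markup is a guess, and parse_price is a hypothetical helper name.

```python
from bs4 import BeautifulSoup


def parse_price(text):
    """Convert a label such as '$1,420' to the integer 1420."""
    return int(text.strip().lstrip("$").replace(",", ""))


# Assumed markup: the second <tspan> is illustrative, not copied from the page.
snippet = (
    '<tspan class="highcharts-text-outline">$1420</tspan>'
    '<tspan class="highcharts-text-outline">$1560</tspan>'
)
soup = BeautifulSoup(snippet, "html.parser")
prices = [
    parse_price(t.text)
    for t in soup.find_all("tspan", class_="highcharts-text-outline")
]
print(prices)  # [1420, 1560]
```

find_all returns every match, whereas find stops at the first one, so it is the better fit when you want both the Studio and the 1 Bedroom figures.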

Try this:

price = soup.find('tspan',{'class':['highcharts-text-outline']})

price.text
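Note that find returns None when nothing matches, which is what happens on the HTML fetched with urllib in the question, and price.text would then raise an AttributeError. A defensive version might look like the sketch below; find_price is a hypothetical helper name.

```python
from bs4 import BeautifulSoup


def find_price(html):
    """Return the first matching label's text, or None if it is absent."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("tspan", {"class": ["highcharts-text-outline"]})
    # Guard against a missing element instead of calling .text on None.
    return tag.text if tag is not None else None


print(find_price('<tspan class="highcharts-text-outline">$1420</tspan>'))  # $1420
print(find_price('<div>chart not rendered</div>'))  # None
```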
