
Find information in HTML tables with Beautiful Soup

I'm trying to extract information from an HTML table (found on this example page: https://www.detrasdelafachada.com/house-for-sale-marianao-havana-cuba/dcyktckvwjxhpl9 ):

<div class="row">
    <div class="col-label">
        Type of property:
    </div>
    <div class="col-datos">
        Apartment </div>
</div>
<div class="row">
    <div class="col-label">
        Building style:
    </div>
    <div class="col-datos">
        50 year </div>
</div>
<div class="row">
    <div class="col-label precio">
        Sale price:
    </div>
    <div class="col-datos precio">
        12 000 CUC </div>
</div>
<div class="row">
    <div class="col-label">
        Rooms:
    </div>
    <div class="col-datos">
        1 </div>
</div>
<div class="row">
    <div class="col-label">
        Bathrooms:
    </div>
    <div class="col-datos">
        1 </div>
</div>
<div class="row">
    <div class="col-label">
        Kitchens:
    </div>
    <div class="col-datos">
        1 </div>
</div>
<div class="row">
    <div class="col-label">
        Surface:
    </div>
    <div class="col-datos">
        38 mts2 </div>
</div>
<div class="row">
    <div class="col-label">
        Year of construction:
    </div>
    <div class="col-datos">
        1945 </div>
</div>
<div class="row">
    <div class="col-label">
        Building style:
    </div>
    <div class="col-datos">
        50 year </div>
</div>
<div class="row">
    <div class="col-label">
        Construction type:
    </div>
    <div class="col-datos">
        Masonry and plate </div>
</div>
<div class="row">
    <div class="col-label">
        Home conditions:
    </div>
    <div class="col-datos">
        Good </div>
</div>
<div class="row">
    <div class="col-label">
        Other peculiarities:
    </div>
</div>
<div class="row">

Using Beautiful Soup, how can I find the value of, say, "Building style:" (among other entries)?

My problem is that I can't simply search by class, since all entries in the table share the same div class names.

You can iterate over each row div and find the nested div values:

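from bs4 import BeautifulSoup as soup
import re
# content is assumed to hold the page's HTML source as a string
d = soup(content, 'html.parser')
# For each row div, collect the text of its nested divs, dropping newlines and runs of whitespace
results = [[re.sub(r'\s{2,}|\n+', '', i.text) for i in b.find_all('div')] for b in d.find_all('div', {'class': 'row'})]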

Output:

[['Type of property:', 'Apartment '], ['Building style:', '50 year '], ['Sale price:', '12 000 CUC '], ['Rooms:', '1 '], ['Bathrooms:', '1 '], ['Kitchens:', '1 '], ['Surface:', '38 mts2 '], ['Year of construction:', '1945 '], ['Building style:', '50 year '], ['Construction type:', 'Masonry and plate '], ['Home conditions:', 'Good '], ['Other peculiarities:'], []]

If you know that you specifically want to look for the string "Building style:", for example, you can find it and then capture the text of .next_sibling. Or just use find_next:

>>> from bs4 import BeautifulSoup
>>> html = "<c><div>hello</div> <div>hi</div></c>"
>>> soup = BeautifulSoup(html, 'html.parser')
>>> print(soup.find(string="hello").find_next('div').contents[0])
hi
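
On the page from the question, the label text is padded with newlines and indentation, so an exact string="Building style:" match would likely find nothing. Below is a minimal sketch, assuming the real page has the same shape as the markup shown in the question. It matches the label with a regular expression, walks up to the enclosing row, and reads that row's col-datos div. The shortened html sample and the helper name find_value are illustrative and not part of the original answers.

```python
import re
from bs4 import BeautifulSoup

# Shortened sample of the markup from the question
# (assumption: the live page is structured the same way).
html = """
<div class="row">
    <div class="col-label">
        Type of property:
    </div>
    <div class="col-datos">
        Apartment </div>
</div>
<div class="row">
    <div class="col-label">
        Building style:
    </div>
    <div class="col-datos">
        50 year </div>
</div>
<div class="row">
    <div class="col-label">
        Other peculiarities:
    </div>
</div>
"""

def find_value(soup, label):
    """Hypothetical helper: return the value shown next to `label`, or None."""
    # A compiled regex matches the label even when it is surrounded by whitespace.
    text_node = soup.find(string=re.compile(re.escape(label)))
    if text_node is None:
        return None
    # Climb to the enclosing row, then look for the value cell inside it.
    row = text_node.find_parent("div", class_="row")
    if row is None:
        return None
    value_div = row.find("div", class_="col-datos")
    return value_div.get_text(strip=True) if value_div else None

soup = BeautifulSoup(html, "html.parser")
print(find_value(soup, "Building style:"))
```

Looking inside the enclosing row, rather than calling find_next('div') from the label, avoids picking up the next row's content when a label such as "Other peculiarities:" has no value cell.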

If you want all of them, though, you could use .find_all to get all div tags of class "row", then grab the children of each:

from bs4 import BeautifulSoup

data = []
soup = BeautifulSoup(html, 'html.parser')  # html holds the page source as a string
for row in soup.find_all('div', class_="row"):
    # Text of every div nested in this row, with surrounding whitespace stripped
    rowdata = [c.text.strip() for c in row.find_all('div')]
    data.append(rowdata)
print(data)
# Outputs a nested list:
#   [['Type of property:', 'Apartment'], ['Building style:', '50 year'], ...]
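
If you would rather look values up by label than scan a nested list, the rows can be turned into label/value pairs. This is only a sketch under the assumption that every label sits in a div with class col-label and every value in a div with class col-datos, as in the question's markup. The function name rows_to_pairs and the shortened html sample are made up for illustration.

```python
from bs4 import BeautifulSoup

# Shortened sample of the markup from the question (illustrative only).
html = """
<div class="row">
    <div class="col-label precio">
        Sale price:
    </div>
    <div class="col-datos precio">
        12 000 CUC </div>
</div>
<div class="row">
    <div class="col-label">
        Rooms:
    </div>
    <div class="col-datos">
        1 </div>
</div>
<div class="row">
    <div class="col-label">
        Other peculiarities:
    </div>
</div>
<div class="row"></div>
"""

def rows_to_pairs(html):
    """Hypothetical helper: return a list of (label, value) tuples, one per row."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for row in soup.find_all("div", class_="row"):
        # class_ matches any one of an element's classes,
        # so "col-label precio" is found as well.
        label = row.find("div", class_="col-label")
        value = row.find("div", class_="col-datos")
        if label is None:
            continue  # skip empty rows
        pairs.append((
            label.get_text(strip=True).rstrip(":"),
            value.get_text(strip=True) if value else None,  # None when no value cell
        ))
    return pairs

pairs = rows_to_pairs(html)
details = dict(pairs)
print(details["Sale price"])
```

Note that "Building style:" appears twice in the question's table, so dict(pairs) would keep only the last occurrence. Keep the list of pairs if duplicates matter.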
