Can't figure out how to scrape data in body tag using Beautiful Soup (Python)

from bs4 import BeautifulSoup
import urllib
from openpyxl import Workbook
from openpyxl.compat import range
from openpyxl.cell import get_column_letter

r = urllib.urlopen('https://www.vrbo.com/576329').read()
soup = BeautifulSoup(r)
rate = soup.find_all('body')

print rate
print type(soup)

I'm trying to capture values in containers such as data-bedrooms="3", specifically the values given inside the quotation marks, but I have no idea what they are formally called or how to parse them.

Below is a sample of part of the printout for the "body", so I know the values are there; capturing the specific part is what I can't get:

data-ratemaximum="$260" data-rateminimum="$220" data-rateunits="night" data-rawlistingnumber="576329" data-requestuuid="73bcfaa3-9637-40a8-801c-ae86f93caf39" data-searchpdptab="C" data-serverday="18" data-showbookingphone="False"

To get the value of an attribute, use rate['attr']. For example:

from bs4 import BeautifulSoup
import urllib
from openpyxl import Workbook
from openpyxl.compat import range
from openpyxl.cell import get_column_letter

r = urllib.urlopen('https://www.vrbo.com/576329').read()
soup = BeautifulSoup(r, "html.parser")
rate = soup.find('body')

print rate['data-ratemaximum']
print rate['data-rateunits']
print rate['data-rawlistingnumber']
print rate['data-requestuuid']
print rate['data-searchpdptab']
print rate['data-serverday']
print rate['data-searchpdptab']
print rate['data-showbookingphone']

print rate
print type(soup)
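As a rough illustration of the difference between indexing a tag and asking for an attribute that may be absent, here is a minimal Python 3 sketch. The `html` string is a hypothetical stand-in for the downloaded page (it only reuses values quoted in the question), so no network access is needed. `tag[...]` raises `KeyError` when the attribute is missing, whereas `tag.get(...)` returns `None`.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page, built from values in the question.
html = (
    '<html><body data-bedrooms="3" data-ratemaximum="$260" '
    'data-rateunits="night"><p>Listing</p></body></html>'
)

soup = BeautifulSoup(html, "html.parser")
body = soup.find("body")

# Indexing raises KeyError if the attribute is not present on the tag.
rate_max = body["data-ratemaximum"]

# .get() returns None (or a supplied default) instead of raising.
phone = body.get("data-showbookingphone")

# .attrs exposes every attribute of the tag as a dictionary.
all_attrs = dict(body.attrs)

print(rate_max)
print(phone)
print(all_attrs)
```

Using `.get()` is probably the safer choice when scraping many listings, since not every page is guaranteed to carry every attribute.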

You need to pick apart your result. It might be helpful to know that the things you are looking for are called attributes of a tag in HTML:

body_tag = rate[0]
data_bedrooms = body_tag.attrs['data-bedrooms']

The code above assumes you only have one <body>; if you have more, you will need to use a for loop on rate. You may also want to convert the value to an integer with int().
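The loop and the integer conversion mentioned above could look roughly like this in Python 3. This is a sketch under assumptions: the `html` string is a hypothetical sample rather than the real page, and stripping the leading `$` before calling `int()` is one possible way to handle the price values shown in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real page would be downloaded instead.
html = '<html><body data-bedrooms="3" data-rateminimum="$220"></body></html>'
soup = BeautifulSoup(html, "html.parser")

# find_all returns a list-like result, so loop over it rather than indexing [0].
bedroom_counts = []
for tag in soup.find_all("body"):
    raw = tag.attrs.get("data-bedrooms")
    if raw is not None:
        # Attribute values are strings; convert explicitly.
        bedroom_counts.append(int(raw))

# Prices carry a currency symbol, which int() cannot parse directly.
rate_min = int(soup.body["data-rateminimum"].lstrip("$"))

print(bedroom_counts)
print(rate_min)
```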

Not sure whether you wanted only data-bedrooms from the soup object. I did some cursory checking of the output produced and was able to reason that the data-* items you mentioned are attributes rather than tags. If the document structure is consistent, you could probably locate the respective tag associated with each attribute and make finding these more efficient:

import re
# regex pattern for attribs
data_tag_pattern = re.compile(r'^data-')

# Create list of attribs
attribs_wanted = "data-bedrooms data-rateminimum data-rateunits data-rawlistingnumber data-requestuuid data-searchpdptab data-serverday data-showbookingphone".split()


# Search entire tree
for item in soup.findAll():
    # Use descendants to recurse downwards
    for child in item.descendants:
        try:
            for attribute in child.attrs:
                if data_tag_pattern.match(attribute) and attribute in attribs_wanted:
                    print("{}: {}".format(attribute, child[attribute]))
        except AttributeError:
            pass

This will produce output like so:

data-showbookingphone: False
data-bedrooms: 3
data-requestuuid: 2b6f4d21-8b04-403d-9d25-0a660802fb46
data-serverday: 18
data-rawlistingnumber: 576329
data-searchpdptab: C
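Because the nested loop above walks the descendants of every tag, deeply nested tags may be visited more than once. A single pass over all tags is one possible simplification, sketched here in Python 3. The `html` string, the `wanted` set, and the `found` dictionary are hypothetical names used only for illustration; `find_all(True)` matches every tag in the document.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with data-* attributes on more than one tag.
html = (
    '<html><body data-bedrooms="3" data-serverday="18">'
    '<div id="main" data-rawlistingnumber="576329">'
    '<span class="x">hi</span></div>'
    '</body></html>'
)
soup = BeautifulSoup(html, "html.parser")

# Attribute names we care about; a set gives fast membership checks.
wanted = {"data-bedrooms", "data-rawlistingnumber", "data-serverday"}

found = {}
for tag in soup.find_all(True):  # True matches every tag, each visited once
    for name, value in tag.attrs.items():
        if name in wanted:
            found[name] = value

for name, value in found.items():
    print("{}: {}".format(name, value))
```

Collecting the values in a dictionary rather than printing them immediately should also make it easier to write them to a spreadsheet afterwards.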

HTH!
