Can't figure out how to scrape data in the body tag using Beautiful Soup (Python)

Question

from bs4 import BeautifulSoup
import urllib
from openpyxl import Workbook
from openpyxl.compat import range
from openpyxl.cell import get_column_letter

r = urllib.urlopen('https://www.vrbo.com/576329').read()
soup = BeautifulSoup(r)
rate = soup.find_all('body')

print rate
print type(soup)

I'm trying to capture values from items such as data-bedrooms="3", specifically the values inside the quotation marks, but I don't know what these are formally called or how to parse them.

Below is a sample of the printed output for the body, so I know the values are there; capturing a specific one is what I can't manage:

data-ratemaximum="$260" data-rateminimum="$220" data-rateunits="night" data-rawlistingnumber="576329" data-requestuuid="73bcfaa3-9637-40a8-801c-ae86f93caf39" data-searchpdptab="C" data-serverday="18" data-showbookingphone="False"

Answer 1

To get the value of an attribute, use find() so that rate is a single tag rather than a list, then index it with rate['attr']. For example:

from bs4 import BeautifulSoup
import urllib
from openpyxl import Workbook
from openpyxl.compat import range
from openpyxl.cell import get_column_letter

r = urllib.urlopen('https://www.vrbo.com/576329').read()
soup = BeautifulSoup(r, "html.parser")
rate = soup.find('body')

print rate['data-ratemaximum']
print rate['data-rateunits']
print rate['data-rawlistingnumber']
print rate['data-requestuuid']
print rate['data-searchpdptab']
print rate['data-serverday']
print rate['data-searchpdptab']
print rate['data-showbookingphone']

print rate
print type(soup)
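As a small offline illustration of the same idea, here is a minimal Python 3 sketch. It assumes a hypothetical inline HTML string (SAMPLE_HTML) in place of the downloaded page, and data-bathrooms is a made-up attribute name used only to show what happens when an attribute is absent. Subscript access raises KeyError for a missing attribute, while Tag.get() returns a default instead.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; values are taken from the question.
SAMPLE_HTML = """
<html><body data-bedrooms="3" data-ratemaximum="$260" data-rateminimum="$220"
            data-rateunits="night" data-rawlistingnumber="576329">
<p>Listing text</p>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
body = soup.find("body")  # a single Tag, not a list

# Subscript access: raises KeyError if the attribute is missing.
max_rate = body["data-ratemaximum"]

# .get() returns the supplied default instead of raising.
# "data-bathrooms" is a made-up name that is not in the sample.
bathrooms = body.get("data-bathrooms", "unknown")

print(max_rate, bathrooms)
```

Using .get() may be the safer choice when not every listing page is guaranteed to carry every attribute.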

Answer 2

You need to pick apart your result. It helps to know that the things you're after are called attributes of a tag in HTML:

body_tag = rate[0]
data_bedrooms = body_tag.attrs['data-bedrooms']

The code above assumes there is only one <body>; if there are more, you will need a for loop over rate. You may also want to convert the value to an integer with int().
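The loop and the int() conversion mentioned above could look roughly like this in Python 3. This is only a sketch: SAMPLE_HTML is a hypothetical inline stand-in for the real page, and parse_rate is an illustrative helper (not part of Beautiful Soup) that assumes rates are whole-dollar strings such as "$220".

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page; values are taken from the question.
SAMPLE_HTML = (
    '<body data-bedrooms="3" data-rateminimum="$220" '
    'data-ratemaximum="$260"></body>'
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")


def parse_rate(text):
    """Illustrative helper: turn "$220" into 220 (assumes whole dollars)."""
    return int(text.lstrip("$").replace(",", ""))


listings = []
# find_all() returns a list, so loop over it rather than indexing [0].
for body_tag in soup.find_all("body"):
    listings.append({
        "bedrooms": int(body_tag.attrs["data-bedrooms"]),
        "rate_min": parse_rate(body_tag.attrs["data-rateminimum"]),
        "rate_max": parse_rate(body_tag.attrs["data-ratemaximum"]),
    })

print(listings)
```

Attribute values always come back as strings, so any numeric comparison or arithmetic needs a conversion step like this first.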

Answer 3

Not sure whether you wanted only data-bedrooms from the soup object. I did some cursory checking of the output and could see that the data-* items you mentioned are attributes rather than tags. If the document structure is consistent, you could probably locate the tag that carries each attribute and make finding these more efficient:

import re
# regex pattern for attribs
data_tag_pattern = re.compile(r'^data-')

# Create list of attribs
attribs_wanted = "data-bedrooms data-rateminimum data-rateunits data-rawlistingnumber data-requestuuid data-searchpdptab data-serverday data-showbookingphone".split()


# Search entire tree
for item in soup.findAll():
    # Use descendants to recurse downwards
    for child in item.descendants:
        try:
            for attribute in child.attrs:
                if data_tag_pattern.match(attribute) and attribute in attribs_wanted:
                    print("{}: {}".format(attribute, child[attribute]))
        except AttributeError:
            pass

This produces output like the following:

data-showbookingphone: False
data-bedrooms: 3
data-requestuuid: 2b6f4d21-8b04-403d-9d25-0a660802fb46
data-serverday: 18
data-rawlistingnumber: 576329
data-searchpdptab: C

HTH!
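One thing to be aware of with the nested loop above: because it walks item.descendants for every tag, a nested tag is visited once per ancestor, so the same attribute can be printed more than once on deeper documents. A possible single-pass variant, sketched in Python 3 under the assumption of a hypothetical inline SAMPLE_HTML and a shortened WANTED set, collects the values into a dict instead:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page; the div and its id are made up.
SAMPLE_HTML = """
<html><body data-bedrooms="3" data-serverday="18" class="pdp">
<div data-rateunits="night" id="rates">Rates</div>
</body></html>
"""

# Shortened list of attribute names, for illustration only.
WANTED = {"data-bedrooms", "data-rateunits", "data-serverday"}

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

found = {}
# find_all(True) yields every tag in the tree exactly once.
for tag in soup.find_all(True):
    for name, value in tag.attrs.items():
        if name in WANTED:
            found[name] = value

for name in sorted(found):
    print("{}: {}".format(name, found[name]))
```

A set lookup replaces the regex here, since membership in WANTED already implies the data- prefix. If the same attribute appears on several tags, this version keeps the last value seen.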

3 answers

Answer 1: score 2, 2016-06-19 03:16:49
Answer 2: score 0, accepted, 2016-06-19 02:36:50
Answer 3: score -1, 2016-06-19 03:26:22
