Beautiful Soup returns nothing

This is the HTML code:

<div xmlns="" style="box-sizing: border-box; width: 100%; margin: 0 0 10px 0; padding: 5px 10px; background: #fdc431; font-weight: bold; font-size: 14px; line-height: 20px; color: #fff;">42263 - Unencrypted Telnet Server</div>

I am trying to print 42263 - Unencrypted Telnet Server using Beautiful Soup, but the output is an empty list, i.e. [].

This is my Python code:

from bs4 import BeautifulSoup
import csv
import urllib.request as urllib2

with open(r"C:\Users\sourabhk076\Documents\CBS_1.html") as fp:
    soup = BeautifulSoup(fp.read(), 'html.parser')

divs = soup.find_all('div', attrs={'background':'#fdc431'})

print(divs)

Solution with regexes:

from bs4 import BeautifulSoup
import re

with open(r"C:\Users\sourabhk076\Documents\CBS_1.html") as fp:
    soup = BeautifulSoup(fp.read(), 'html.parser')

Let's find the div whose style attribute matches the following regular expression: background:\s*#fdc431;

\s matches a single Unicode whitespace character. I assumed there can be zero or more whitespace characters after the colon, so I added the * modifier to match zero or more repetitions of the preceding RE. It is worth reading more about regexes, as they sometimes come in handy, and I also recommend trying the pattern in an online regex tester.

div = soup.find('div', attrs={'style': re.compile(r'background:\s*#fdc431;')})

This, however, is equivalent to:

div = soup.find('div', style=re.compile(r'background:\s*#fdc431;'))

You can read about that in the official documentation of BeautifulSoup.
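As a quick check of the pattern above, here is a minimal sketch using only the standard re module. The sample style strings are invented for illustration and do not come from the original file. The sketch uses search() because Beautiful Soup applies a compiled regex to an attribute value that way.

```python
import re

# Same pattern as in the answer: "background:", optional whitespace, then the colour.
pattern = re.compile(r'background:\s*#fdc431;')

# Hypothetical style values, made up for this sketch.
samples = {
    'one space': 'padding: 5px 10px; background: #fdc431; color: #fff;',
    'no space': 'background:#fdc431;',
    'several spaces': 'background:   #fdc431;',
    'other colour': 'background: #ffffff;',
    'no trailing semicolon': 'background: #fdc431',
}

# search() looks for the pattern anywhere in the string, not only at the start.
results = {label: bool(pattern.search(style)) for label, style in samples.items()}

for label, matched in results.items():
    print(f'{label}: {matched}')
```

Note that the pattern requires the trailing semicolon, so a style in which background is the last declaration and has no `;` would not match. Dropping the `;` from the pattern is one way to loosen it.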

Also worth reading are the sections about the kinds of filters you can pass to find and other similar methods.

You can supply a string, a regular expression, a list, True, or a function, as shown by Keyur Potdar in his answer.
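The filter kinds listed above can be sketched side by side. This is only an illustration: the markup is a shortened, hypothetical version of the div from the question plus two extra tags, and the variable names are made up.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: a shortened version of the question's div plus two extra tags.
html = (
    '<div style="padding: 5px 10px; background: #fdc431;">42263 - Unencrypted Telnet Server</div>'
    '<div>no style here</div>'
    '<span style="color: red;">a span</span>'
)
soup = BeautifulSoup(html, 'html.parser')

# String: the attribute value must equal the string exactly.
by_string = soup.find_all('div', style='padding: 5px 10px; background: #fdc431;')

# Regular expression: searched for inside the attribute value.
by_regex = soup.find_all('div', style=re.compile(r'background:\s*#fdc431;'))

# List: matches any tag whose name is in the list.
by_list = soup.find_all(['div', 'span'])

# True: matches any div that has a style attribute at all.
by_true = soup.find_all('div', style=True)

# Function: receives the attribute value; guard against None for tags without the attribute.
by_function = soup.find_all(
    'div', style=lambda s: s is not None and 'background: #fdc431' in s
)

print(len(by_string), len(by_regex), len(by_list), len(by_true), len(by_function))
```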

Assuming the div exists, we can get its text with:

>>> div.text
'42263 - Unencrypted Telnet Server'
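If the div might not exist, find() returns None and div.text raises AttributeError. A minimal sketch of a guard follows; the helper name style_text and the sample markup are hypothetical and used only for illustration.

```python
import re
from bs4 import BeautifulSoup


def style_text(html, pattern):
    """Return the text of the first div whose style matches pattern, or None."""
    soup = BeautifulSoup(html, 'html.parser')
    div = soup.find('div', style=re.compile(pattern))
    # find() returns None when nothing matches, so check before reading the text.
    return div.get_text(strip=True) if div is not None else None


found = style_text(
    '<div style="background: #fdc431;">42263 - Unencrypted Telnet Server</div>',
    r'background:\s*#fdc431;',
)
missing = style_text(
    '<div style="background: #000;">other</div>',
    r'background:\s*#fdc431;',
)
print(found)
print(missing)
```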

background is not an attribute of the div tag; it is a CSS property inside the value of the style attribute. The attributes of the div tag are:

{'xmlns': '', 'style': 'box-sizing: border-box; width: 100%; margin: 0 0 10px 0; padding: 5px 10px; background: #fdc431; font-weight: bold; font-size: 14px; line-height: 20px; color: #fff;'}
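To see this for yourself, you can inspect the tag's attrs dictionary. The sketch below uses a shortened, hypothetical version of the style string. The hand-rolled split of the style value is a simplistic illustration, not a real CSS parser.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the div from the question.
html = (
    '<div xmlns="" style="padding: 5px 10px; background: #fdc431; color: #fff;">'
    '42263 - Unencrypted Telnet Server</div>'
)
soup = BeautifulSoup(html, 'html.parser')
div = soup.find('div')

# The only attributes are xmlns and style; there is no "background" attribute.
attr_names = sorted(div.attrs)
print(attr_names)

# This is why the original query finds nothing.
no_match = soup.find_all('div', attrs={'background': '#fdc431'})
print(no_match)

# Naive split of the style value into property/value pairs (illustration only).
declarations = dict(
    (name.strip(), value.strip())
    for name, value in (
        item.split(':', 1) for item in div['style'].split(';') if item.strip()
    )
)
print(declarations['background'])
```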

So, either you'll have to match the full style value exactly:

soup.find_all('div', attrs={'style': 'box-sizing: border-box; width: 100%; margin: 0 0 10px 0; padding: 5px 10px; background: #fdc431; font-weight: bold; font-size: 14px; line-height: 20px; color: #fff;'})

or, you can use a lambda function to check whether background: #fdc431 is in the style attribute value, like this:

soup = BeautifulSoup('<div xmlns="" style="box-sizing: border-box; width: 100%; margin: 0 0 10px 0; padding: 5px 10px; background: #fdc431; font-weight: bold; font-size: 14px; line-height: 20px; color: #fff;">42263 - Unencrypted Telnet Server</div>', 'html.parser')
# t.get('style', '') avoids a KeyError on divs that have no style attribute
print(soup.find(lambda t: t.name == 'div' and 'background: #fdc431' in t.get('style', '')).text)
# 42263 - Unencrypted Telnet Server

or, you can use RegEx, as shown by Jatimir in his answer.
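The three options differ in how tolerant they are of formatting changes in the style value. A rough comparison follows, using two hypothetical variants of the div (shortened for readability); the helper name matches is made up for this sketch.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical variants: the original spacing versus no space after the colons.
variants = {
    'spaced': '<div style="padding: 5px 10px; background: #fdc431;">A</div>',
    'compact': '<div style="padding:5px 10px;background:#fdc431;">B</div>',
}


def matches(html):
    """Report which of the three lookups finds the div in html."""
    soup = BeautifulSoup(html, 'html.parser')
    # 1. Exact string: the whole style value must be identical.
    exact = soup.find('div', style='padding: 5px 10px; background: #fdc431;')
    # 2. Substring test in a function filter.
    substring = soup.find(
        lambda t: t.name == 'div' and 'background: #fdc431' in t.get('style', '')
    )
    # 3. Regex that allows any amount of whitespace after the colon.
    regex = soup.find('div', style=re.compile(r'background:\s*#fdc431;'))
    return {
        'exact': exact is not None,
        'substring': substring is not None,
        'regex': regex is not None,
    }


report = {label: matches(html) for label, html in variants.items()}
for label, outcome in report.items():
    print(label, outcome)
```

In this sketch only the regex still finds the compact variant, which is the main argument for preferring it when the markup is not under your control.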
