Beautiful Soup returns nothing
This is the HTML code:
<div xmlns="" style="box-sizing: border-box; width: 100%; margin: 0 0 10px 0; padding: 5px 10px; background: #fdc431; font-weight: bold; font-size: 14px; line-height: 20px; color: #fff;">42263 - Unencrypted Telnet Server</div>
I am trying to print 42263 - Unencrypted Telnet Server using Beautiful Soup, but the output is an empty list, i.e. [].
This is my Python code:
from bs4 import BeautifulSoup
import csv
import urllib.request as urllib2
with open(r"C:\Users\sourabhk076\Documents\CBS_1.html") as fp:
    soup = BeautifulSoup(fp.read(), 'html.parser')
divs = soup.find_all('div', attrs={'background':'#fdc431'})
print(divs)
Solution with regexes:
from bs4 import BeautifulSoup
import re
with open(r"C:\Users\sourabhk076\Documents\CBS_1.html") as fp:
    soup = BeautifulSoup(fp.read(), 'html.parser')
Let's find the div whose style attribute matches the regular expression background:\s*#fdc431;. Here \s matches a single Unicode whitespace character. I assumed that there can be 0 or more whitespace characters after the colon, so I added the * modifier, which matches 0 or more repetitions of the preceding RE. It is worth reading more about regexes, as they sometimes come in handy, and an online regex tester is useful for trying patterns out.
div = soup.find('div', attrs={'style': re.compile(r'background:\s*#fdc431;')})
This, however, is equivalent to:
div = soup.find('div', style=re.compile(r'background:\s*#fdc431;'))
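To check the equivalence without the original file, here is a minimal sketch that inlines the div from the question as a string (an assumption made for illustration; the real page is read from CBS_1.html) and runs both spellings of the call.

```python
import re
from bs4 import BeautifulSoup

# Sample markup copied from the question, inlined instead of read from disk.
html = (
    '<div xmlns="" style="box-sizing: border-box; width: 100%; '
    'margin: 0 0 10px 0; padding: 5px 10px; background: #fdc431; '
    'font-weight: bold; font-size: 14px; line-height: 20px; color: #fff;">'
    '42263 - Unencrypted Telnet Server</div>'
)
soup = BeautifulSoup(html, 'html.parser')

pattern = re.compile(r'background:\s*#fdc431;')

# Both calls apply the same regex filter to the style attribute.
by_attrs = soup.find('div', attrs={'style': pattern})
by_kwarg = soup.find('div', style=pattern)

print(by_attrs is by_kwarg)  # both calls find the same Tag object
print(by_attrs.text)
```

The attrs dictionary form is mainly useful when the attribute name cannot be written as a Python keyword argument.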
You can read about that in the official documentation of BeautifulSoup. The sections about the kinds of filters you can provide to find and other similar methods are also worth reading. You can supply either a string, a regular expression, a list, True or a function, as shown by Keyur Potdar in his answer.
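As a rough illustration of those filter kinds applied to an attribute value, the sketch below uses a simplified, made-up three-div document (not the markup from the question) so that each filter selects a different set of tags.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical, simplified markup used only to contrast the filter kinds.
html = (
    '<div style="background: #fdc431; color: #fff;">42263 - Unencrypted Telnet Server</div>'
    '<div style="color: #000;">other</div>'
    '<div>no style</div>'
)
soup = BeautifulSoup(html, 'html.parser')

# String: the attribute value has to be exactly this text.
exact = soup.find_all('div', style='color: #000;')

# Regular expression: searched for anywhere inside the attribute value.
by_regex = soup.find_all('div', style=re.compile(r'background:\s*#fdc431'))

# List: matches when the value equals any one of the items.
by_list = soup.find_all('div', style=['color: #000;', 'color: #111;'])

# True: any div that has a style attribute at all.
has_style = soup.find_all('div', style=True)

# Function: receives the attribute value, which is None when it is missing.
by_func = soup.find_all('div', style=lambda v: v is not None and '#fdc431' in v)

for name, result in [('exact', exact), ('regex', by_regex), ('list', by_list),
                     ('True', has_style), ('function', by_func)]:
    print(name, [tag.text for tag in result])
```

The exact-string form is the reason a long style value has to be spelled out in full, while the regex and function forms only need the fragment of interest.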
Assuming the div exists, we can get its text with:
>>> div.text
'42263 - Unencrypted Telnet Server'
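The "assuming the div exists" caveat matters because find returns None when nothing matches, and accessing .text on None raises AttributeError. A small sketch of a guard follows; the helper name highlighted_text and the two inline snippets are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

PATTERN = re.compile(r'background:\s*#fdc431;')

def highlighted_text(html):
    """Hypothetical helper: text of the first matching div, or None."""
    soup = BeautifulSoup(html, 'html.parser')
    div = soup.find('div', style=PATTERN)
    # find() returns None when no tag matches, so check before using .text.
    return div.text if div is not None else None

found = highlighted_text(
    '<div style="background: #fdc431;">42263 - Unencrypted Telnet Server</div>'
)
missing = highlighted_text('<div style="color: #fff;">something else</div>')

print(found)
print(missing)
```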
background is not an attribute of the div tag. The attributes of the div tag are:
{'xmlns': '', 'style': 'box-sizing: border-box; width: 100%; margin: 0 0 10px 0; padding: 5px 10px; background: #fdc431; font-weight: bold; font-size: 14px; line-height: 20px; color: #fff;'}
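One way to see this for yourself is to inspect the attrs dictionary of the parsed tag. The sketch below inlines the div from the question as a string (an assumption; the original code reads it from a file) and shows why the lookup on background comes back empty.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question, inlined instead of read from disk.
html = (
    '<div xmlns="" style="box-sizing: border-box; width: 100%; '
    'margin: 0 0 10px 0; padding: 5px 10px; background: #fdc431; '
    'font-weight: bold; font-size: 14px; line-height: 20px; color: #fff;">'
    '42263 - Unencrypted Telnet Server</div>'
)
soup = BeautifulSoup(html, 'html.parser')
div = soup.find('div')

# attrs maps attribute names to values; CSS properties live inside 'style'.
print(sorted(div.attrs))
print('background' in div.attrs)
print('background: #fdc431' in div.attrs['style'])

# Filtering on a non-existent attribute therefore matches nothing.
no_match = soup.find_all('div', attrs={'background': '#fdc431'})
print(no_match)
```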
So, either you'll have to use the complete style value:
soup.find_all('div', attrs={'style': 'box-sizing: border-box; width: 100%; margin: 0 0 10px 0; padding: 5px 10px; background: #fdc431; font-weight: bold; font-size: 14px; line-height: 20px; color: #fff;'})
or you can use a lambda function to check whether background: #fdc431 is in the style attribute value, like this:
soup = BeautifulSoup('<div xmlns="" style="box-sizing: border-box; width: 100%; margin: 0 0 10px 0; padding: 5px 10px; background: #fdc431; font-weight: bold; font-size: 14px; line-height: 20px; color: #fff;">42263 - Unencrypted Telnet Server</div>', 'html.parser')
# t.get('style', '') avoids a KeyError on divs that have no style attribute
print(soup.find(lambda t: t.name == 'div' and 'background: #fdc431' in t.get('style', '')).text)
# 42263 - Unencrypted Telnet Server
Or you can use a regex, as shown by Jatimir in his answer.