Python web scraping: HTML parsing

Question

I'm trying to extract the system status messages from the Nasdaq website. Here is the relevant part of the page source:

</script>
<h2>System Status Messages</h2>
<div id='divSSTAT'>
<div class="genTable">
<table style="width: 100%">
<colgroup>
<col class="gtcol1"></col>
<col class="gtcol2"></col>
<col class="gtcol3"></col>
</colgroup>
<tr>
<th class="gtcol1" style="width: 10%">Time</th>
<th class="gtcol2" style="width: 25%">Market</th>
<th class="gtcol3">Status</th>
</tr>
<tr class='sstatNone' ><td class="tddateWidth" style="white-space: nowrap;">11:56:46 ET</td><td class="tdmarketwidth">NASDAQ<br>BX<br>Post - Trade<br>PSX<br>NASDAQ Options<br>BX Options<br>PHLX<br>NASDAQ Futures<br>ISE<br>GEMX<br>MRX</td><td valign="top">Systems are operating normally</td></tr>
</table>
</div>
</div>

I want output like this:

System Status Messages
11:56:46 Systems are operating normally

Here is what I do to extract the page content:

from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "https://www.nasdaqtrader.com/Trader.aspx?id=MarketSystemStatus"
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
soup.find_all(["h2","tr"])

This returns a lot of unwanted content. What's the best way to clean it up, especially the row that contains the actual system message? Right now it looks like this:

<tr class='sstatNone' ><td class="tddateWidth" style="white-space: nowrap;">11:56:46 ET</td><td class="tdmarketwidth">NASDAQ<br>BX<br>Post - Trade<br>PSX<br>NASDAQ Options<br>BX Options<br>PHLX<br>NASDAQ Futures<br>ISE<br>GEMX<br>MRX</td><td valign="top">Systems are operating normally</td></tr>

Thanks!

Answer 1

You can iterate over the td tags and keep the first and last cells:

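from bs4 import BeautifulSoup as soup
# content is the page HTML, e.g. the source shown in the question
s = soup(content, 'html.parser')
# the first td holds the time, the last td holds the status message
_start, *_, _end = [i.text for i in s.find_all('td')]
results = f'{s.h2.text}\n{_start} {_end}'
print(results)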

Output:

System Status Messages
11:56:46 ET Systems are operating normally
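
For reference, here is a self-contained sketch of the same idea, run against a trimmed copy of the HTML from the question rather than the live page. Scoping the search to #divSSTAT is my own addition, on the assumption that the real page has other td cells outside the status table; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (not fetched live).
content = """
<h2>System Status Messages</h2>
<div id='divSSTAT'>
<div class="genTable">
<table style="width: 100%">
<tr><th class="gtcol1">Time</th><th class="gtcol2">Market</th><th class="gtcol3">Status</th></tr>
<tr class='sstatNone'><td class="tddateWidth">11:56:46 ET</td><td class="tdmarketwidth">NASDAQ<br>BX</td><td valign="top">Systems are operating normally</td></tr>
</table>
</div>
</div>
"""

soup = BeautifulSoup(content, "html.parser")

# Only look at cells inside the status container, so td tags elsewhere
# on the page cannot end up as the first or last element.
cells = [td.get_text(strip=True) for td in soup.select("#divSSTAT td")]
first, *_, last = cells

result = f"{soup.h2.get_text(strip=True)}\n{first} {last}"
print(result)
```

If the table ever lists more than one status row, the first/last unpacking would pair the first row's time with the last row's message, so iterating over the tr elements would be the safer choice in that case.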

If you do not want ET included in the output, you can use re.sub:

import re
...
results = f'{s.h2.text}\n{re.sub(" [A-Z]+", "", _start)} {_end}'

Output:

System Status Messages
11:56:46 Systems are operating normally
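
Note that the pattern " [A-Z]+" removes a space followed by capital letters anywhere in the string. A slightly stricter sketch, assuming the timestamp always ends with a time-zone abbreviation such as ET, anchors the match to the end of the string. The helper name strip_timezone is hypothetical.

```python
import re


def strip_timezone(stamp: str) -> str:
    """Drop a trailing all-caps abbreviation such as ' ET' from a timestamp."""
    # \s+ matches the separating whitespace, [A-Z]+ the abbreviation,
    # and $ restricts the match to the end of the string.
    return re.sub(r"\s+[A-Z]+$", "", stamp)


print(strip_timezone("11:56:46 ET"))
print(strip_timezone("11:56:46"))  # left unchanged when there is no suffix
```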

Answer 2

In the code below, you could split the three combined selectors into three individual select_one('individual selector here') calls. I have combined them here just for the sake of interest. Note that longer selectors, and those using quantifiers, are slightly less performant in CSS terms.

import requests
from bs4 import BeautifulSoup as bs
url = 'https://www.nasdaqtrader.com/Trader.aspx?id=MarketSystemStatus'
res = requests.get(url)
soup = bs(res.content,'lxml')
print(' '.join([item.text for item in soup.select('#content h2:nth-of-type(1), #divSSTAT .tddateWidth, #divSSTAT td:nth-of-type(3)')]))
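
A sketch of the split variant mentioned above, with three separate select_one calls. It runs against a trimmed copy of the question's HTML instead of the live page, so the #content prefix is dropped (that id does not appear in the snippet); on the real page you would presumably keep it. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (not fetched live).
html = """
<h2>System Status Messages</h2>
<div id='divSSTAT'>
<div class="genTable">
<table style="width: 100%">
<tr><th class="gtcol1">Time</th><th class="gtcol2">Market</th><th class="gtcol3">Status</th></tr>
<tr class='sstatNone'><td class="tddateWidth">11:56:46 ET</td><td class="tdmarketwidth">NASDAQ<br>BX</td><td valign="top">Systems are operating normally</td></tr>
</table>
</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# One selector per piece of information.
heading = soup.select_one("h2")
time_cell = soup.select_one("#divSSTAT .tddateWidth")
status_cell = soup.select_one("#divSSTAT td:nth-of-type(3)")

# select_one returns None when nothing matches, so skip missing pieces
# instead of failing on .get_text().
parts = [
    tag.get_text(strip=True)
    for tag in (heading, time_cell, status_cell)
    if tag is not None
]
print(" ".join(parts))
```

Separate calls make it easier to see which selector stopped matching if the page layout changes, at the cost of three lookups instead of one.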

Answer 1: 2019-01-08 16:53:20
Answer 2: 2019-01-08 19:12:08
