
Python Web Scraping - html parsing

I'm trying to extract system status messages from the Nasdaq website. Here is the relevant part of the page source:

</script>
<h2>System Status Messages</h2>
<div id='divSSTAT'>
<div class="genTable">
<table style="width: 100%">
<colgroup>
<col class="gtcol1"></col>
<col class="gtcol2"></col>
<col class="gtcol3"></col>
</colgroup>
<tr>
<th class="gtcol1" style="width: 10%">Time</th>
<th class="gtcol2" style="width: 25%">Market</th>
<th class="gtcol3">Status</th>
</tr>
<tr class='sstatNone' ><td class="tddateWidth" style="white-space: nowrap;">11:56:46 ET</td><td class="tdmarketwidth">NASDAQ<br>BX<br>Post - Trade<br>PSX<br>NASDAQ Options<br>BX Options<br>PHLX<br>NASDAQ Futures<br>ISE<br>GEMX<br>MRX</td><td valign="top">Systems are operating normally</td></tr>
</table>
</div>
</div>

I want output like this:

System Status Messages
11:56:46 Systems are operating normally

Here is what I do to extract the page content:

from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "https://www.nasdaqtrader.com/Trader.aspx?id=MarketSystemStatus"
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
soup.find_all(["h2","tr"])

This returns a lot of unwanted content. What's the best way to clean it up, especially the row that contains the actual system message? Right now that row looks like this:

<tr class='sstatNone' ><td class="tddateWidth" style="white-space: nowrap;">11:56:46 ET</td><td class="tdmarketwidth">NASDAQ<br>BX<br>Post - Trade<br>PSX<br>NASDAQ Options<br>BX Options<br>PHLX<br>NASDAQ Futures<br>ISE<br>GEMX<br>MRX</td><td valign="top">Systems are operating normally</td></tr>

Thanks!

You can iterate over the td tags and keep the first and last cells:

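from bs4 import BeautifulSoup as soup

# `content` is the page HTML as a string, e.g. the snippet from the question
s = soup(content, 'html.parser')
# unpack the first and last <td> texts: the time and the status message
_start, *_, _end = [i.text for i in s.find_all('td')]
results = f'{s.h2.text}\n{_start} {_end}'
print(results)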

Output:

System Status Messages
11:56:46 ET Systems are operating normally
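
The unpacking above takes the first and last td it finds anywhere in the parsed HTML, which works for the snippet in the question. If the full page contains other tables (an assumption, not checked against the live site), it may be safer to restrict the search to the div with id divSSTAT and read each row separately. A minimal sketch, where SAMPLE_HTML is a trimmed copy of the quoted markup and status_rows is a hypothetical helper name:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup quoted in the question (not fetched from the live site).
SAMPLE_HTML = """
<h2>System Status Messages</h2>
<div id='divSSTAT'>
<table>
<tr><th>Time</th><th>Market</th><th>Status</th></tr>
<tr class='sstatNone'><td class="tddateWidth">11:56:46 ET</td><td class="tdmarketwidth">NASDAQ<br>BX</td><td valign="top">Systems are operating normally</td></tr>
</table>
</div>
"""


def status_rows(html: str) -> list[tuple[str, str]]:
    """Return (time, status) pairs for each data row inside #divSSTAT."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", id="divSSTAT")
    if container is None:
        return []
    rows = []
    for tr in container.find_all("tr"):
        cells = tr.find_all("td")
        # The header row uses <th>, so it has no <td> cells and is skipped.
        if len(cells) >= 3:
            rows.append((cells[0].get_text(strip=True),
                         cells[-1].get_text(strip=True)))
    return rows


for time_text, status in status_rows(SAMPLE_HTML):
    print(time_text, status)
```

Reading row by row also keeps the output sensible if the table ever lists more than one status message.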

If you do not want ET included in the output, you can use re.sub:

import re
...
results = f'{s.h2.text}\n{re.sub(" [A-Z]+", "", _start)} {_end}'

Output:

System Status Messages
11:56:46 Systems are operating normally
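
The pattern " [A-Z]+" removes a space followed by capital letters wherever it occurs in the string. A slightly stricter sketch, assuming the timestamp always ends with a short zone abbreviation such as ET, anchors the match to the end of the string. The helper name strip_timezone is hypothetical:

```python
import re


def strip_timezone(timestamp: str) -> str:
    """Remove a trailing zone abbreviation such as ' ET' from a timestamp."""
    # \s+      one or more whitespace characters before the abbreviation
    # [A-Z]{2,4}  two to four capital letters (assumed abbreviation length)
    # $        only match at the very end of the string
    return re.sub(r"\s+[A-Z]{2,4}$", "", timestamp.strip())


print(strip_timezone("11:56:46 ET"))
```

A timestamp with no abbreviation passes through unchanged, so the helper can be applied to every row without checking first.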

In the code below, the three selectors are combined into a single select() call; you could instead split them into three separate select_one('individual selector here') calls. They are combined here only for the sake of interest. Note that longer selectors, and those using positional pseudo-classes such as :nth-of-type, are slightly less performant in CSS terms.

import requests
from bs4 import BeautifulSoup as bs
url = 'https://www.nasdaqtrader.com/Trader.aspx?id=MarketSystemStatus'
res = requests.get(url)
soup = bs(res.content,'lxml')
print(' '.join([item.text for item in soup.select('#content h2:nth-of-type(1), #divSSTAT .tddateWidth, #divSSTAT td:nth-of-type(3)')]))
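
For comparison, the split into three select_one calls mentioned above might look like the following sketch. It runs against a trimmed copy of the quoted markup rather than the live page, and the wrapping element with id content is an assumption taken from the selector in the answer, not from the page source shown in the question:

```python
from bs4 import BeautifulSoup

# Trimmed sample; the outer <div id="content"> is assumed, based on the
# '#content' selector used in the answer.
SAMPLE_HTML = """
<div id="content">
<h2>System Status Messages</h2>
<div id="divSSTAT">
<table>
<tr><th>Time</th><th>Market</th><th>Status</th></tr>
<tr><td class="tddateWidth">11:56:46 ET</td><td class="tdmarketwidth">NASDAQ</td><td>Systems are operating normally</td></tr>
</table>
</div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# One select_one call per piece of the output.
heading = soup.select_one("#content h2:nth-of-type(1)")
time_cell = soup.select_one("#divSSTAT .tddateWidth")
status_cell = soup.select_one("#divSSTAT td:nth-of-type(3)")

# select_one returns None when nothing matches, so filter before reading text.
parts = [node.get_text(strip=True)
         for node in (heading, time_cell, status_cell)
         if node is not None]
print(" ".join(parts))
```

Separate calls make it easier to see which selector failed if the page layout changes, at the cost of three passes over the document instead of one.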
