Python Web Scraping - html parsing

I'm trying to extract the system status messages from the NASDAQ Trader website. Here is the relevant part of the page source:

</script>
<h2>System Status Messages</h2>
<div id='divSSTAT'>
<div class="genTable">
<table style="width: 100%">
<colgroup>
<col class="gtcol1"></col>
<col class="gtcol2"></col>
<col class="gtcol3"></col>
</colgroup>
<tr>
<th class="gtcol1" style="width: 10%">Time</th>
<th class="gtcol2" style="width: 25%">Market</th>
<th class="gtcol3">Status</th>
</tr>
<tr class='sstatNone' ><td class="tddateWidth" style="white-space: nowrap;">11:56:46 ET</td><td class="tdmarketwidth">NASDAQ<br>BX<br>Post - Trade<br>PSX<br>NASDAQ Options<br>BX Options<br>PHLX<br>NASDAQ Futures<br>ISE<br>GEMX<br>MRX</td><td valign="top">Systems are operating normally</td></tr>
</table>
</div>
</div>

I want the output to look like this:

System Status Messages
11:56:46 Systems are operating normally

Here is what I do to extract the page content:

from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "https://www.nasdaqtrader.com/Trader.aspx?id=MarketSystemStatus"
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
soup.find_all(["h2","tr"])

This returns a lot of unwanted content. What's the best way to clean it up, especially the line that contains the actual system message? Right now it looks like this:

<tr class='sstatNone' ><td class="tddateWidth" style="white-space: nowrap;">11:56:46 ET</td><td class="tdmarketwidth">NASDAQ<br>BX<br>Post - Trade<br>PSX<br>NASDAQ Options<br>BX Options<br>PHLX<br>NASDAQ Futures<br>ISE<br>GEMX<br>MRX</td><td valign="top">Systems are operating normally</td></tr>

Thanks!

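You can iterate over the td tags:

from bs4 import BeautifulSoup as soup

# `content` is the HTML fragment shown in the question, as a string
s = soup(content, 'html.parser')
# First cell is the time, last cell is the status; the market cell in between is discarded
_start, *_, _end = [i.text for i in s.find_all('td')]
results = f'{s.h2.text}\n{_start} {_end}'
print(results)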

Output:

System Status Messages
11:56:46 ET Systems are operating normally
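
The unpacking above works when content holds only the fragment from the question. On the full page there are likely other td cells, so here is a minimal sketch that restricts the search to the divSSTAT container and handles each data row separately. The helper name status_lines and the extra "unrelated" table in the sample markup are assumptions made for illustration; they are not part of the original page or answer.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the fragment from the question, plus one made-up table
# standing in for the rest of the page.
html = """
<table><tr><td>unrelated</td><td>cell</td></tr></table>
<h2>System Status Messages</h2>
<div id='divSSTAT'>
<table>
<tr><th>Time</th><th>Market</th><th>Status</th></tr>
<tr class='sstatNone'><td class="tddateWidth">11:56:46 ET</td><td class="tdmarketwidth">NASDAQ<br>BX</td><td valign="top">Systems are operating normally</td></tr>
</table>
</div>
"""


def status_lines(markup):
    """Return one 'time status' string per data row inside #divSSTAT."""
    soup = BeautifulSoup(markup, "html.parser")
    container = soup.find("div", id="divSSTAT")
    if container is None:  # layout changed or the wrong page was fetched
        return []
    lines = []
    for row in container.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) < 3:  # the header row uses <th>, so it has no <td> cells
            continue
        # Same idea as the answer: keep the first and last cell, drop the middle
        first, *_, last = (c.get_text(strip=True) for c in cells)
        lines.append(f"{first} {last}")
    return lines


print("System Status Messages")
print("\n".join(status_lines(html)))
```

Returning an empty list when the container is missing keeps the caller from crashing if the page layout changes.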

If you do not want ET included in the output, you can use re.sub:

import re
...
results = f'{s.h2.text}\n{re.sub(" [A-Z]+", "", _start)} {_end}'

Output:

System Status Messages
11:56:46 Systems are operating normally
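
The pattern " [A-Z]+" removes every space followed by capital letters, wherever it occurs in the string. That is fine for a timestamp like the one here, but a slightly stricter variant can be sketched by anchoring the match to the end of the string. The helper name strip_zone and the assumption that the zone abbreviation is two to four capital letters are mine, not the original answer's.

```python
import re


def strip_zone(timestamp):
    """Drop a trailing upper-case zone abbreviation such as ' ET'."""
    # \s+        whitespace before the abbreviation
    # [A-Z]{2,4} assumed length of the abbreviation (ET, EST, ...)
    # $          only match at the very end of the string
    return re.sub(r"\s+[A-Z]{2,4}$", "", timestamp.strip())


print(strip_zone("11:56:46 ET"))  # 11:56:46
```

A string without a trailing abbreviation is returned unchanged, so the helper is safe to apply to every row.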

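In the following, the three selectors are combined into a single select() call, purely for the sake of interest; you could equally split them into three individual select_one('individual selector here') calls. Note that longer selectors, and those using positional pseudo-classes such as :nth-of-type, are slightly less performant in CSS terms. Also note that joining the results with a space puts the heading and the message on a single line.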

import requests
from bs4 import BeautifulSoup as bs

url = 'https://www.nasdaqtrader.com/Trader.aspx?id=MarketSystemStatus'
res = requests.get(url)
soup = bs(res.content, 'lxml')
# One select() call with three comma-separated selectors: heading, time cell, status cell
selector = '#content h2:nth-of-type(1), #divSSTAT .tddateWidth, #divSSTAT td:nth-of-type(3)'
print(' '.join(item.text for item in soup.select(selector)))
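
For comparison, the split version mentioned above might look roughly like this. It is a sketch run against a trimmed copy of the fragment from the question rather than the live page, so the #content prefix is dropped (the fragment has no such element) and the built-in html.parser is used instead of lxml; both choices are assumptions for the sake of a self-contained example.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the fragment from the question
html = """
<h2>System Status Messages</h2>
<div id='divSSTAT'>
<table>
<tr><th>Time</th><th>Market</th><th>Status</th></tr>
<tr class='sstatNone'><td class="tddateWidth">11:56:46 ET</td><td class="tdmarketwidth">NASDAQ<br>BX</td><td valign="top">Systems are operating normally</td></tr>
</table>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# One select_one() call per piece; each returns the first match or None
heading = soup.select_one("h2")
time_cell = soup.select_one("#divSSTAT .tddateWidth")
status_cell = soup.select_one("#divSSTAT td:nth-of-type(3)")

if heading is None or time_cell is None or status_cell is None:
    report = "status table not found"
else:
    report = (
        f"{heading.get_text(strip=True)}\n"
        f"{time_cell.get_text(strip=True)} {status_cell.get_text(strip=True)}"
    )
print(report)
```

Keeping the three results in separate variables makes it easy to put the heading on its own line, as the question asks, and to check for a missing element before formatting.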
