简体   繁体   中英

Extract from webpage using bs4 and Python

How am I able to extract the number 1 from "Current stream number: 1" from the website below, see my attempt so far using python and bs4, not successful

the page source i am trying to scrape

<head><link href="basic.css" rel="stylesheet" type="text/css"></head>
<body>
<p><b>STATUS</b><br>
<p><b>Device information:</b><br>
Hardware type:  
Exstreamer 110
 (ID 20)<br>
<br>
Firmware: Streaming Client<br>
FW version: B2.17&nbsp;-&nbsp;31/05/2010 (dd/mm/yyyy)<br>
WEB version: 04.00<br>
Bootloader version: 99.19<br>
Setup version: 01.02<br>
Sg version: A8.05&nbsp;-&nbsp;May 31 2010<br>
Fs version: A2.05&nbsp;-&nbsp;31/05/2010 (dd/mm/yyyy)<br>
<p><b>System status:</b><br>
Ticks: 1588923494 ms<br>
Uptime: 10178858 s<br>
<p><b>Streaming status:</b><br>
Volume: 90%<br>
Shuffle:   Off<br>
Repeat:   Off<br>
Output peak level L: -63dBFS<br>
Output peak level R: -57dBFS<br>
Buffer level: 65532 bytes<br>
RTP decoder latency: 0 ms; average 0 ms<br>
Current stream number:   1   <br>
Current URL: http://listen.qkradio.com.au:8382/listen.mp3<br>
Current channel: 0<br>
Stream bitrate: 32 kbps<br>

Code:

from bs4 import BeautifulSoup
import urllib2
import lxml

SERVER = 'http://xx.xx.xx.xx:8080/ixstatus.html'
authinfo = urllib2.HTTPPasswordMgrWithDefaultRealm()
authinfo.add_password(None, SERVER, 'user', 'password')
page = 'http://xxx.xxx.xxx.xxx:8080/ixstatus.html'
handler = urllib2.HTTPBasicAuthHandler(authinfo)
myopener = urllib2.build_opener(handler)
opened = urllib2.install_opener(myopener)
output = urllib2.urlopen(page)
#print output.read()
soup = BeautifulSoup(output.read(), "lxml")
#print(soup)

print "stream number:", soup.select('Current stream number')[0].text

Your call to select makes BS4 use CSS selectors to find something that doesn't exist. an <number> inside a <stream> that's inside a <Current> element.

Since the html code has no class or id attributes which you can use to locate the data you want. Your (probably) best bet is to look through paragraphs and find sub strings like: Current stream number: some_number using regular expressions.

Here's how I'd do it:

import re
import bs4

page = "html code to scrape"

# this pattern will be used to find data we want
pattern = r'\s*Current\s+stream\s+number:\s*(\d+)'

soup = bs4.BeautifulSoup(page, 'lxml')

paragraphs = soup.findAll('p')
data = []
for para in paragraphs:
    found = re.finditer(pattern, para.text, re.IGNORECASE);

    data.extend([x.group(1) for x in found])


print(data)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM