
Extract from webpage using bs4 and Python

How can I extract the number 1 from "Current stream number: 1" on the page below? My attempt so far, using Python and bs4, has not been successful.

The page source I am trying to scrape:

<head><link href="basic.css" rel="stylesheet" type="text/css"></head>
<body>
<p><b>STATUS</b><br>
<p><b>Device information:</b><br>
Hardware type:  
Exstreamer 110
 (ID 20)<br>
<br>
Firmware: Streaming Client<br>
FW version: B2.17&nbsp;-&nbsp;31/05/2010 (dd/mm/yyyy)<br>
WEB version: 04.00<br>
Bootloader version: 99.19<br>
Setup version: 01.02<br>
Sg version: A8.05&nbsp;-&nbsp;May 31 2010<br>
Fs version: A2.05&nbsp;-&nbsp;31/05/2010 (dd/mm/yyyy)<br>
<p><b>System status:</b><br>
Ticks: 1588923494 ms<br>
Uptime: 10178858 s<br>
<p><b>Streaming status:</b><br>
Volume: 90%<br>
Shuffle:   Off<br>
Repeat:   Off<br>
Output peak level L: -63dBFS<br>
Output peak level R: -57dBFS<br>
Buffer level: 65532 bytes<br>
RTP decoder latency: 0 ms; average 0 ms<br>
Current stream number:   1   <br>
Current URL: http://listen.qkradio.com.au:8382/listen.mp3<br>
Current channel: 0<br>
Stream bitrate: 32 kbps<br>

Code:

from bs4 import BeautifulSoup
import urllib2
import lxml

SERVER = 'http://xx.xx.xx.xx:8080/ixstatus.html'
authinfo = urllib2.HTTPPasswordMgrWithDefaultRealm()
authinfo.add_password(None, SERVER, 'user', 'password')
page = 'http://xxx.xxx.xxx.xxx:8080/ixstatus.html'
handler = urllib2.HTTPBasicAuthHandler(authinfo)
myopener = urllib2.build_opener(handler)
opened = urllib2.install_opener(myopener)
output = urllib2.urlopen(page)
#print output.read()
soup = BeautifulSoup(output.read(), "lxml")
#print(soup)

print "stream number:", soup.select('Current stream number')[0].text

Your call to select makes BS4 use a CSS selector to look for something that doesn't exist: a <number> element inside a <stream> element that is itself inside a <Current> element.
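To see why the selector comes back empty, here is a minimal sketch run against a short excerpt of the page from the question. The `html` string and the choice of the built-in `html.parser` are assumptions made so the snippet runs on its own; the original code uses `lxml`.

```python
from bs4 import BeautifulSoup

# Short excerpt of the status page, used here as stand-in input.
html = "<p><b>Streaming status:</b><br>Current stream number:   1   <br></p>"
soup = BeautifulSoup(html, "html.parser")

# Spaces in a CSS selector are descendant combinators, so this asks for a
# <number> tag inside a <stream> tag inside a <Current> tag. No such tags exist.
matches = soup.select("Current stream number")
print(matches)  # an empty list, so indexing it with [0] raises IndexError
```

The text "Current stream number" is plain text inside a paragraph, not a tag name, which is why a selector cannot reach it.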

The HTML has no class or id attributes that you could use to locate the data you want, so your best bet is probably to go through the paragraphs and use a regular expression to find substrings of the form "Current stream number: some_number".

Here's how I'd do it:

import re
import bs4

page = "html code to scrape"

# this pattern will be used to find data we want
pattern = r'\s*Current\s+stream\s+number:\s*(\d+)'

soup = bs4.BeautifulSoup(page, 'lxml')

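paragraphs = soup.find_all('p')
data = []
for para in paragraphs:
    # collect every stream number matched in this paragraph's text
    found = re.finditer(pattern, para.text, re.IGNORECASE)
    data.extend(x.group(1) for x in found)

# data holds the matched numbers as strings
print(data)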
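If only the single current value is needed, a variation on the same idea is to search the whole page text once and convert the match to an integer. This is only a sketch: the `stream_number` helper name is hypothetical, the `page` string is an excerpt of the source shown in the question, and the built-in `html.parser` is used instead of `lxml` so that nothing extra has to be installed.

```python
import re
from bs4 import BeautifulSoup

# Excerpt of the status page from the question, used as stand-in input.
page = """<p><b>Streaming status:</b><br>
Volume: 90%<br>
Current stream number:   1   <br>
Current URL: http://listen.qkradio.com.au:8382/listen.mp3<br>
Current channel: 0<br>"""


def stream_number(html):
    """Return the current stream number as an int, or None if it is absent."""
    # Flatten the document to plain text, then search it once.
    text = BeautifulSoup(html, "html.parser").get_text()
    match = re.search(r"Current\s+stream\s+number:\s*(\d+)", text, re.IGNORECASE)
    return int(match.group(1)) if match else None


print(stream_number(page))
```

Returning None when nothing matches lets the caller tell "field missing" apart from a real stream number without catching an exception.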
