Read value from web page using Python

I am trying to read a value in an HTML page into a variable in a Python script. I have already figured out a way of downloading the page to a local file using urllib and could extract the value with a bash script, but I would like to try it in Python.

import urllib
urllib.urlretrieve('http://url.com', 'page.htm')

The page has this in it:

<div name="mainbody" style="font-size: x-large;margin:auto;width:33;">
<b><a href="w.cgi?hsn=10543">Plateau (19:01)</a></b>
<br/> Wired: 17.4
<br/>P10 Chard: 16.7
<br/>P1 P. Gris: 17.1
<br/>P20 Pinot Noir: 15.8-
<br/>Soil Temp : Error
<br/>Rainfall: 0.2<br/>
</div>

I need the 17.4 value from the Wired: line.

Any suggestions?

Thanks

Start by not using urlretrieve(); you want the data, not a file.

Next, use an HTML parser. BeautifulSoup is great for extracting text from HTML.

Retrieving the page with urllib2 would be:

from urllib2 import urlopen

response = urlopen('http://url.com/')

Then read the data into BeautifulSoup:

from bs4 import BeautifulSoup

soup = BeautifulSoup(response.read(), from_encoding=response.headers.getparam('charset'))

The from_encoding argument tells BeautifulSoup which encoding the web server declared for the page; if the web server did not specify one, BeautifulSoup will make an educated guess for you.
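The calls above are Python 2. As a hedged sketch of the same retrieval step on Python 3, where urllib2 became urllib.request and headers.getparam() is no longer available, the declared charset can be read with get_content_charset(). The function names fetch and declared_charset are illustrative and not part of the answer; declared_charset exists only so the header parsing can be tried without network access.

```python
from email.message import Message
from urllib.request import urlopen


def fetch(url):
    """Return (raw_bytes, declared_charset_or_None) for a URL (Python 3)."""
    with urlopen(url) as response:
        # response.headers is an email.message.Message subclass on Python 3
        return response.read(), response.headers.get_content_charset()


def declared_charset(content_type):
    """Parse the charset out of a Content-Type header value, offline."""
    headers = Message()
    headers["Content-Type"] = content_type
    # Returns the charset in lower case, or None when none is declared
    return headers.get_content_charset()
```

The charset returned by fetch() can be passed to BeautifulSoup as from_encoding in the same way; when it is None, BeautifulSoup falls back to its own detection.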

Now you can search for your data:

for line in soup.find('div', {'name': 'mainbody'}).stripped_strings:
    if 'Wired:' in line:
        value = float(line.partition('Wired:')[2])
        print value

For your demo HTML snippet that gives:

>>> for line in soup.find('div', {'name': 'mainbody'}).stripped_strings:
...     if 'Wired:' in line:
...         value = float(line.partition('Wired:')[2])
...         print value
... 
17.4
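If installing BeautifulSoup is not an option, roughly the same walk over the text of the mainbody div can be approximated with the standard library's html.parser on Python 3. This is a swapped-in technique rather than the approach above, and the names MainBodyText, read_label and SAMPLE are illustrative assumptions.

```python
from html.parser import HTMLParser

# The snippet from the question, used here as test input
SAMPLE = """<div name="mainbody" style="font-size: x-large;margin:auto;width:33;">
<b><a href="w.cgi?hsn=10543">Plateau (19:01)</a></b>
<br/> Wired: 17.4
<br/>P10 Chard: 16.7
<br/>P1 P. Gris: 17.1
<br/>P20 Pinot Noir: 15.8-
<br/>Soil Temp : Error
<br/>Rainfall: 0.2<br/>
</div>"""


class MainBodyText(HTMLParser):
    """Collect the stripped text chunks inside <div name="mainbody">."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # 0 = outside the target div, >0 = inside it
        self.strings = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            # Enter the target div, or track a nested div while inside it
            if self.depth or dict(attrs).get("name") == "mainbody":
                self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.depth and text:
            self.strings.append(text)


def read_label(html, label):
    """Return the number following `label`, or None if the label is absent."""
    parser = MainBodyText()
    parser.feed(html)
    parser.close()
    for line in parser.strings:
        if label in line:
            return float(line.partition(label)[2])
    return None


print(read_label(SAMPLE, "Wired:"))
```

Like the loop above, this raises ValueError when the text after the label is not a plain number, as with the "15.8-" and "Error" readings in the sample.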

This is called web scraping, and there is a very popular Python library for it called Beautiful Soup:

http://www.crummy.com/software/BeautifulSoup/

If you'd like to do it with urllib/urllib2 alone, you can accomplish that using regular expressions:

http://docs.python.org/2/library/re.html

With a regex, you use the surrounding context of the desired value as the key, then strip the key away. In this case you might match from "Wired: " to the next newline character, then strip away the "Wired: " and the newline character.
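A minimal sketch of that idea in Python 3 follows. It uses a capture group for the number, so the "Wired:" key does not have to be stripped in a separate step; the pattern and the name wired_value are assumptions for illustration and only cover plain decimal readings.

```python
import re

# "Wired:" is the key; the group captures the number that follows it
WIRED = re.compile(r"Wired:\s*(-?\d+(?:\.\d+)?)")


def wired_value(html):
    """Return the Wired reading as a float, or None if it is not found."""
    match = WIRED.search(html)
    return float(match.group(1)) if match else None


page = "<br/> Wired: 17.4\n<br/>P10 Chard: 16.7\n"
print(wired_value(page))
```

A regex like this is tied to the exact text layout of the page, so it will need adjusting if the markup around the value changes.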

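You can go through the file line by line, using find or a regular expression to check for the value you need. You could also consider using Scrapy to retrieve and parse the links.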
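The line-by-line variant with str.find could look roughly like this in Python 3. The helper name find_value is hypothetical; it accepts any iterable of lines, so an open file object works as well as a list.

```python
def find_value(lines, label):
    """Scan lines for `label` and return the number after it, else None."""
    for line in lines:
        pos = line.find(label)
        if pos == -1:
            continue  # label not on this line
        text = line[pos + len(label):].strip()
        try:
            return float(text)
        except ValueError:
            return None  # label found, but not followed by a plain number
    return None


page = "<br/> Wired: 17.4\n<br/>P10 Chard: 16.7\n<br/>Soil Temp : Error\n"
print(find_value(page.splitlines(), "Wired:"))
```

For the page.htm file saved in the question, the open file can be passed directly as lines. This sketch expects the value to be the last thing on its line, so a line with trailing markup such as "Rainfall: 0.2<br/>" yields None.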
