Extract HTML tags and data from text
I'm using Python 2.7 to make a simple call to a website and extract the HTML data, which I've managed with the code below.
import requests
from HTMLParser import HTMLParser

name = "Mark"
surname = "Jacobs"

def req_getPageHTML(name, surname):
    # Build the request URL from the two name parts and return the page text
    url = "http://sample.com/page.aspx&Name=" + name + "&surname=" + surname
    response = requests.get(url).text
    return response

page_code = req_getPageHTML(name, surname)
htmlp = HTMLParser()
print htmlp.feed(page_code)
The next thing that I want to do is parse this unicode response (print type(page_code) returns unicode) to extract some information from it.
Specifically, from the sample HTML below, which I can get back, I want to extract the values: the numbers that are slightly inset and prefixed with a > marker. The > does not exist in the real HTML; I added it only so that you can spot those lines easily.
...
<tr class="tr1" OnClick="lockBac();">
<td class="tdB" rowspan="2" nowrap="nowrap">1</td>
<td class="tdB" rowspan="2" nowrap="nowrap">Jacobs D <br/>Mark</td>
<td class="tdB" rowspan="2" align="Center">Math speciality</td>
<td class="tdB" rowspan="2" align="Center">Advanced User</td>
> <td class="tdB" rowspan="2" align="Center">6.95</td>
> <td class="tdB" rowspan="2" align="Center">7.9</td>
> <td class="tdB" rowspan="2" align="Center">7.9</td>
<td class="tdB" colspan="4" align="Center"></td>
<td class="tdB" rowspan="2" align="Center">English</td>
<td class="tdB" rowspan="2" align="Center">B2-B2-B2-B2-B2</td>
<td class="tdB" colspan="3" align="Center">Mathematics MATH-INFO</td>
<td class="tdB" colspan="3" align="Center">Informatics</td>
<td bgcolor="lightgreen" class="tdB" rowspan="2" align="Center"></td>
<td class="tdB" rowspan="2" align="Center">8.88</td>
<td class="tdB" rowspan="2" align="Center">Success</td>
</tr>
<tr class="tr1" OnClick="lockBac();">
<td class="tdB"></td>
<td class="tdB"></td>
<td class="tdB"></td>
<td class="tdB"></td>
> <td class="tdB">9.35</td>
> <td class="tdB"></td>
> <td class="tdB">9.35</td>
> <td class="tdB">9.4</td>
<td class="tdB"></td>
> <td class="tdB">9.4</td>
</tr>
...
These numbers are exam scores, which I will later put in a DB.
Now, I'm looking for an efficient way to extract these numbers. I would prefer to keep manually parsing the text for each element (with SUBSTR and so on) as a last resort.
I did come across HTMLParser, which as you can see is also imported into my code, but the print at the bottom returns None.
I was under the impression that I could use this library to parse the text received in response, and that there would be an easy way to specify a tag ID (or something similar) and extract the relevant information from it (as shown in the HTMLParser examples section), but I can't get the information I want from the feed method.
Maybe I'm not understanding this correctly, or maybe I'm not using the appropriate tool, which is why I also explained my goal.
I would appreciate any help in correcting me or pointing me in the right direction.
Not sure how to work with what you have tried, but I have a different method.
You can grab lxml, a Python library that helps with scraping XML and HTML. It seems Requests will also help out with this project.
import requests
from lxml import html
page = requests.get('http://www.example.com')
tree = html.fromstring(page.text)
The tree variable now contains the whole HTML document, which you can parse however you wish. Using XPath, you would have something like:
scores = tree.xpath('//td[@class="tdB"]/text()')
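Note that this XPath returns the text of every td with class tdB, so names and labels come back along with the scores. A minimal sketch of one way to narrow the result follows. It assumes, purely for illustration, that a score is any cell whose text contains a decimal point and converts to a float. The SAMPLE fragment and the extract_scores helper are hypothetical, modelled on the rows in the question. On the real page this heuristic would also pick up the 8.88 cell and would miss a whole-number score such as 10, so selecting cells by column position may be more reliable.

```python
from lxml import html

# Hypothetical fragment modelled on the sample rows in the question.
SAMPLE = """
<table>
<tr class="tr1">
<td class="tdB" rowspan="2">1</td>
<td class="tdB" rowspan="2">Jacobs D <br/>Mark</td>
<td class="tdB" rowspan="2" align="Center">6.95</td>
<td class="tdB" rowspan="2" align="Center">7.9</td>
<td class="tdB" colspan="4" align="Center"></td>
<td class="tdB" rowspan="2" align="Center">Success</td>
</tr>
<tr class="tr1">
<td class="tdB"></td>
<td class="tdB">9.35</td>
</tr>
</table>
"""


def extract_scores(page_text):
    """Return, as floats, the td.tdB cell texts that look like decimal scores."""
    tree = html.fromstring(page_text)
    scores = []
    for text in tree.xpath('//td[@class="tdB"]/text()'):
        text = text.strip()
        # Assumption: scores always contain a decimal point; this skips
        # the row index ("1") and plain labels such as "Success".
        if "." in text:
            try:
                scores.append(float(text))
            except ValueError:
                pass  # contains a dot but is not a number
    return scores


print(extract_scores(SAMPLE))
```

Empty cells produce no text node at all, so they simply do not appear in the result.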
Hope that helps.
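As a side note on the original attempt: feed() does not return the parsed data. It returns None, and the base HTMLParser class ignores everything it sees unless you subclass it and override its handler methods. The sketch below shows that pattern. The CellCollector class and the one-row input string are made up for illustration, and the import is written so that it runs on both Python 2 and Python 3.

```python
try:
    from HTMLParser import HTMLParser      # Python 2, as in the question
except ImportError:
    from html.parser import HTMLParser     # Python 3


class CellCollector(HTMLParser):
    """Collect the non-empty text found inside <td class="tdB"> cells."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "td" and ("class", "tdB") in attrs:
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        text = data.strip()
        if self.in_cell and text:
            self.cells.append(text)


parser = CellCollector()
result = parser.feed(
    '<tr><td class="tdB">6.95</td><td class="tdB"></td><td>x</td></tr>'
)
parser.close()
print(result)        # the return value of feed(), which is None
print(parser.cells)  # the data gathered by the handler methods
```

The results live on the parser object (here in parser.cells), which is why printing the return value of feed() shows None.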